
Bioinformatics Workshop 1 Sequences and Similarity Searches


• Open a web browser and type in the URL:
  – informatics.gurdon.cam.ac.uk/online/workshops
  – Bookmark this page

• Click on the link to the file:
  – useful-websites.html
  – Bookmark this page too
  – It also contains links to the example sequence files used in the workshop, and the presentations themselves

The Basic Questions

Where and how do I find something?

How do I know it’s real?

Exercise 0

Write a concise definition of what a gene is

Part 1: Structural Genomics

DNA arranged in chromosomes

Vertebrate ~ 10^9 base pairs

Chromosomes and Genes

Total of ~30,000 genes on ~20 chromosomes

1000 – 2000 genes per chromosome

locus

Gene to Protein

gene

mRNA

protein

genome

primary transcript

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

Sequence Signals

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

mRNA

MLTIL AL

Genomic Signals

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

===CACGATCGAGTC===================

promoters

enhancers

==ACGTA…………CAGTA====================

splice sites

Derivative Sequences

mRNA

capture by cloning into cDNA library

3’ EST

5’ EST

cDNA sequence

EST: single pass sequence from each end of the clone

cDNA: multiple pass sequencing over the whole length of the clone

5’ 3’

Gene Models

gene model

exons

Sequences and Genes (Accession Numbers and Names)

proteins

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

mRNAs/cDNAs

S43105.1 ‘similar to Cyclin B1 [Mus musculus]’

BT006437.1 ‘Cyclin B1 isoform 1 [Mus musculus]’

X58708.1

NM_111985.3 ‘CCNB1 Cyclin B1 [Mus musculus]’

‘Cyclin B1 isoform 2 [Mus musculus]’

gene

Gene Symbols, Names, etc.

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases: CCNB, CYCB1

A Gene-Centric View

Entrez Gene
http://www.ncbi.nlm.nih.gov

Cyclin B1

mRNAs/cDNAs: S43105.1, BT006437.1, X58708.1, NM_111985.3

proteins: AAB22970.1, AAP21245.1, CAA41545.1, NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1: protein translated from mRNA

XM_001102567.1: predicted mRNA

XP_001089765.1: predicted protein translated from predicted mRNA
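
The accession numbers on this slide follow an accession.version pattern (for example NM_001015922.1 followed later by NM_001015922.2). As a minimal sketch, the Python below splits such an identifier and looks up the prefix meanings listed above. The function names and the regular expression are illustrative assumptions, not part of any NCBI tool, and only the four prefixes named on the slide are covered.

```python
import re

# Prefix meanings as listed on the slide; other prefixes exist but are not covered here.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein translated from mRNA",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein translated from predicted mRNA",
}


def split_accession(identifier):
    """Split 'NM_001015922.2' into ('NM_001015922', 2)."""
    match = re.fullmatch(r"([A-Z]+_?\d+)\.(\d+)", identifier)
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


def describe(identifier):
    """Return a one-line description of an accession.version identifier."""
    accession, version = split_accession(identifier)
    kind = REFSEQ_PREFIXES.get(accession[:3], "other record (prefix not in the slide's list)")
    return f"{accession} version {version}: {kind}"


for ident in ["NM_001015922.1", "NM_001015922.2", "BC009638.1", "XP_001089765.1"]:
    print(describe(ident))
```

A new version number means the sequence itself changed, which is why the two NM_001015922 entries above show slightly different sequences.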

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites
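
The splicing picture can be checked with a few lines of Python. This is only a sketch: the exon and intron strings are copied from the example sequences on this slide, and the function names are made up for illustration. Real splice-site recognition is statistical and far less tidy than a literal GT...AG test.

```python
# Exons and introns taken from the example genomic sequence on the slide.
EXONS = [
    "CTACCATCCATGCTAACCATTCTAC",
    "CATTTTATACTCATGCAACGGACCGT",
    "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA",
]
INTRONS = ["GTAAGTCATCTATATCAATATTATTTCAG", "GTCAGTATTACAG"]


def build_genomic(exons, introns):
    """Interleave exons and introns: exon, intron, exon, intron, exon."""
    pieces = []
    for i, exon in enumerate(exons):
        pieces.append(exon)
        if i < len(introns):
            pieces.append(introns[i])
    return "".join(pieces)


def splice(genomic, introns):
    """Remove each intron, in order, and join the remaining exons."""
    mrna, cursor = [], 0
    for intron in introns:
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError("intron does not follow the GT...AG rule")
        start = genomic.index(intron, cursor)
        mrna.append(genomic[cursor:start])
        cursor = start + len(intron)
    mrna.append(genomic[cursor:])
    return "".join(mrna)


genomic = build_genomic(EXONS, INTRONS)
print(genomic)
print(splice(genomic, INTRONS))
```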

Gene Predictions

Given:
- coding sequence must run from ATG – STOP codon, in-frame
- introns (GT … AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some ‘possible’ splice sites are more likely than others

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence …

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model
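
The first rule above (coding sequence runs in-frame from ATG to a STOP codon) can be sketched as a very small open reading frame scanner. This is an illustration only: it looks at the three forward frames, ignores introns and the reverse strand, and the name find_orfs and the min_codons parameter are assumptions made for this example. Real gene predictors combine this with the statistical evidence described above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...STOP stretch in the 3 forward frames.

    start is the 0-based index of the A of ATG; end is one past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a reading frame at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it at the first in-frame stop
    return orfs


# Genomic example sequence from the slide.
genomic = ("CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
           "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA")
for start, end, frame in find_orfs(genomic):
    print(frame, start, end, genomic[start:end])
```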

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…)

So predicted genes based on computational gene models alone will usually lack UTRs, which has some important consequences

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model

exons 1 2 3 4

We’ve now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn’t lose sight of the fact that we don’t really know if these predicted proteins exist – especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs – millions, £10 each
- Cheap to sequence – so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs – tens of thousands, £1000 each
- Expensive – but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes – one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So What’s in the Databases Now?

NCBI, July 2005

DNA
- 15,000,000 ESTs
- 3,300,000 cDNAs

Proteins
- 2,700,000 proteins (nr)
- 950,000 proteins (RefSeq)

Part 2: Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

Gene sequence

Evolution by sequence mutation

Speciation

Gene A (before speciation)

ATGAAGGCTGCCTACGACTGCCGTG

Gene A in the first species

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A in the second species

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring – then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology – detectable similarity because of common evolutionary origin

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution homology may no longer be detectable in the DNA sequence…
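
The bar lines under the paired sequences on this slide can be recomputed directly. The sketch below (function names are illustrative) marks identical positions with '|' and reports percent identity for the two comparisons shown: the more closely related pair and the more diverged pair.

```python
def match_line(a, b):
    """Return a string with '|' where two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    return "".join("|" if x == y else " " for x, y in zip(a, b))


def percent_identity(a, b):
    """Percentage of positions that are identical."""
    return 100.0 * match_line(a, b).count("|") / len(a)


reference = "ATGCATGCTGCCAACGGATGCCCTG"
related = "ATGAAAGCCGCCTACGACAGTCCTG"    # the clearly similar sequence
diverged = "ATGGAAGGCGCTTAGGATAGTCCAG"   # the sequence after a longer period

for other in (related, diverged):
    print(reference)
    print(match_line(reference, other))
    print(other, f"{percent_identity(reference, other):.0f}% identical")
```

Note that two random DNA sequences already agree at about 25% of positions, so identity alone says little at the low end.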

Computers Can Detect Homology

In fact computers are very good at this task – the two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist’s attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

Gene duplication through speciation

The two copies of Gene A will now evolve independently, but will continue to have the ~same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function

They are PARALOGS

A

A A’

A

‘Other’-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

A’ A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

A

A

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

Codons             Letter  Amino acid
GCA GCC GCG GCT    A       alanine
TGC TGT            C       cysteine
GAC GAT            D       aspartate
GGA GGC GGG GGT    G       glycine

‘Synonymous’ or ‘silent’ mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is no evolutionary pressure against this…
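
As a small illustration of third-position redundancy, the sketch below uses only the twelve codons listed on this slide and asks which third-position changes leave the amino acid unchanged. The dictionary covers just that excerpt of the genetic code, so a variant that falls outside it (for example GAA or GAG, which encode glutamate) is simply reported as not silent within this table. The function name is an assumption for this example.

```python
# Excerpt of the genetic code, exactly the codons shown on the slide.
SLIDE_CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}


def silent_third_position_mutations(codon):
    """List third-position variants of codon that encode the same amino acid."""
    amino_acid = SLIDE_CODONS[codon]
    variants = [codon[:2] + base for base in "ACGT" if base != codon[2]]
    return [v for v in variants if SLIDE_CODONS.get(v) == amino_acid]


for codon in ("GCA", "GGT", "GAC", "TGT"):
    print(codon, SLIDE_CODONS[codon], silent_third_position_mutations(codon))
```

For alanine and glycine every third-position change is silent; for aspartate and cysteine only one of the three possible changes is.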

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

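CTATCACGAGAACCTGTG
| || | || ||
TTGTCCCGGTCGCCAGTT

LSREPV
||| ||
LSRFPV

First comparison: DNA 67% identical, protein 100% identical

Second comparison: DNA 44% identical, protein 80% identical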
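
The first comparison on this slide can be reproduced by translating both DNA sequences and comparing identity at each level. The sketch below builds the standard genetic code table from the conventional TCAG ordering; the function names are illustrative and the code handles only complete, in-frame codons.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame DNA sequence, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


ancestor = "CTATCACGAGAACCTGTG"
descendant = "CTGTCCCGTGAGCCAGTT"

print("DNA identity:    ", round(percent_identity(ancestor, descendant)), "%")
print("protein identity:", round(percent_identity(translate(ancestor),
                                                  translate(descendant))), "%")
```

All six differences between these two sequences fall in third codon positions, which is why the protein is unchanged.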

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, ‘surfeit1’ for frog and fly.

Go to the NCBI Blast home page, then ‘Align two sequences’ (bottom left, ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison:

- Go to NCBI ORF Finder
- paste sequence
- hit OrfFind
- identify longest ORF and click on it
- on the next screen hit Accept
- change View to Fasta protein
- hit View
- copy sequence to Blast2Seqs
- Do the same with the other sequence

Before you hit ‘Align’, change the ‘Program’ (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment

data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation ‘Gravin’ available

Xenopus gene: function unknown

sequence similarity

orthologs

same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine – but it is probably SHAPE not SEQUENCE that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in

frog protein

database of human proteins

best match human protein

database of frog proteins

x
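
The reciprocal best BLAST idea can be sketched once the two searches have been run and reduced to "best hit" tables. In the Python below the protein identifiers and both dictionaries are invented placeholders, standing in for the top hit of each frog protein against the human database and vice versa; running and parsing the BLAST searches themselves is not shown.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b AND b's best hit is a."""
    return sorted((a, b) for a, b in best_a_to_b.items()
                  if best_b_to_a.get(b) == a)


# Hypothetical best-hit tables (placeholder identifiers, not real results).
frog_to_human = {
    "frog_1": "human_A",
    "frog_2": "human_B",
    "frog_3": "human_A",   # a second frog protein also hits human_A best
}
human_to_frog = {
    "human_A": "frog_1",
    "human_B": "frog_2",
}

for frog, human in reciprocal_best_hits(frog_to_human, human_to_frog):
    print(frog, "<->", human)
```

Here frog_3 is not reported: its best human match prefers frog_1, which is the situation the ‘x’ in the diagram represents.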

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
| || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps

Take two random sequences with no ‘real’ similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow ‘cost free’ gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | | | | | ||| |||| ||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’

We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
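
The trade-off between matches and gap costs is what dynamic-programming aligners score. As a sketch, the function below computes the best global alignment score (Needleman-Wunsch style, score only, no traceback) with a simple linear gap cost. The scoring values are arbitrary assumptions chosen for illustration, and BLAST itself uses a different, heuristic procedure. Setting gap=0 reproduces the ‘cost free’ situation above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


# The two random sequences from the slide.
random_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
random_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

print("free gaps:     ", global_alignment_score(random_1, random_2, gap=0))
print("penalised gaps:", global_alignment_score(random_1, random_2, gap=-2))
```

With free gaps unrelated sequences can collect many matches, so the score looks impressive; with a gap penalty the same pair scores much lower.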

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively more inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program   Query sequence   Other operation                   Database sequences   Speed
BLASTn    nucleotide       none                              nucleotide           FAST
BLASTp    protein          none                              protein              FAST
BLASTx    nucleotide       6 frame translation of query      protein              SLOW
tBLASTn   protein          6 frame translation of database   nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both       nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

[Figure: the query is slid along the 1st database sequence one position at a time; the final frame shows the best gapped alignment]

query:                  CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
                             ||||||||||||||||||||||||| |||||||||||||||||||||||||
1st database sequence:       CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible ‘words’

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible ‘words’

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker

The highest scoring alignments are reported
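
The word-and-index strategy can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: words are stored as plain strings rather than numbered, positions are 0-based, and the names build_index and find_seeds are made up for this example. The database and query are the ones from the slides, with a word size of 8 as in the word list shown.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-letter word to the (sequence number, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query, index, k):
    """Return (sequence number, database position, query position) for each shared word."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((seq_id, dpos, qpos))
    return seeds


database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database, k=8)        # done once, ahead of any search
seeds = find_seeds(query, index, k=8)     # a dictionary lookup per query word
print(len(seeds), "word matches; sequences hit:", sorted({s[0] for s in seeds}))
```

Only the sequences that appear in the seed list need the expensive extension step, which is where the speed comes from.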

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
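
Why this pair is at risk of being missed can be seen by measuring the longest run of consecutive identical positions in the alignment above: a word-based search needs at least one run as long as the word size. A minimal sketch (the function name is an assumption for this example):

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


seq_1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

word_size = 11
run = longest_exact_run(seq_1, seq_2)
print("longest exact run:", run)
print("seeded by a word of size", word_size, "?", run >= word_size)
```

Lowering the word size makes such alignments findable again, at the cost of more seeds to extend and a slower search.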

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

Also “expect value” or “expectation”

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
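
The same arithmetic in code, as a sketch of the simplified model used on this slide (exact matches only, all letters equally likely). The function name is an assumption for this example, and this is not the statistical model BLAST itself uses. Note that the slide rounds 4^60 (about 1.3 x 10^36) to 10^36, so the computed 60-mer value is slightly smaller than 5.0 x 10^-28 but of the same order.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches: database size / alphabet^length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8   # ~503,566,580 letters, rounded as on the slide

for length in (1, 2, 3, 60):
    value = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: {value:.1e} expected chance matches")
```

Passing alphabet_size=20 gives the corresponding first-principles estimate for protein searches.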

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

- Expect ‘A’ every 4 nt
- Expect ‘ACC’ every 4x4x4 = 64 nt
- Expect ‘T or G’ every 2nd nt
- Expect ‘ACC[TG]’ every 64x2 nt = 128 nt
- Expect ‘TA’ every 4x4 = 16 nt
- Expect ‘ACC[TG]TA’ every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed:

- The two motifs independently have the same E-value
- To allow either means we expect twice as many
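
The answer above can be reproduced by multiplying per-position probabilities. The sketch below handles only plain letters and bracketed alternatives such as [TG], assumes all four bases are equally likely, and, like the slide, multiplies by the sequence length rather than by the exact number of start positions. The function names are assumptions for this example.

```python
def motif_probability(motif, alphabet_size=4):
    """Chance that a random position starts a match to a motif like 'ACC[TG]TA'."""
    probability = 1.0
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            close = motif.index("]", i)
            choices = len(set(motif[i + 1:close]))   # e.g. [TG] allows 2 bases
            i = close + 1
        else:
            choices = 1                              # a fixed base
            i += 1
        probability *= choices / alphabet_size
    return probability


def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of the given length."""
    return sequence_length * motif_probability(motif)


print("one site every", round(1 / motif_probability("ACC[TG]TA")), "nt")
print("expected in 10k:", round(expected_sites("ACC[TG]TA", 10_000), 1))
```

Allowing a second, equally likely motif adds its expectation to the first, which is the ‘twice as many’ in the answer.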

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = ‘AC’
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = ‘ACG’
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = ‘ACGTCGA…CTGATTCG’ (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 – about 40x larger – why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches

This would now give us additional possible alignments that would meet our ‘match’ criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length

Why not just use identity?

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what’s the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?


BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value   Verdict
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can’t get better

(larger = worse, smaller = better)

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Can be substituted by
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y
N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W
H        N Q Y
K        R Q E
R        Q K
D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
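The arithmetic behind the two estimates can be checked with a few lines of Python. This is only a sketch of the back-of-envelope reasoning on these slides, not how BLAST itself computes E-values (that relies on the scoring matrices and their statistical parameters). The function name `expected_chance_matches`, the rounded database size, and the substitution table (transcribed from the Amino Acid Substitutions slide) are assumptions made for illustration.

```python
# Substitution groups as transcribed from the "Amino Acid Substitutions" slide
# (an assumption of this sketch; real searches use BLOSUM/PAM scores instead).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

DB_LETTERS = 3.5e8   # ~347,895,532 letters in the RefSeq protein database
QUERY_LENGTH = 20    # length of the exact-match query


def expected_chance_matches(db_letters, per_position_chance, length):
    """Chance matches expected when every position must 'match' independently."""
    return db_letters * per_position_chance ** length


# Average number of acceptable substitutes per residue in the table above.
mean_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)

# Identity only: 1 acceptable residue out of 20 at each position.
naive = expected_chance_matches(DB_LETTERS, 1 / 20, QUERY_LENGTH)

# With substitutions: about 3 acceptable residues out of 20, i.e. roughly 1/7.
adjusted = expected_chance_matches(DB_LETTERS, 1 / 7, QUERY_LENGTH)

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"identity-only estimate:       {naive:.1e}")
print(f"substitution-aware estimate:  {adjusted:.1e}")
```

The gap between the two printed estimates is about nine orders of magnitude, which shows how strongly the per-position matching chance drives the result.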

Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
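As a minimal sketch of how these flags combine on the command line, the Python below assembles a blastall argument list without running anything. The helper `build_blastall_command`, the file name `query.fasta` and the database name `nr` are hypothetical, and writing the filter switch as `-F F` assumes the T/F spelling used by the legacy blastall program; check the usage message of your own installation before relying on it.

```python
import shlex


def build_blastall_command(program, query_file, database, **options):
    """Return a blastall command as a list of arguments.

    `options` maps single-letter flags from the list above (e, W, q, F, ...)
    to their values; flags that are not given keep blastall's own defaults.
    """
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in sorted(options.items()):
        command += [f"-{flag}", str(value)]
    return command


# A permissive nucleotide search: low complexity filter off, expect 100,
# word size 7, mismatch penalty -1 (file and database names are placeholders).
cmd = build_blastall_command(
    "blastn", "query.fasta", "nr", e=100, W=7, q=-1, F="F"
)
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string means it could later be handed to `subprocess.run` without any shell quoting problems.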

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options.
Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error, an extra G at position 117 in the cDNA sequence:

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

(Screenshots: low complexity filtering ON and OFF)
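A single-base difference like this one can also be spotted programmatically. The sketch below uses Python's standard `difflib` module on the two fragments from the alignment. It is a plain text comparison chosen for brevity, not the alignment method BLAST uses, and the helper name `list_differences` is made up for this example.

```python
from difflib import SequenceMatcher

# The two fragments from the Exercise 2C alignment.
CDNA = "AGAAAAGAAGAAACATGGCAATGGATCAGAA"
GENOMIC = "AGAAAAGAAGAAACATGCAATGGATCAGAA"


def list_differences(first, second):
    """Return (operation, text in first, text in second) for each difference."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


differences = list_differences(CDNA, GENOMIC)
for tag, in_cdna, in_genomic in differences:
    print(f"{tag}: cDNA has '{in_cdna}', genomic has '{in_genomic}'")

# A length difference that is not a multiple of 3 shifts the reading frame.
frame_shifted = (len(CDNA) - len(GENOMIC)) % 3 != 0
print("frame shifted:", frame_shifted)
```

Because the extra base changes the length by one, every codon downstream of it is read in a different frame, which is why translated searches of such a cDNA behave oddly.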

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
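Query strings of this kind are easy to assemble by hand, but a small helper makes the pattern explicit. This is a sketch only: the functions `organism_term` and `combine` are hypothetical names, the `[ORGN]` field tag is the one shown above, and the second organism is an arbitrary choice for illustration.

```python
def organism_term(name):
    """Build an Entrez organism restriction such as 'Danio rerio[ORGN]'."""
    return f"{name}[ORGN]"


def combine(operator, *terms):
    """Join Entrez terms with AND, OR or NOT, wrapped in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Restrict a search to sequences from either of two organisms.
query = combine(
    "OR",
    organism_term("Drosophila melanogaster"),
    organism_term("Danio rerio"),
)
print(query)
```

The printed string is what would be pasted into the Limit by entrez query box.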

Open the file example-sequences.html.
Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment.
Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties...
What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page.
Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page.
Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

Page 2: Bioinformatics Workshop 1 Sequences and Similarity Searches

The Basic Questions

Where and how do I find something

How do I know itrsquos real

Exercise 0

Write a concise definition of what a gene is

Part 1 Structural Genomics

DNA arranged in chromosomes

Vertebrate ~ 109 base pairs

Chromosomes and Genes

Total of ~30000 genes on ~20 chromosomes

1000 ndash 2000 genes per chromosome

locus

Gene to Protein~ gene

mRNA

protein

genome

primary transcript

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

Sequence Signals

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

mRNA

MLTIL AL

Genomic Signals

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

===CACGATCGAGTC===================

promoters

enhancers

==ACGTAhelliphelliphelliphellipCAGTA====================

splice sites

Derivative Sequences

mRNA

capture by cloning into cDNA library

3rsquo EST

5rsquo EST

cDNA sequence

EST single pass sequence from each end of the clone

cDNA multiple pass sequencing over whole length of the clone

5rsquo 3rsquo

Gene Models

gene modelexons

Sequences and Genes(Accession Numbers and Names)

AAB229701

AAP212451

CAA415451

NP_1877592

proteins

S431051mRNAscDNAs lsquosimilar to Cyclin B1 [mus musculus]rsquo

gene

BT0064371 lsquoCyclin B1 isoform 1 [mus musculus]rsquo

X587081

NM_1119853 lsquoCCNB1 Cyclin B1 [mus musculus]rsquo

lsquoCyclin B1 isoform 2 [mus musculus]rsquo

Gene Symbols Names Etc

Gene Symbol CCNB1

Gene Name cyclin B1 [Homo sapiens]

Description G2mitotic-specific cyclin B1

Aliases CCNB CYCB1

A Gene-Centric View

Entrez Genehttpwwwncbinlmnihgov

Cyclin B1

S431051

BT0064371

X587081

NM_1119853

AAB229701

AAP212451

CAA415451

NP_1877592

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_0010159221 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC0096381 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_0010159222 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_0010159221 protein translated from mRNA

XM_0011025671 predicted mRNA

XP_0010897651 predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAGdonor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out

Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence hellip

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)

So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences

exons 1 2 3 4

TheoreticalPredicted Sequences

genome

predicted gene modelexons 1 2 3 4

Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels

cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences

Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation

So Whatrsquos in the Databases Now

15000000ESTs

3300000cDNAs

NCBI July 2005

2700000proteins

950000proteins

nrRefSeq

DNA

Proteins

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCCTGATGCATGCTGCCAACGGCTGCCCTGATGCATGCTGCCAACGGATGCCCTGATGCATGCCGCCAACGGATGCCCTGATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100000 years in this gene sequencehellip

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCCTGATGCATGCTGCCAACGGCTGCCCTGATGCATGCTGCCAACGGATGCCCTG

Gene AATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTGATGAAGGCCGCCAACGACTGTCGTGATGAAAGCCGCCAACGACTGTCGTGATGAAAGCCGCCAACGACAGTCGTGATGAAAGCCGCCTACGACAGTCGTGATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

ATGCATGCTGCCAACGGATGCCCTG

ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |

After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip

We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin

Computers Can Detect Homology

In fact computers are very good at this task ndash the two primary challenges are

(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span

(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

GCTGACTCGTAGCGCTTAGCTAGCT

CCAACATCTAGCCAGATTAGTTAGT | || | | | |

Orthologs

A A

A Gene duplication though speciation The two copies of Gene

A will now evolve independently but will continue to have the ~same function

They are ORTHOLOGS

Paralogs

A

Gene duplication though internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A Arsquo

A

lsquoOtherrsquo-logsWhat about gene duplication after speciation

How can we describe the relationship(s) between the various copies of gene A in the two frogs

Bear in mind that understanding gene function is more important than semanticshellip

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

Arsquo A

The Essential Paradigm

1 any group of modern species can be traced back to some extinct common ancestor

A

A

2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor

3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine GCC A GCG A GCT A

TGC C cystine TGT C

GAC D aspartate GAT D

GGA G glycine GGC G GGG G GGT G

lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip

Protein Similarity Persists Longer

CTATCACGAGAACCTGTGCTATCCCGAGAACCTGTGCTATCCCGAGAACCAGTGCTATCCCGTGAACCAGTGCTATCCCGTGAGCCAGTGCTATCCCGTGAGCCAGTTCTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG

CTGTCCCGTGAGCCAGTT|| || || || || ||

LSREPV

LSREPV||||||

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67 100

44 80

Always Compare Protein Sequences

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG MNAAYDCRARMLR ||||| || || || || || || || ||||| || || | ||||||||+||ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA MKAAYDCRARILR

DNA comparison amino acid comparison

The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences

Exercise 1nucleotide vs amino acid search

Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly

Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison

Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison

Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence

Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip

Answers Exercise 1

The Essential Taskexperiment data mining

gene sequence what is its function

database of proteins in other species

Cyclin-AFoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

function known annotation lsquoGravinrsquo available

Human geneXenopus genefunction unknown

sequence similarityorthologs

same function But we know that function is largely determined by shape

similar shape

Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence

Finding OrthologsSo how do we find orthologs and can we know when we have

The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in

frog proteindatabase of human proteins

best match human protein

database of frog proteins

x

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space

Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gapshellip

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| ||| | ||||||

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop

The Downside of GapsTake two random sequences with no lsquorealrsquo similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow lsquocost freersquo gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly although the alignment has no mismatches it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo

We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip

BLAST

gtqueryAGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best

gttarget1AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGgttarget2CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG gttarget3GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT GGAAAAAAAAATACTGGCTTCC gttarget4CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

ACGATAGATCCCATCCATAAAT ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

query sequence other operation database sequences

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

BLASTn

BLASTp

BLASTx

tBLASTn

tBLASTx

ACGATAGATCCCATCCATAAAT

ACGATAGATCCCATCCATAAAT

MQWCGYRWTYQGYRW

MQWCGYRWTYQGYRW

FAST

FAST

SLOW

SLOWER

HORRIBLY

SLOW

6 fra

me

trans

latio

n

How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like thishellip

BLAST achieves its speed through two strategies

- it takes a WORD based approach- it pre-INDEXES database sequences

BLAST WORDS and INDEXING1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible lsquowordsrsquo

Build a position index of all words in the database

Analyse the Query Sequence gtquery AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible lsquowordsrsquo

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position

Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST ndashTypical OutputINPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUTQuery= (311 letters) Database NCBI Protein Reference Sequences 954378 sequences 347895532 total letters

gtgi|41055060|ref|NP_9574201| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)Expect = 6e-31 Identities = 7698 (77) Positives = 8298 (83) Gaps = 498 (4) Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip

An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip

Also ldquoexpect valueldquo or ldquoexpectationrdquo

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
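The same back-of-envelope arithmetic can be written as a few lines of Python. This is only a sketch of the naive model used on these slides (every database letter is a possible match start, all four bases equally likely, no gaps), not the statistics BLAST actually uses; the function name is made up for illustration.

```python
def expected_chance_matches(db_letters: float, query_length: int,
                            alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches under the naive model."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, as quoted above

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2}-mer: {expected:.1e} expected chance matches")
```

Note that 4^60 is really about 1.3 x 10^36, so the exact 60-mer figure comes out nearer 3.8 x 10^-28; rounding 4^60 down to 10^36 gives the 5.0 x 10^-28 quoted above. Either way it is the same order of magnitude.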

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(For RefSeq mRNA the theoretical value was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. either ACC[TG]TA or the same motif with one extra base after the [TG] position?

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10 kb by chance.

If ACC[TG]TAA is also allowed:

The two motifs independently have the same E-value (this holds when the additional base may be any nucleotide, since an unconstrained position does not make the motif any rarer). To allow either means we expect twice as many.
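The answer above can be reproduced in a few lines. A minimal sketch, assuming uniform base frequencies and, for the longer form, that the extra base may be any of the four nucleotides (that is an assumption of this example, not something stated on the slide). The helper name and the toy sequence are made up for illustration.

```python
import re


def expected_spacing(allowed_per_position, alphabet_size=4):
    """Average distance (nt) between chance occurrences of a motif."""
    spacing = 1.0
    for allowed in allowed_per_position:
        spacing *= alphabet_size / allowed
    return spacing


# ACC[TG]TA: one allowed base per position, except [TG] which allows two.
core = [1, 1, 1, 2, 1, 1]
# Assumption: the optional extra base after [TG] can be any nucleotide.
with_extra = [1, 1, 1, 2, 4, 1, 1]

promoter_length = 10_000
expected_core = promoter_length / expected_spacing(core)
expected_either = expected_core + promoter_length / expected_spacing(with_extra)
print(f"one site every {expected_spacing(core):.0f} nt")
print(f"core motif: ~{expected_core:.1f} per 10 kb, either form: ~{expected_either:.1f}")

# The bracket notation is already a valid regular expression.
motif = re.compile(r"ACC[TG]TA")
toy_sequence = "GGACCTTAGGACCGTAGGACCATA"
hit_positions = [m.start() for m in motif.finditer(toy_sequence)]
print(hit_positions)
```

In the toy sequence the first two candidate sites match (T and G at the fourth position) and the third does not (A at the fourth position).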

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
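The same scaling shows up in the real searches, not just in the theoretical values. A small sketch using only numbers already quoted in this workshop (the two database sizes, and the Expect values from the two BLAST outputs for the 60-mer):

```python
REFSEQ_LETTERS = 5.0e8   # RefSeq mRNA, ~5.0 x 10^8 letters
NR_LETTERS = 1.4e10      # nr, ~1.4 x 10^10 letters

# Expect values reported by BLAST for the same 60 nt query
OBSERVED_E_REFSEQ = 2e-26
OBSERVED_E_NR = 6e-25

size_ratio = NR_LETTERS / REFSEQ_LETTERS
observed_ratio = OBSERVED_E_NR / OBSERVED_E_REFSEQ

print(f"nr is ~{size_ratio:.0f}x bigger than RefSeq mRNA")
print(f"the observed E-value is ~{observed_ratio:.0f}x bigger")
```

The alignment and its score are identical in both searches; only the size of the haystack has changed.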

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
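To make the point concrete, here is a short sketch that computes percent identity for the two ungapped alignments above. The function name is made up for this illustration; the sequences are copied from the slide.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of aligned positions with identical letters (no gaps)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(a == b for a, b in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

for name, a, b in (("Query1", query1, match1), ("Query2", query2, match2)):
    print(f"{name}: {len(a)} nt, {percent_identity(a, b):.0f}% identity")
```

Both come out at 75%, so identity on its own cannot tell the 60 nt match from the 16 nt one. The E-value can, because it takes match length and database size into account.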

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

BLAST: similarity + probability (match length, PI, size of database…)

ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger = worse, smaller = better

E-values:  10^-5    10^-10      10^-40       10^-100      0.0
           fantasy  borderline  encouraging  pretty good  can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

A: S C
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (roughly 3 acceptable residues out of 20) rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
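The two protein estimates can be put side by side in code. Again this is only the naive first-principles model from these slides, not how BLAST computes its statistics; the function name and the "odds per position" parameter are made up for illustration.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # ~3.5 x 10^8 letters


def expected_chance_matches(db_letters: float, query_length: int,
                            odds_per_position: float) -> float:
    """Expected chance matches if each position matches 1 time in `odds_per_position`."""
    return db_letters / odds_per_position ** query_length


# Exact matching: 1 in 20 at every position of the 20-mer.
exact = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
# Allowing ~2 substitutes per residue: about 1 in 7 at every position.
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matching:        {exact:.1e}")
print(f"with substitutability: {with_substitutions:.1e}")
print("BLAST reported:        8e-04")
```

Allowing substitutions moves the estimate by about nine orders of magnitude towards the reported value. The remaining difference is left to gapped alignment and the real scoring statistics.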

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
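If you script your searches, it can help to build the command line in code rather than typing it each time. A minimal sketch that only assembles the command and prints it, using the -p, -i, -d, -e, -W and -F options listed above. The file name `query.fasta` and database name `refseq_mrna` are hypothetical, and passing T or F as the value of -F is an assumption about how the TRUE/FALSE switch is written on the command line.

```python
import shlex


def blastall_command(program, query_file, database,
                     evalue=None, word_size=None, low_complexity_filter=None):
    """Build (but do not run) a blastall command line as a list of arguments."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        args += ["-e", str(evalue)]
    if word_size is not None:
        args += ["-W", str(word_size)]
    if low_complexity_filter is not None:
        # Assumption: TRUE/FALSE options are given as T or F.
        args += ["-F", "T" if low_complexity_filter else "F"]
    return args


# Hypothetical file and database names, for illustration only.
cmd = blastall_command("blastn", "query.fasta", "refseq_mrna",
                       evalue=100, word_size=7, low_complexity_filter=False)
print(shlex.join(cmd))
```

Any option left as None is simply omitted, so blastall falls back to its default for the chosen flavour. The resulting list could be handed to `subprocess.run` on a machine where blastall and a formatted database are installed.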

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B

ON / OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON / OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

Page 3: Bioinformatics Workshop 1 Sequences and Similarity Searches

Part 1 Structural Genomics

DNA arranged in chromosomes

Vertebrate ~ 109 base pairs

Chromosomes and Genes

Total of ~30000 genes on ~20 chromosomes

1000 ndash 2000 genes per chromosome

locus

Gene to Protein~ gene

mRNA

protein

genome

primary transcript

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

Sequence Signals

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

mRNA

MLTIL AL

Genomic Signals

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

===CACGATCGAGTC===================

promoters

enhancers

==ACGTAhelliphelliphelliphellipCAGTA====================

splice sites

Derivative Sequences

mRNA

capture by cloning into cDNA library

3rsquo EST

5rsquo EST

cDNA sequence

EST single pass sequence from each end of the clone

cDNA multiple pass sequencing over whole length of the clone

5rsquo 3rsquo

Gene Models

gene modelexons

Sequences and Genes(Accession Numbers and Names)

AAB229701

AAP212451

CAA415451

NP_1877592

proteins

S431051mRNAscDNAs lsquosimilar to Cyclin B1 [mus musculus]rsquo

gene

BT0064371 lsquoCyclin B1 isoform 1 [mus musculus]rsquo

X587081

NM_1119853 lsquoCCNB1 Cyclin B1 [mus musculus]rsquo

lsquoCyclin B1 isoform 2 [mus musculus]rsquo

Gene Symbols Names Etc

Gene Symbol CCNB1

Gene Name cyclin B1 [Homo sapiens]

Description G2mitotic-specific cyclin B1

Aliases CCNB CYCB1

A Gene-Centric View

Entrez Genehttpwwwncbinlmnihgov

Cyclin B1

S431051

BT0064371

X587081

NM_1119853

AAB229701

AAP212451

CAA415451

NP_1877592

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_0010159221 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC0096381 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_0010159222 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_0010159221 protein translated from mRNA

XM_0011025671 predicted mRNA

XP_0010897651 predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAGdonor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out

Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence hellip

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)

So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences

exons 1 2 3 4

TheoreticalPredicted Sequences

genome

predicted gene modelexons 1 2 3 4

Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels

cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences

Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation

So Whatrsquos in the Databases Now

15000000ESTs

3300000cDNAs

NCBI July 2005

2700000proteins

950000proteins

nrRefSeq

DNA

Proteins

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCCTGATGCATGCTGCCAACGGCTGCCCTGATGCATGCTGCCAACGGATGCCCTGATGCATGCCGCCAACGGATGCCCTGATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100000 years in this gene sequencehellip

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCCTGATGCATGCTGCCAACGGCTGCCCTGATGCATGCTGCCAACGGATGCCCTG

Gene AATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTGATGAAGGCCGCCAACGACTGTCGTGATGAAAGCCGCCAACGACTGTCGTGATGAAAGCCGCCAACGACAGTCGTGATGAAAGCCGCCTACGACAGTCGTGATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

ATGCATGCTGCCAACGGATGCCCTG

ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |

After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip

We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin

Computers Can Detect Homology

In fact computers are very good at this task ndash the two primary challenges are

(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span

(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

GCTGACTCGTAGCGCTTAGCTAGCT

CCAACATCTAGCCAGATTAGTTAGT | || | | | |

Orthologs

A A

A Gene duplication though speciation The two copies of Gene

A will now evolve independently but will continue to have the ~same function

They are ORTHOLOGS

Paralogs

A

Gene duplication though internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A Arsquo

A

lsquoOtherrsquo-logsWhat about gene duplication after speciation

How can we describe the relationship(s) between the various copies of gene A in the two frogs

Bear in mind that understanding gene function is more important than semanticshellip

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

Arsquo A

The Essential Paradigm

1 any group of modern species can be traced back to some extinct common ancestor

A

A

2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor

3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine GCC A GCG A GCT A

TGC C cystine TGT C

GAC D aspartate GAT D

GGA G glycine GGC G GGG G GGT G

lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip

Protein Similarity Persists Longer

CTATCACGAGAACCTGTGCTATCCCGAGAACCTGTGCTATCCCGAGAACCAGTGCTATCCCGTGAACCAGTGCTATCCCGTGAGCCAGTGCTATCCCGTGAGCCAGTTCTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG

CTGTCCCGTGAGCCAGTT|| || || || || ||

LSREPV

LSREPV||||||

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67 100

44 80

Always Compare Protein Sequences

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG MNAAYDCRARMLR ||||| || || || || || || || ||||| || || | ||||||||+||ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA MKAAYDCRARILR

DNA comparison amino acid comparison

The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences

Exercise 1nucleotide vs amino acid search

Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly

Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison

Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison

Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence

Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip

Answers Exercise 1

The Essential Taskexperiment data mining

gene sequence what is its function

database of proteins in other species

Cyclin-AFoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

function known annotation lsquoGravinrsquo available

Human geneXenopus genefunction unknown

sequence similarityorthologs

same function But we know that function is largely determined by shape

similar shape

Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence

Finding OrthologsSo how do we find orthologs and can we know when we have

The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in

frog proteindatabase of human proteins

best match human protein

database of frog proteins

x

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space

Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gapshellip

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| ||| | ||||||

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches; this is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
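As a rough illustration of how a gap cost changes the picture, the sketch below scores a pair of already-aligned rows. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not BLAST's own defaults, and the second row is padded with '-' at both ends so that the overhanging letters are charged as gaps too.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # a column where one sequence has a gap
        elif a == b:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


# The 'cost free' gapped alignment of the two random sequences
# (second row padded with '-' so both rows have 40 columns).
random_a = "GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA"
random_b = "-ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG------"

free_gaps = score_alignment(random_a, random_b, gap=0)
costly_gaps = score_alignment(random_a, random_b, gap=-2)
print(free_gaps, costly_gaps)
```

With free gaps the random alignment scores 16 (one point for each of its 16 matches); once each of the 24 gap columns costs 2, the same alignment scores -32, so it no longer looks attractive.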

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively more inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program   Query sequence   Other operation                       Database sequences   Speed
BLASTn    nucleotide       (none)                                nucleotide           FAST
BLASTp    protein          (none)                                protein              FAST
BLASTx    nucleotide       6 frame translation of the query      protein              SLOW
tBLASTn   protein          6 frame translation of the database   nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both           nucleotide           HORRIBLY SLOW

Example nucleotide sequence: ACGATAGATCCCATCCATAAAT
Example protein sequence: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

[Figure: the 1st database sequence is slid along the query and the matches are counted at each possible offset. The best alignment, which needs one gap in the query, is:]

query                  CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
                            ||||||||||||||||||||||||| |||||||||||||||||||||||||
1st database sequence       CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…
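The brute-force approach described above can be sketched as follows: slide one sequence along the other and count identical positions at every offset. This is only an illustration of the slow method (ungapped, and with hypothetical function names), not how BLAST itself searches; the two strings are the first few letters of the sequences in the figure.

```python
def matches_at_offset(query, target, offset):
    """Count identical positions when target[0] sits under query[offset]."""
    count = 0
    for j, base in enumerate(target):
        i = offset + j
        if 0 <= i < len(query) and query[i] == base:
            count += 1
    return count


def best_ungapped_offset(query, target):
    """Try every possible offset; return (offset, matches) for the best one."""
    best = (0, -1)
    for offset in range(-len(target) + 1, len(query)):
        count = matches_at_offset(query, target, offset)
        if count > best[1]:
            best = (offset, count)
    return best


query = "CCGAGCTTCTCATT"   # start of the query in the figure
target = "CTTCTCATTGC"     # start of the 1st database sequence

print(best_ungapped_offset(query, target))
```

Every offset costs a full pass over the overlapping letters, and this has to be repeated for every sequence in the database, which is why the shortcuts described next matter.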

BLAST achieves its speed through two strategies

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible 'words'

Build a position index of all words in the database

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible 'words'

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker

The highest scoring alignments are reported
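The word and index idea can be sketched in a few lines. This is a simplified illustration, assuming 8-letter words as drawn on the slides and using a dictionary keyed by the word itself rather than by a word number; the function names are hypothetical and the extension step is left out.

```python
from collections import defaultdict

WORD_SIZE = 8  # the slides illustrate 8-letter words


def words(sequence, size=WORD_SIZE):
    """Yield (1-based position, word) for every overlapping word."""
    for start in range(len(sequence) - size + 1):
        yield start + 1, sequence[start:start + size]


def build_index(database, size=WORD_SIZE):
    """Map each word to the (sequence number, position) pairs where it occurs."""
    index = defaultdict(list)
    for number, sequence in database.items():
        for position, word in words(sequence, size):
            index[word].append((number, position))
    return index


def seed_hits(query, index, size=WORD_SIZE):
    """Return (query position, sequence number, database position) triples."""
    hits = []
    for q_pos, word in words(query, size):
        for number, d_pos in index.get(word, []):
            hits.append((q_pos, number, d_pos))
    return hits


# The three database sequences and the query from the slides.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database)   # done once, before any query arrives
hits = seed_hits(query, index)  # one dictionary lookup per query word
print(len(hits))
```

The index is built once per database, so each query only costs one lookup per word. Here the word at query position 2 is found at position 1 of sequence 1, and only candidates seeded in this way would go on to the more expensive extension step.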

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
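To see why this pair would be missed, a small check (a sketch with a hypothetical function name) measures the longest run of consecutive identical letters in the alignment shown and compares it with the word size.

```python
def longest_match_run(row_a, row_b):
    """Length of the longest stretch of consecutive identical columns."""
    longest = current = 0
    for a, b in zip(row_a, row_b):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest


seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

identities = sum(a == b for a, b in zip(seq_a, seq_b))
run = longest_match_run(seq_a, seq_b)
print(identities, len(seq_a), run)
```

The two sequences are identical at 50 of 56 positions, yet the longest unbroken run of matches is 10, one short of the word size of 11, so this alignment never produces a seed.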

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97
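The Identities and Gaps figures in the output can be recomputed from the aligned Query and Sbjct rows. The sketch below does just that; the function name is hypothetical, and the Positives figure is not reproduced because it would also need the substitution matrix (such as BLOSUM62) used for the search.

```python
def alignment_stats(query_row, subject_row):
    """Return (identities, gap columns, alignment length) for two aligned rows."""
    identities = gaps = 0
    for q, s in zip(query_row, subject_row):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identities += 1
    return identities, gaps, len(query_row)


# The two blocks of the alignment above, joined end to end.
query_row = ("MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL"
             "SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA")
subject_row = ("MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV"
               "SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA")

identities, gaps, length = alignment_stats(query_row, subject_row)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```

Both lines agree with the BLAST report: 76/98 (77%) identities and 4/98 (4%) gaps.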

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: a stretch of database sequence, with every chance occurrence of each query marked beneath it]

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
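The first-principles calculation above amounts to dividing the database size by the number of possible sequences of the query's length. A minimal sketch, assuming equally likely letters, exact ungapped matches only, and a hypothetical function name:

```python
def expected_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches for a query of this length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 503_566_580  # total letters quoted for 23rd July 2005

for length in (1, 2, 3, 60):
    value = expected_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2} nt query: {value:.1e} expected matches")
```

Note that 4^60 is really about 1.3 x 10^36, so the unrounded 60-mer result is about 3.8 x 10^-28, slightly smaller than the 5.0 x 10^-28 obtained above by rounding 4^60 to 10^36.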

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value for RefSeq mRNA was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
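The answer above can be reproduced with a short calculation. This is a sketch under the same simplifying assumptions as the slide (all four bases equally likely, one strand only, edge effects ignored); the function names are hypothetical and only plain letters and [..] classes are understood.

```python
def parse_motif(motif):
    """Split a motif such as 'ACC[TG]TA' into a list of allowed-base sets."""
    positions = []
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            end = motif.index("]", i)
            positions.append(set(motif[i + 1:end]))
            i = end + 1
        else:
            positions.append({motif[i]})
            i += 1
    return positions


def expected_spacing(motif, alphabet_size=4):
    """Average number of nucleotides between chance occurrences of the motif."""
    spacing = 1.0
    for allowed in parse_motif(motif):
        spacing *= alphabet_size / len(allowed)
    return spacing


def expected_hits(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of this length."""
    return sequence_length / expected_spacing(motif)


print(expected_spacing("ACC[TG]TA"), expected_hits("ACC[TG]TA", 10_000))
```

The motif is expected once every 2048 nt, which gives about 4.9 chance occurrences in 10k, i.e. the ~5 quoted in the answer.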

E-values Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

nr is ~30 x bigger

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length

BLAST a 500 nt sequence against a database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Why not just use identity?

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?


BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <-> smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Can be substituted by
A         S C
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices
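The two protein estimates can be checked side by side. This sketch simply repeats the arithmetic with an 'effective alphabet' of 20 (exact matches only) and of about 7 (each residue matching itself or roughly 2 substitutes); the function name and the idea of a single effective alphabet size are simplifications for illustration, and real BLAST statistics are derived from the scoring matrix rather than like this.

```python
def chance_matches(database_letters, query_length, effective_alphabet):
    """Expected chance matches if each position matches 1 time in effective_alphabet."""
    return database_letters / effective_alphabet ** query_length


REFSEQ_PROTEIN_LETTERS = 347_895_532  # total letters in the RefSeq protein database
QUERY_LENGTH = 20

exact_only = chance_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 20)
with_substitutions = chance_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 7)
print(f"exact matches only:      {exact_only:.1e}")
print(f"allowing substitutions:  {with_substitutions:.1e}")
```

The first figure comes out at about 3.3 x 10^-18 and the second at about 4.4 x 10^-9, a difference of roughly nine orders of magnitude from substitutability alone.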

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
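For scripted searches, the command line above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; the helper name, the query file name and the database name are hypothetical, and only flags listed above are used. Bear in mind that 'blastall' is the legacy program described in these slides, and that the newer NCBI BLAST+ programs use different command names and option spellings.

```python
def build_blastall_command(program, query_file, database, **options):
    """Return the argument list for a legacy 'blastall' run.

    Extra keyword arguments become single-letter flags, e.g. W=7 -> '-W 7'.
    """
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical file and database names; the settings are a maximum E-value
# of 100, a word size of 7 and a mismatch penalty of -1.
command = build_blastall_command(
    "blastn", "query.fasta", "refseq_rna", e=100, W=7, q=-1
)
print(" ".join(command))
```

If blastall were installed, a list like this could be handed to Python's subprocess module to run the search; building the list separately makes it easy to check exactly which parameters differ from their defaults.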

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database

What do you feel about these alignments? Re-run but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence

cDNA (117)         AGAAAAGAAGAAACATGGCAATGGATCAGAA
                   |||||||||||||||| ||||||||||||||
Genomic sequence   AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence

Run the search with the default values

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search, setting the following parameters:

Low complexity OFF
Expect 100
Word Size 7
Other advanced -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 4: Bioinformatics Workshop 1 Sequences and Similarity Searches

Chromosomes and Genes

Total of ~30000 genes on ~20 chromosomes

1000 ndash 2000 genes per chromosome

locus

Gene to Protein~ gene

mRNA

protein

genome

primary transcript

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

Sequence Signals

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

mRNA

MLTIL AL

Genomic Signals

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

===CACGATCGAGTC===================

promoters

enhancers

==ACGTAhelliphelliphelliphellipCAGTA====================

splice sites

Derivative Sequences

mRNA

capture by cloning into cDNA library

3rsquo EST

5rsquo EST

cDNA sequence

EST single pass sequence from each end of the clone

cDNA multiple pass sequencing over whole length of the clone

5rsquo 3rsquo

Gene Models

gene modelexons

Sequences and Genes(Accession Numbers and Names)

AAB229701

AAP212451

CAA415451

NP_1877592

proteins

S431051mRNAscDNAs lsquosimilar to Cyclin B1 [mus musculus]rsquo

gene

BT0064371 lsquoCyclin B1 isoform 1 [mus musculus]rsquo

X587081

NM_1119853 lsquoCCNB1 Cyclin B1 [mus musculus]rsquo

lsquoCyclin B1 isoform 2 [mus musculus]rsquo

Gene Symbols Names Etc

Gene Symbol CCNB1

Gene Name cyclin B1 [Homo sapiens]

Description G2mitotic-specific cyclin B1

Aliases CCNB CYCB1

A Gene-Centric View

Entrez Genehttpwwwncbinlmnihgov

Cyclin B1

S431051

BT0064371

X587081

NM_1119853

AAB229701

AAP212451

CAA415451

NP_1877592

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_0010159221 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC0096381 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_0010159222 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_0010159221 protein translated from mRNA

XM_0011025671 predicted mRNA

XP_0010897651 predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAGdonor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out

Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence hellip

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)

So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences

exons 1 2 3 4

TheoreticalPredicted Sequences

genome

predicted gene modelexons 1 2 3 4

Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels

cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences

Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation

So Whatrsquos in the Databases Now

15000000ESTs

3300000cDNAs

NCBI July 2005

2700000proteins

950000proteins

nrRefSeq

DNA

Proteins

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCCTGATGCATGCTGCCAACGGCTGCCCTGATGCATGCTGCCAACGGATGCCCTGATGCATGCCGCCAACGGATGCCCTGATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100000 years in this gene sequencehellip

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed to produce fertile offspring, then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences; this is what we call homology – detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
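The residual similarity in the two alignments above can be put into numbers as percent identity. The following is a minimal Python sketch, not part of the original workshop; the function name and the labels `recent` and `distant` are made up for illustration, and the sequences are assumed to be already aligned without gaps.

```python
def percent_identity(a: str, b: str) -> tuple[int, float]:
    """Count identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches, 100.0 * matches / len(a)


# Sequences copied from the slide; the labels are hypothetical.
reference = "ATGCATGCTGCCAACGGATGCCCTG"
recent = "ATGAAAGCCGCCTACGACAGTCCTG"
distant = "ATGGAAGGCGCTTAGGATAGTCCAG"

for label, other in (("recent", recent), ("distant", distant)):
    n, pct = percent_identity(reference, other)
    print(f"{label}: {n}/{len(reference)} identical ({pct:.0f}%)")
```

Bear in mind that two unrelated DNA sequences already agree at about 1 position in 4 by chance, so the second comparison is much closer to background than the raw percentage suggests.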

Computers Can Detect Homology

In fact computers are very good at this task – the two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation. The two copies of Gene A will now evolve independently but will continue to have approximately the same function.

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication.

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

A

A

A

A' A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

A

A

2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA  A  alanine
GCC  A
GCG  A
GCT  A

TGC  C  cysteine
TGT  C

GAC  D  aspartate
GAT  D

GGA  G  glycine
GGC  G
GGG  G
GGT  G

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is little or no evolutionary pressure against them…
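The redundancy in the table above can be checked mechanically. This is a small Python sketch, not part of the original slides: the dictionary holds only the twelve codons shown on the slide, and the function name is made up. It lists which third-position changes leave the amino acid unchanged, within that partial table.

```python
# Only the codons shown on the slide; this is NOT the full genetic code.
SLIDE_CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}


def silent_third_position_changes(codon: str) -> list[str]:
    """Return the other codons that differ only at position 3 and
    code for the same amino acid, according to SLIDE_CODONS."""
    amino_acid = SLIDE_CODONS[codon]
    variants = (codon[:2] + base for base in "ACGT" if base != codon[2])
    return [v for v in variants if SLIDE_CODONS.get(v) == amino_acid]


for codon in ("GCA", "GGA", "TGC", "GAC"):
    print(codon, SLIDE_CODONS[codon], silent_third_position_changes(codon))
```

For alanine and glycine every third-position change is silent; for cysteine and aspartate only one of the three possible changes is, so 'silent in the third position' is a tendency rather than a rule.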

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67% (DNA)   100% (protein)

44% (DNA)   80% (protein)

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
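To see the effect in numbers, the two DNA sequences above can be translated and compared at both levels. This is an illustrative Python sketch only: the helper names are made up, the standard genetic code is built from the usual compact TCAG-ordered string, and translation is assumed to start at the first base in frame 1.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate frame 1 of a DNA string; any trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identities(a: str, b: str) -> int:
    """Number of identical positions in two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b))


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(f"DNA:     {identities(dna_1, dna_2)}/{len(dna_1)} identical")
print(f"protein: {identities(protein_1, protein_2)}/{len(protein_1)} identical")
```

The DNA sequences agree at roughly 7 positions in 10, while the proteins agree at 11 of 13 residues, and one of the two protein differences (M to I) is a conservative change.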

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align' – this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – on the next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment data mining

gene sequence what is its function

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity: orthologs

same function, similar shape

But we know that function is largely determined by shape, which in general we cannot determine – and it is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein

database of human proteins

best match human protein

database of frog proteins

x
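The reciprocal test in the diagram above can be written down in a few lines. This is an illustration only, not a pipeline from the workshop: the protein IDs are made up, and the two dictionaries simply stand in for the top hits of the two BLAST runs (every frog protein against the human database, every human protein against the frog database).

```python
def reciprocal_best_hits(
    best_a_to_b: dict[str, str], best_b_to_a: dict[str, str]
) -> list[tuple[str, str]]:
    """Keep the pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best-hit tables, as they might be parsed from BLAST output.
frog_to_human = {"frog_p1": "human_p7", "frog_p2": "human_p9", "frog_p3": "human_p7"}
human_to_frog = {"human_p7": "frog_p1", "human_p9": "frog_p4"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_p3 also hits human_p7, but human_p7 points back to frog_p1, so only that pair survives. This is how a paralog that happens to resemble the human protein gets rejected, and also why the method needs the complete protein sets of both organisms.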

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The breaks between these sections of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it and BLAST your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as the best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| ||| | ||||||

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
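The trade-off between matches and gap costs can be made concrete with a textbook global alignment score (Needleman-Wunsch dynamic programming). This is a teaching sketch in Python, not how BLAST itself scores alignments: the scoring values (+1 match, -1 mismatch, a flat cost per gap position) are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best achievable score for aligning all of a against all of b.

    Every gap position costs `gap`; there is no separate
    gap-opening / gap-extension distinction in this sketch.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# The two random sequences from the slide.
seq_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(seq_1, seq_2, gap=0)
costly_gaps = global_alignment_score(seq_1, seq_2, gap=-2)
print("score with cost-free gaps:", free_gaps)
print("score with gap cost of 2: ", costly_gaps)
```

With cost-free gaps the score is just the largest number of bases that can be lined up in order, which is high even for random sequences; once gaps cost something, that apparent similarity largely disappears.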

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

query sequence – other operation – database sequences

BLASTn: nucleotide query against nucleotide database – FAST
BLASTp: protein query against protein database – FAST
BLASTx: nucleotide query, 6 frame translation, against protein database – SLOW
tBLASTn: protein query against nucleotide database, 6 frame translation of the database – SLOWER
tBLASTx: nucleotide query against nucleotide database, 6 frame translation of both – HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this…
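To get a feel for why the brute-force approach is slow, here is a naive Python sketch (not BLAST, and not from the original slides) that slides the query along one target sequence, counts matching bases at every possible offset and keeps the best one. The function name and the toy sequences are made up, and gaps are not considered at all.

```python
def best_ungapped_offset(query: str, target: str) -> tuple[int, int]:
    """Try every relative offset of query against target (no gaps).

    Returns (matches, offset) for the best offset, where offset is the
    position in target that lines up with query[0]; it can be negative
    when the query overhangs the start of the target.
    """
    best = (0, 0)
    for offset in range(-(len(query) - 1), len(target)):
        matches = 0
        for i, base in enumerate(query):
            j = offset + i
            if 0 <= j < len(target) and target[j] == base:
                matches += 1
        if matches > best[0]:
            best = (matches, offset)
    return best


print(best_ungapped_offset("GATTACA", "CCGATTACACC"))
print(best_ungapped_offset("AAGATT", "GATTCC"))
```

The work per target is roughly (query length) x (target length) comparisons, and this has to be repeated for every sequence in the database, before gaps are even allowed.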

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
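The word-and-index idea can be sketched in a few lines of Python. This is a toy illustration and not BLAST's actual data structure: words are stored directly as strings rather than being converted to numbers, the word size of 8 follows the slide's example, positions are 0-based (the slides count from 1), and all function names are made up.

```python
from collections import defaultdict

WORD_SIZE = 8  # the slide's example uses 8-letter words


def build_index(database: dict[str, str], word_size: int = WORD_SIZE):
    """Map every word in every database sequence to its (sequence, position) list."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query: str, index, word_size: int = WORD_SIZE):
    """Look up each query word; return (sequence, database_pos, query_pos) hits."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, dpos, qpos))
    return seeds


# The three database sequences and the query from the slides.
database = {
    "1": "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "2": "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "3": "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database)          # done once, before any search
seeds = find_seeds(query, index)       # one dictionary lookup per query word
hit_sequences = sorted({name for name, _, _ in seeds})
print("sequences sharing at least one word with the query:", hit_sequences)
print("first seed (sequence, database position, query position):", seeds[0])
```

The index is built once per database, so each search only costs one lookup per query word; sequences with no shared word are never examined at all.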

Next the potential alignments are expanded, adding up a score (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size stretches in common; consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
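The example above can be checked directly: how long is the longest run of consecutive identical bases in that alignment? A short Python sketch (the function name is made up; the sequences are the two from the slide, compared position by position without gaps):

```python
def longest_exact_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive identical positions."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


seq_1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_exact_run(seq_1, seq_2)
print(f"longest exact run: {run} (word size needed to seed: 11)")
```

The two sequences are identical at 50 of 56 positions, yet no run reaches 11, so a word size of 11 finds no seed and the alignment is never even attempted. Lowering the word size would recover it, at the cost of a slower search.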

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~$5.0 \times 10^{8}$)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~$1.4 \times 10^{10}$)

Notation

1.2e-35 = $1.2 \times 10^{-35}$

$4.8 \times 10^{6}$ = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~$5.0 \times 10^{8}$ letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'

Expected number of matches = $(5.0 \times 10^{8}) / 4 \approx 1.2 \times 10^{8}$

Query = 'AC'

Expected number of matches = $(5.0 \times 10^{8}) / (4 \times 4) \approx 3.1 \times 10^{7}$

Query = 'ACG'

Expected number of matches = $(5.0 \times 10^{8}) / (4 \times 4 \times 4) \approx 7.8 \times 10^{6}$

Query = 'ACGTCGA…CTGATTCG' – 60-mer

Expected number of matches = $(5.0 \times 10^{8}) / 4^{60} \approx (5.0 \times 10^{8}) / 10^{36} = 5.0 \times 10^{-28}$

E-value = $5.0 \times 10^{-28}$
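The arithmetic above is easy to reproduce. A minimal Python sketch of the same first-principles model follows (exact, ungapped matches; all four bases equally likely; edge effects ignored). The function name is made up, and this is not the statistic BLAST actually reports.

```python
def expected_exact_matches(database_letters: float, query_length: int,
                           alphabet_size: int = 4) -> float:
    """Expected number of chance exact matches of a query in a random database."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 503_566_580  # total letters quoted on the statistics slide

for length in (1, 2, 3, 60):
    e = expected_exact_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:>2}: expected matches = {e:.2g}")
```

Note that $4^{60}$ is about $1.3 \times 10^{36}$, so the exact result for the 60-mer comes out slightly below the rounded $5.0 \times 10^{-28}$ used above; the order of magnitude is what matters.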

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
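The same calculation can be done for any simple motif. Here is a small Python sketch under the same assumptions as the answer above (all four bases equally likely, one strand only, edge effects ignored); the function names are made up, and only plain bases and bracketed alternatives such as [TG] are understood.

```python
import re


def match_probability(motif: str) -> float:
    """Chance that a random position matches the motif.

    A plain base matches with probability 1/4; a bracketed class
    such as [TG] matches with probability (number of choices)/4.
    """
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        choices = len(token) - 2 if token.startswith("[") else 1
        probability *= choices / 4
    return probability


def expected_sites(motif: str, sequence_length: int) -> float:
    """Expected number of chance occurrences in a random sequence."""
    return sequence_length * match_probability(motif)


print(f"one site every {1 / match_probability('ACC[TG]TA'):.0f} nt")
print(f"expected in 10k: {expected_sites('ACC[TG]TA', 10_000):.1f}")
```

For ACC[TG]TA this gives one site per 2048 nt, i.e. just under 5 chance occurrences per 10k, matching the hand calculation.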

E-values: Effect of Database Size

The nr database has ~$1.4 \times 10^{10}$ letters (was RefSeq and $5.0 \times 10^{8}$). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'

Expected number of matches = $(1.4 \times 10^{10}) / 4 \approx 3.5 \times 10^{9}$

Query = 'AC'

Expected number of matches = $(1.4 \times 10^{10}) / (4 \times 4) \approx 8.8 \times 10^{8}$

Query = 'ACG'

Expected number of matches = $(1.4 \times 10^{10}) / (4 \times 4 \times 4) \approx 2.2 \times 10^{8}$

Query = 'ACGTCGA…CTGATTCG' – 60-mer

Expected number of matches = $(1.4 \times 10^{10}) / 4^{60} \approx (1.4 \times 10^{10}) / 10^{36} = 1.4 \times 10^{-26}$

E-value = $1.4 \times 10^{-26}$

(was E-value = $5.0 \times 10^{-28}$)

E-values: Effect of Database Size

The E-value is simply dependent on database size

RefSeq: $5.0 \times 10^{8}$ letters

nr: $1.4 \times 10^{10}$ letters (~30 x bigger)

BLAST the same sequence against each:

RefSeq: E-value = 5.0e-28

nr: E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was $5.0 \times 10^{-28}$. But our actual BLAST search at NCBI gave a value of $2.0 \times 10^{-26}$ – about 40x larger – why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

database

BLAST 500 nt sequence against a database

BLASTn

Get a full length match with sequence XYZ at an E-value = 5.0e-160

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ $5.0 \times 10^{-19}$

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length PI size of databasehellip

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger = worse … smaller = better

E-values: $10^{-5}$, $10^{-10}$, $10^{-40}$, $10^{-100}$, 0.0

fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~$(3.5 \times 10^{8}) / 20^{20} \approx (3.5 \times 10^{8}) / 10^{26} = 3.5 \times 10^{-18}$

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

residue: acceptable substitutes

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get E-value = ~$(3.5 \times 10^{8}) / 7^{20}$ = ~$4.4 \times 10^{-9}$, which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
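The effect of substitutability on the expected number of chance matches can be reproduced with a few lines of Python. This is a back-of-envelope sketch only: the substitution table is my reading of the slide (not a BLOSUM or PAM matrix), the function name is made up, and the 'effective alphabet' of 7 is the slide's own rounding.

```python
# Acceptable substitutes per residue, as read from the slide above.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV", "L": "IMFV",
    "M": "ILV", "P": "", "V": "ILM", "W": "FY", "N": "DHS", "Q": "REHK",
    "S": "ANT", "T": "S", "Y": "HFW", "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}
REFSEQ_PROTEIN_LETTERS = 347_895_532  # from the BLAST output shown earlier


def expected_matches(database_letters: float, query_length: int,
                     effective_alphabet: float) -> float:
    """Expected chance matches if each position matches with probability 1/effective_alphabet."""
    return database_letters / effective_alphabet ** query_length


mean_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)
exact_only = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"exact matches only:   {exact_only:.1e}")
print(f"with substitutions:   {with_substitutions:.1e}")
```

Allowing conservative substitutions raises the expected number of chance matches by about nine orders of magnitude, which accounts for most of the gap between the naive estimate and the E-value that BLAST reports.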

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are…

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn – low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence

ON

OFF

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises: 6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


locus

Gene to Protein

~ gene

mRNA

protein

genome

primary transcript

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

Sequence Signals

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

mRNA

MLTIL AL

Genomic Signals

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

===CACGATCGAGTC===================

promoters

enhancers

==ACGTA…………CAGTA====================

splice sites

Derivative Sequences

mRNA

capture by cloning into cDNA library

3′ EST

5′ EST

cDNA sequence

EST: single-pass sequence from each end of the clone

cDNA: multiple-pass sequencing over the whole length of the clone

5′ 3′

Gene Models

gene model

exons

Sequences and Genes (Accession Numbers and Names)

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

proteins

mRNAs/cDNAs

S43105.1 ‘similar to Cyclin B1 [mus musculus]’

gene

BT006437.1 ‘Cyclin B1 isoform 1 [mus musculus]’

X58708.1

NM_111985.3 ‘CCNB1 Cyclin B1 [mus musculus]’

‘Cyclin B1 isoform 2 [mus musculus]’

Gene Symbols, Names, Etc.

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases: CCNB, CYCB1

A Gene-Centric View

Entrez Gene: http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1: protein translated from mRNA

XM_001102567.1: predicted mRNA

XP_001089765.1: predicted protein translated from predicted mRNA
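
As a rough illustration of how these identifiers are put together, the sketch below splits an identifier into its accession and version and looks up the RefSeq prefixes named on the slide. The function name, the regular expression and the fallback label are assumptions made for this example, not an NCBI tool.

```python
import re

# RefSeq prefixes mentioned on the slide; the descriptions paraphrase it.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein translated from mRNA",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein translated from predicted mRNA",
}

def parse_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version, description)."""
    match = re.fullmatch(r"([A-Z]{1,3}_?\d+)\.(\d+)", identifier)
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    accession, version = match.group(1), int(match.group(2))
    # Anything without a listed prefix is simply labelled as a non-RefSeq record.
    description = REFSEQ_PREFIXES.get(accession[:3], "other (non-RefSeq) record")
    return accession, version, description

for name in ("NM_001015922.1", "NM_001015922.2", "BC009638.1", "XP_001089765.1"):
    print(parse_accession(name))
```

The two NM_001015922 entries share an accession and differ only in version, which is the point of the slide: a revised sequence keeps its accession but gets a new version number (and a new gi).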

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites
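
The splicing shown above can be sketched in a few lines. The exon and intron strings are copied from the slide; the helper names are made up for this example, and real splice-site recognition is far less clear-cut than a literal GT…AG check.

```python
# Exon and intron sequences copied from the slide's toy gene.
EXONS = [
    "CTACCATCCATGCTAACCATTCTAC",
    "CATTTTATACTCATGCAACGGACCGT",
    "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA",
]
INTRONS = [
    "GTAAGTCATCTATATCAATATTATTTCAG",
    "GTCAGTATTACAG",
]

def build_genome(exons, introns):
    """Interleave the pieces: exon, intron, exon, intron, exon."""
    parts = []
    for i, exon in enumerate(exons):
        parts.append(exon)
        if i < len(introns):
            parts.append(introns[i])
    return "".join(parts)

def splice(genome, introns):
    """Remove each intron in order, checking that it starts GT and ends AG."""
    mrna, cursor = [], 0
    for intron in introns:
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError("intron does not follow the GT...AG rule")
        start = genome.index(intron, cursor)
        mrna.append(genome[cursor:start])
        cursor = start + len(intron)
    mrna.append(genome[cursor:])
    return "".join(mrna)

genome = build_genome(EXONS, INTRONS)
mrna = splice(genome, INTRONS)
print(mrna)
```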

Gene Predictions

Given
- coding sequence must run from ATG – STOP codon, in-frame
- introns (GT … AG) can be spliced out

Also take a statistical approach
- coding and non-coding sequence are slightly different in composition
- some ‘possible’ splice sites are more likely than others
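
The first rule (ATG to an in-frame STOP) is easy to sketch. The function below is only an illustration of that single rule in the three forward frames; it ignores splicing, the reverse strand and all of the statistical evidence a real gene predictor uses, and the function name is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna):
    """Return ORFs (ATG through the first in-frame stop) in the 3 forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                orfs.append(dna[start:i + 3])  # close it at the stop codon
                start = None
    return orfs

# The first 21 bases of the slide's example sequence.
print(find_orfs("CGTCGTATGGCTTCGATGTAG"))
```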

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence hellip

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that, in the absence of EST evidence, it is only really possible to predict coding sequence with any confidence (and even then…)

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model

exons 1 2 3 4

We’ve now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn’t lose sight of the fact that we don’t really know if these predicted proteins exist – especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs – millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs – tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes – one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So What’s in the Databases Now?

NCBI, July 2005

DNA: 15,000,000 ESTs; 3,300,000 cDNAs

Proteins: 2,700,000 proteins (nr); 950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A

ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences: this is what we call homology – detectable similarity because of common evolutionary origin

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
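
The match lines under these alignments can be reproduced with a short sketch. It assumes two ungapped sequences of equal length, as on the slide; the function names are made up for the example.

```python
def match_line(seq1, seq2):
    """'|' where two equal-length, ungapped sequences agree, ' ' elsewhere."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return "".join("|" if a == b else " " for a, b in zip(seq1, seq2))

def percent_identity(seq1, seq2):
    """Percentage of aligned positions that are identical."""
    line = match_line(seq1, seq2)
    return 100.0 * line.count("|") / len(line)

species1 = "ATGCATGCTGCCAACGGATGCCCTG"
species2 = "ATGAAAGCCGCCTACGACAGTCCTG"   # recent divergence
distant = "ATGGAAGGCGCTTAGGATAGTCCAG"    # after a longer period of evolution

for other in (species2, distant):
    print(species1)
    print(match_line(species1, other))
    print(other, f"{percent_identity(species1, other):.0f}% identical")
```

For the slide's sequences this gives 17 of 25 identical positions for the first pair and 13 of 25 for the second.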

Computers Can Detect Homology

In fact computers are very good at this task – the two primary challenges are

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist’s attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation

The two copies of Gene A will now evolve independently but will continue to have the ~same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A Arsquo

A

‘Other’-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

Arsquo A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor

A

A

2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA  A  alanine
GCC  A
GCG  A
GCT  A

TGC  C  cysteine
TGT  C

GAC  D  aspartate
GAT  D

GGA  G  glycine
GGC  G
GGG  G
GGT  G

‘Synonymous’ or ‘silent’ mutations in the third position of codon triplets like these have no effect on the amino acid coded for – so there is no evolutionary pressure against this…
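
A small sketch makes the redundancy concrete. It builds the standard genetic code from the conventional 64-letter amino acid string (codons enumerated in T, C, A, G order) and checks which third-position changes are silent; the helper names are assumptions for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def third_position_variants(codon):
    """Map each third-position variant of a codon to its amino acid."""
    return {codon[:2] + base: CODON_TABLE[codon[:2] + base] for base in BASES}

def is_synonymous(codon1, codon2):
    """True if both codons code for the same amino acid."""
    return CODON_TABLE[codon1] == CODON_TABLE[codon2]

print(third_position_variants("GCA"))   # alanine whatever the third base is
print(third_position_variants("GAC"))   # aspartate or glutamate
print(is_synonymous("TGC", "TGT"))
```

Alanine and glycine tolerate any third base, while for GA- codons only two of the four third bases keep aspartate, so not every third-position change is silent.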

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67% (DNA), 100% (protein)

44% (DNA), 80% (protein)

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
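
The comparison above can be checked with a short sketch that translates both DNA sequences with the standard genetic code and counts identical positions at each level. The function names are illustrative; the ‘+’ (similar residue) marks on the slide are not reproduced here, only exact identities.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA in frame 1, one amino acid letter per complete codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identity(seq1, seq2):
    """Return (identical positions, alignment length) for ungapped sequences."""
    return sum(a == b for a, b in zip(seq1, seq2)), len(seq1)

dna1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein1, protein2 = translate(dna1), translate(dna2)

print(protein1, protein2)
print("DNA identity:", identity(dna1, dna2))
print("protein identity:", identity(protein1, protein2))
```

For these two sequences the DNA is identical at 28 of 39 positions (about 72%), while the proteins are identical at 11 of 13 (about 85%).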

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, ‘surfeit1’ for frog and fly.

Go to the NCBI BLAST home page, then ‘Align two sequences’ (bottom left, ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit ‘Align’, change the ‘Program’ (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment

data mining

gene sequence what is its function

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation ‘Gravin’ available

Xenopus gene: function unknown

sequence similarity, orthologs, same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in

frog protein

database of human proteins

best match human protein

database of frog proteins

x
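
The reciprocal test itself is simple once the two sets of best hits are in hand. The sketch below assumes the best hits have already been extracted from two BLAST runs into dictionaries; the protein names and tables are hypothetical and purely for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )

# Hypothetical best-hit tables, e.g. parsed from two BLASTp runs.
frog_to_human = {"frog_ccnb1": "human_CCNB1", "frog_x": "human_CCNB1"}
human_to_frog = {"human_CCNB1": "frog_ccnb1"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_x also hits human_CCNB1 best, but the human protein's best frog hit is frog_ccnb1, so only that pair passes the reciprocal test.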

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it and BLAST your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || ||| | | | |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps

Take two random sequences with no ‘real’ similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow ‘cost free’ gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | | | | | ||| |||| ||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’

We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
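
The trade-off between matches and gap costs can be illustrated with a minimal global alignment score (Needleman-Wunsch, score only, no traceback). This is a sketch under simple assumptions: the scores +1/-1 and a single linear gap cost are chosen for illustration, whereas BLAST uses separate gap open and gap extension penalties and a faster heuristic search.

```python
def global_alignment_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Best global alignment score with a linear gap cost (dynamic programming)."""
    previous = [j * gap for j in range(len(seq2) + 1)]
    for i, a in enumerate(seq1, start=1):
        current = [i * gap]
        for j, b in enumerate(seq2, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

# The two random sequences from the slide.
random1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
random2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free = global_alignment_score(random1, random2, gap=0)     # 'cost free' gaps
costly = global_alignment_score(random1, random2, gap=-2)  # gaps are penalised
print("score with free gaps:", free)
print("score with gap cost -2:", costly)
```

With free gaps the random pair can always be made to look good; once each gap costs something, the best achievable score drops.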

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively more inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

program   query sequence   other operation                       database sequences   speed
BLASTn    nucleotide       none                                  nucleotide           FAST
BLASTp    protein          none                                  protein              FAST
BLASTx    nucleotide       6 frame translation of the query      protein              SLOW
tBLASTn   protein          6 frame translation of the database   nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both           nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

[Figure: the query is slid along the 1st database sequence and the matches are counted at each offset; the best alignment, shown here, needs a one-base gap in the database sequence]

1st database sequence (top), query (bottom):

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible ‘words’

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible ‘words’

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
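
The indexing idea on these two slides can be sketched as follows. This is a simplified illustration rather than BLAST's actual implementation: words are used directly as dictionary keys instead of being numbered, the word size of 8 follows the slide's example words, and the function names are assumptions.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word to the (sequence number, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_number, seq in enumerate(database, start=1):
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_number, pos + 1))
    return index

def find_seeds(query, index, word_size):
    """Return (query position, sequence number, database position) for each shared word."""
    seeds = []
    for pos in range(len(query) - word_size + 1):
        for seq_number, db_pos in index.get(query[pos:pos + word_size], []):
            seeds.append((pos + 1, seq_number, db_pos))
    return seeds

# The three database sequences and the query from the slides.
database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)
seeds = find_seeds(query, index, word_size=8)
print(seeds[:3])
```

The index is built once per database, so each query only needs one dictionary lookup per word to find candidate sequences and their relative positions.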

Expand from Word Based Matches

We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
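
The word-size problem in this example can be checked directly. The sketch below measures the longest run of identical positions in the ungapped alignment shown above; a word-based search can only seed this alignment if that run is at least as long as the word size. The function names are made up for the example.

```python
def longest_exact_run(seq1, seq2):
    """Length of the longest run of identical positions in an ungapped alignment."""
    best = run = 0
    for a, b in zip(seq1, seq2):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best

def shares_word_on_diagonal(seq1, seq2, word_size):
    """True if the alignment contains an exact match of at least word_size letters."""
    return longest_exact_run(seq1, seq2) >= word_size

seq1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

print("longest exact run:", longest_exact_run(seq1, seq2))
print("seeded with word size 11:", shares_word_on_diagonal(seq1, seq2, 11))
print("seeded with word size 7:", shares_word_on_diagonal(seq1, seq2, 7))
```

The longest exact run here is 10 letters, so a word size of 11 finds no seed for this alignment even though 50 of the 56 positions match, while a word size of 7 does.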

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query 26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196
           MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct 1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
           SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct 60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also “expect value” or “expectation”

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
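
The same back-of-envelope calculation can be written as a one-line function. This is only the naive model from the slide (exact matches, four equally likely and independent letters), not how BLAST actually computes its Expect value; the function and constant names are assumptions.

```python
def naive_expected_matches(database_letters, query_length, alphabet_size=4):
    """Expected exact matches by chance, assuming equally likely, independent letters."""
    return database_letters / alphabet_size ** query_length

REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size from the slide

for length in (1, 2, 3, 60):
    expected = naive_expected_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: {expected:.2e}")
```

Note that 4^60 is about 1.3 x 10^36, so the unrounded figure for the 60-mer is nearer 3.8 x 10^-28; the slide rounds 4^60 to 10^36 to reach 5.0 x 10^-28.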

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

(theoretical value was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4 x 4 x 4 = 64 nt

Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64 x 2 nt = 128 nt

Expect ‘TA’ every 4 x 4 = 16 nt
Expect ‘ACC[TG]TA’ every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
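
The answer for the basic motif can be reproduced with a small sketch. It assumes the four bases are equally likely and independent, ignores edge effects at the ends of the promoter, and only understands plain bases and bracketed alternatives such as [TG]; the function names are made up for the example.

```python
import re

def motif_probability(motif, alphabet_size=4):
    """Chance that a random position matches a motif such as 'ACC[TG]TA'."""
    probability = 1.0
    for bracketed, single in re.findall(r"\[([ACGT]+)\]|([ACGT])", motif):
        allowed = len(set(bracketed)) if bracketed else 1
        probability *= allowed / alphabet_size
    return probability

def expected_sites(motif, sequence_length):
    """Expected number of chance matches in a sequence of the given length."""
    return sequence_length * motif_probability(motif)

print(1 / motif_probability("ACC[TG]TA"))   # one site expected every 2048 nt
print(expected_sites("ACC[TG]TA", 10_000))  # about 5 sites per 10k
```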

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = ‘AC’
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = ‘ACG’
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = ‘ACGTCGA…CTGATTCG’ (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30x bigger)

BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our ‘match’ criteria

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length

Why not just use identity?

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what’s the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST: Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger = worse, smaller = better

E-value    10^-5     10^-10       10^-40        10^-100       0.0
           fantasy   borderline   encouraging   pretty good   can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.3 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

A  S
C
F  LWY
G
I  LMV
L  IMFV
M  ILV
P
V  ILM
W  FY

N  DHS
Q  REHK
S  ANT
T  S
Y  HFW

H  NQY
K  RQE
R  QK

D  NE
E  DQK

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (3 acceptable residues out of 20) rather than 1/20th

So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices
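The two back-of-envelope estimates can be reproduced with a short Python sketch. It assumes the simplified model from the slides (database letters divided by the per-position odds raised to the match length); the function and constant names are illustrative, and the 1-in-7 figure is the rough approximation given above, not a value read from a BLOSUM or PAM matrix.

```python
DB_LETTERS = 347_895_532   # RefSeq protein letters quoted above
MATCH_LENGTH = 20          # length of the exact amino acid match

def expected_chance_matches(db_letters, match_length, per_position_odds):
    """Simplified expectation: each position 'matches' by chance 1 time in per_position_odds."""
    return db_letters / per_position_odds ** match_length

strict = expected_chance_matches(DB_LETTERS, MATCH_LENGTH, 20)   # identical residues only
relaxed = expected_chance_matches(DB_LETTERS, MATCH_LENGTH, 7)   # ~3 acceptable residues per position

print(f"1 in 20 per position: {strict:.1e}")
print(f"1 in 7 per position:  {relaxed:.1e}")
```

Allowing substitutions moves the estimate by roughly nine orders of magnitude, which is why the relaxed figure is so much closer to the reported Expect value.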

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
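If you drive the command line from a script, the same call can be assembled as an argument list. The following is a minimal Python sketch for the legacy `blastall` program described here; it only builds and prints the command, it does not run it. The helper name `build_blastall_command`, the file name `query.fasta` and the database name `refseq_mrna` are hypothetical.

```python
import shlex

def build_blastall_command(program, query_file, database, extra_options=None):
    """Assemble a legacy 'blastall' command line as a list of arguments."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Optional extras, e.g. {"-e": 1e-10} for a maximum expected value.
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command

# Hypothetical file and database names, for illustration only.
cmd = build_blastall_command("blastn", "query.fasta", "refseq_mrna", {"-e": 1e-10, "-W": 11})
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one long string) means it could be handed to `subprocess.run` without worrying about shell quoting.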

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn – low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT
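If you build such limits often, the query text can be assembled programmatically. This is a minimal Python sketch that only formats the string to paste into the box; `organism_term` and `combine` are hypothetical helpers, not part of any NCBI tool, and the second organism is just an example.

```python
def organism_term(name):
    """Format one organism restriction using the [ORGN] field tag."""
    return f"{name}[ORGN]"

def combine(terms, operator="OR"):
    """Join Entrez terms with a logical operator (AND, OR or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)

query = combine([organism_term("Drosophila melanogaster"),
                 organism_term("Xenopus tropicalis")])
print(query)
```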

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence

Run the search with the default values

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST – Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

Sequence Signals

CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA

mRNA

MLTIL AL

Genomic Signals

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

===CACGATCGAGTC===================

promoters

enhancers

==ACGTA…………CAGTA====================

splice sites

Derivative Sequences

mRNA

capture by cloning into cDNA library

3' EST

5' EST

cDNA sequence

EST: single pass sequence from each end of the clone

cDNA: multiple pass sequencing over whole length of the clone

5' 3'

Gene Models

gene model
exons

Sequences and Genes (Accession Numbers and Names)

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

proteins

S43105.1
mRNAs/cDNAs
'similar to Cyclin B1 [Mus musculus]'

gene

BT006437.1 'Cyclin B1 isoform 1 [Mus musculus]'

X58708.1

NM_111985.3 'CCNB1 Cyclin B1 [Mus musculus]'

'Cyclin B1 isoform 2 [Mus musculus]'

Gene Symbols, Names, Etc.

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases: CCNB CYCB1

A Gene-Centric View

Entrez Gene
http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1: protein translated from mRNA

XM_001102567.1: predicted mRNA

XP_001089765.1: predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene Predictions

Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT … AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence …

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model
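The first rule (coding sequence runs in-frame from ATG to a STOP codon) is easy to sketch in Python. This is only a toy scan of the forward strand that ignores introns and the statistical evidence mentioned above, so it is nowhere near a real gene predictor; `find_orfs` and `min_codons` are names invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start_index, orf_sequence) for every ATG ... STOP run read in-frame.

    Forward strand only; indices are 0-based. Introns are ignored in this sketch.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Step through codons in the frame set by this ATG until a stop codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                orf = dna[start:pos + 3]
                if len(orf) // 3 >= min_codons:
                    orfs.append((start, orf))
                break
    return orfs

# The genomic example sequence from the slide.
genomic = "CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA"
for start, orf in find_orfs(genomic):
    print(start, orf)
```

Without splicing, the first ATG hits a stop codon after only a few codons; a gene model that removes a GT … AG intron can read through instead, which is what the candidate models above are exploring.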

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…)

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model
exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist – especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs – millions, £10 each
Cheap to sequence – so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels

cDNAs – tens of thousands, £1000 each
Expensive – but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences

Genomes – one, £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation

So What's in the Databases Now?

NCBI, July 2005

DNA
15,000,000 ESTs
3,300,000 cDNAs

Proteins
2,700,000 proteins (nr)
950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring – then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences; this is what we call homology – detectable similarity because of common evolutionary origin

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…

Computers Can Detect Homology

In fact computers are very good at this task – the two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation

The two copies of Gene A will now evolve independently but will continue to have ~the same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

A' A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor

A

A

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for – so there is no evolutionary pressure against them…

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

DNA identity 67%, protein identity 100%

DNA identity 44%, protein identity 80%

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1', for frog and fly

Go to NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align' – this will do a direct DNA/DNA comparison

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two sequences comparison

Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment

data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity
orthologs

same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in

frog protein

database of human proteins

best match human protein

database of frog proteins

x
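The reciprocal best hit test itself is simple once the two BLAST searches have been reduced to "best hit" tables. This is a minimal Python sketch under that assumption; the function name `reciprocal_best_hits` and the protein identifiers are hypothetical, and parsing real BLAST output into such tables is left out.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b AND b's best hit is a."""
    return sorted((a, b) for a, b in best_a_to_b.items()
                  if best_b_to_a.get(b) == a)

# Hypothetical best-hit tables, e.g. summarised from two BLAST runs.
frog_to_human = {"frog_1": "human_A", "frog_2": "human_B", "frog_3": "human_B"}
human_to_frog = {"human_A": "frog_1", "human_B": "frog_3"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_2 also hits human_B, but human_B's own best frog match is frog_3, so only frog_3 is kept as the candidate ortholog.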

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These, of course, represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
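The trade-off between matches and gap costs can be made concrete with a tiny scoring function. This is a minimal Python sketch: the reward and penalty values are arbitrary illustrations (not BLAST defaults), `alignment_score` is a hypothetical helper, and the rows are the 'cost free' gapped alignment of the two random sequences, padded with '-' so that both rows have the same length.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length alignment rows; '-' marks a gap in either row."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # any column containing a gap pays the gap cost
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

row_a = "GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA"
row_b = "-ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG------"

print("cost-free gaps:", alignment_score(row_a, row_b, gap=0))
print("gaps cost 2:   ", alignment_score(row_a, row_b, gap=-2))
```

With free gaps the random pair looks like a perfect run of matches; as soon as gaps carry a cost the same alignment scores well below zero, which is the behaviour we want.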

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

flavour   query sequence   other operation                             database sequences   speed
BLASTn    nucleotide       none                                        nucleotide           FAST
BLASTp    protein          none                                        protein              FAST
BLASTx    nucleotide       6 frame translation of the query            protein              SLOW
tBLASTn   protein          6 frame translation of the database         nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database   nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible lsquowordsrsquo

Build a position index of all words in the database
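The word-and-index idea can be sketched in a few lines of Python. This is a toy illustration only: it indexes the words themselves in a dictionary rather than word numbers, uses an 8-letter word as on the slide, and makes no attempt to mirror BLAST's real data structures. All function names are made up for the example.

```python
from collections import defaultdict

WORD_SIZE = 8  # illustrative; the word length shown on the slide

def words(sequence, size=WORD_SIZE):
    """Yield (1-based position, word) for every overlapping word in the sequence."""
    for i in range(len(sequence) - size + 1):
        yield i + 1, sequence[i:i + size]

def build_index(database, size=WORD_SIZE):
    """Map each word to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, sequence in database.items():
        for position, word in words(sequence, size):
            index[word].append((seq_id, position))
    return index

def seed_hits(query, index, size=WORD_SIZE):
    """List (query position, sequence id, database position) for every shared word."""
    return [(qpos, seq_id, dpos)
            for qpos, word in words(query, size)
            for seq_id, dpos in index.get(word, [])]

database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
index = build_index(database)          # done once, before any query arrives
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"
hits = seed_hits(query, index)
print(hits[:3])
```

The index is built once per database, so each query only costs a dictionary lookup per word instead of a scan of every database sequence.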

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible lsquowordsrsquo

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11 (the longest exact run in the alignment below is 10 nt)

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
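Whether a pair like this could be seeded at all depends on its longest run of exact matches. A minimal Python check is sketched below, using the two sequences above; `longest_exact_run` is a hypothetical helper, and the word size of 11 is the BLASTn default quoted in this workshop.

```python
def longest_exact_run(seq_a, seq_b):
    """Length of the longest stretch of consecutive identical positions (ungapped)."""
    longest = current = 0
    for a, b in zip(seq_a, seq_b):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest

seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"
WORD_SIZE = 11  # BLASTn default word size quoted above

run = longest_exact_run(seq_a, seq_b)
print("longest exact run:", run)
print("can be seeded with this word size:", run >= WORD_SIZE)
```

The two sequences are nearly 90% identical, yet no exact word of length 11 is shared, so a search with that word size would not find the match; a smaller word size would.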

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences, 954,378 sequences, 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also called "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
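The first-principles estimate is simply the database size divided by 4 raised to the query length. A minimal sketch, assuming a random database with equal base frequencies and exact ungapped matches only (this is not how BLAST itself computes E-values); the function name is illustrative:

```python
def expected_chance_matches(db_letters: float, query_len: int, alphabet: int = 4) -> float:
    """Expected exact matches of one query in a random database of db_letters."""
    return db_letters / alphabet ** query_len

# The short example sequence from the slide, used as a toy 'database'.
db = "CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG"
observed = {q: sum(db.startswith(q, i) for i in range(len(db)))
            for q in ("A", "AC", "ACG")}
for q, n in observed.items():
    print(q, n, round(expected_chance_matches(len(db), len(q)), 1))

sixty_mer = expected_chance_matches(5.0e8, 60)
print(f"{sixty_mer:.1e}")
```

In the 68-letter example the observed counts (14, 4 and 1) sit close to the expected 17, 4.2 and 1.1. For the 60-mer, 4^60 is about 1.3 x 10^36, so the unrounded figure comes out slightly below the slide's 5.0 x 10^-28, which rounds 4^60 down to 10^36.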

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10 kb by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
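The same counting argument can be written as a short function. This is a sketch under stated assumptions: equal base frequencies, one strand only, and a motif syntax limited to single bases and bracketed alternatives as in the exercise. The function name is illustrative.

```python
import re

def expected_sites(motif: str, seq_len: int) -> float:
    """Expected chance occurrences of a motif on one strand of a random sequence."""
    # Each token is either one base or a bracketed set of alternatives.
    tokens = re.findall(r"\[[ACGT]+\]|[ACGT]", motif)
    probability = 1.0
    for token in tokens:
        choices = len(token) - 2 if token.startswith("[") else 1
        probability *= choices / 4
    return seq_len * probability

sites = expected_sites("ACC[TG]TA", 10_000)
print(round(sites, 2))
```

For the motif in the exercise this is 10,000 / 2048, i.e. roughly 5 chance sites in 10 kb, in line with the answer above.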

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq mRNA and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:

RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
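Because the estimate is linear in database size, the ratio of the two E-values equals the ratio of the database sizes. A small check using the exact letter counts quoted in the database statistics (July 2005); the function is the same hedged first-principles estimate, not BLAST's calculation:

```python
REFSEQ_MRNA_LETTERS = 503_566_580
NR_LETTERS = 14_601_814_750

def first_principles_e_value(db_letters: int, query_len: int = 60) -> float:
    """Database size divided by the number of possible queries of this length."""
    return db_letters / 4 ** query_len

ratio = first_principles_e_value(NR_LETTERS) / first_principles_e_value(REFSEQ_MRNA_LETTERS)
print(round(ratio, 1))
```

The ratio comes out at about 29, which the slide rounds to ~30.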

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
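Percent identity takes no account of length, which is the point of the comparison above. A minimal sketch (the function name is illustrative) confirming that both example alignments score the same identity:

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)
identity2 = percent_identity(query2, match2)
print(len(query1), identity1)
print(len(query2), identity2)
```

Both come out at 75%, from 45 of 60 and 12 of 16 positions respectively, although the E-values quoted for them differ by many orders of magnitude.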

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database...

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger = worse, smaller = better

E-value    Rule of thumb
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue  Can be substituted by
A        S C
F        L W Y
G        (none)
I        L M V
L        I M F V
M        I L V
P        (none)
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
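The two protein estimates can be reproduced in a few lines. This is only the back-of-envelope model from the slide (database letters divided by the per-position odds raised to the peptide length), not the statistics BLAST actually uses; the variable names are illustrative:

```python
DB_LETTERS = 3.5e8    # RefSeq protein database, rounded as on the slide
PEPTIDE_LEN = 20

exact_only = DB_LETTERS / 20 ** PEPTIDE_LEN          # 1/20 chance per position
with_substitutions = DB_LETTERS / 7 ** PEPTIDE_LEN   # ~1/7 chance per position

print(f"{exact_only:.1e}")
print(f"{with_substitutions:.1e}")
```

Allowing substitutions moves the estimate from the order of 10^-18 to the order of 10^-9, much nearer the 8e-04 that BLAST reports, with gapped alignments and other factors accounting for the rest.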

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
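When running many searches it can help to assemble the command line in a script. The sketch below only builds and prints the argument list; it does not run anything. The file name `query.fasta` and database name `refseq_rna` are hypothetical placeholders, the helper function is illustrative, and the flags are the legacy `blastall` options listed above.

```python
import shlex

def blastall_command(program: str, query_file: str, database: str, **options) -> list[str]:
    """Build a legacy 'blastall' argument list; options map flag letter to value."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd

# Hypothetical search: raise the E-value cut-off, shrink the word size,
# and switch the low-complexity filter off.
cmd = blastall_command("blastn", "query.fasta", "refseq_rna", e=100, W=7, F="F")
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` on a machine where the legacy BLAST programs and a formatted database are actually installed, which is assumed here rather than shown.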

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5 E-value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Genomic Signals

transcription start site

===CGCTATAAGCG====================

===CGCAATAAAGCG===================

polyadenylation signal

===CACGATCGAGTC===================

promoters

enhancers

==ACGTA............CAGTA====================

splice sites

Derivative Sequences

mRNA

capture by cloning into cDNA library

3' EST

5' EST

cDNA sequence

EST: single-pass sequence from each end of the clone

cDNA: multiple-pass sequencing over the whole length of the clone

5' 3'

Gene Models

gene model

exons

Sequences and Genes (Accession Numbers and Names)

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

proteins

S43105.1

mRNAs/cDNAs

'similar to Cyclin B1 [Mus musculus]'

gene

BT006437.1 'Cyclin B1 isoform 1 [Mus musculus]'

X58708.1

NM_111985.3 'CCNB1 Cyclin B1 [Mus musculus]'

'Cyclin B1 isoform 2 [Mus musculus]'

Gene Symbols Names Etc

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases: CCNB, CYCB1

A Gene-Centric View

Entrez Gene: http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1: protein translated from mRNA

XM_001102567.1: predicted mRNA

XP_001089765.1: predicted protein translated from predicted mRNA
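Identifiers of this kind have two parts, an accession and a version number after the dot, and the RefSeq prefix hints at what kind of record it is. A minimal parsing sketch; the function name and the lookup table are illustrative and cover only the prefixes mentioned on this slide:

```python
import re

REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
}

def parse_accession(identifier: str) -> tuple[str, int, str]:
    """Split 'ACCESSION.VERSION' and describe the RefSeq prefix if recognised."""
    m = re.fullmatch(r"([A-Z_]+\d+)\.(\d+)", identifier)
    if m is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    accession, version = m.group(1), int(m.group(2))
    kind = REFSEQ_PREFIXES.get(accession[:3], "other (not a RefSeq prefix listed here)")
    return accession, version, kind

for ident in ("NM_001015922.1", "NM_001015922.2", "BC009638.1", "XP_001089765.1"):
    print(parse_accession(ident))
```

The two NM_001015922 entries share an accession but differ in version, which is how an updated sequence for the same record is told apart from the earlier one.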

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites
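The genomic and mRNA sequences above differ only by the two introns, each beginning with GT and ending with AG. A short sketch that removes them and checks the GT...AG rule; the function is illustrative and takes the intron sequences as given rather than predicting them:

```python
def splice(genomic: str, introns: list[str]) -> str:
    """Remove each intron (in order) after checking the GT...AG rule."""
    exons, pos = [], 0
    for intron in introns:
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError("intron does not follow the GT...AG rule")
        start = genomic.index(intron, pos)
        exons.append(genomic[pos:start])
        pos = start + len(intron)
    exons.append(genomic[pos:])
    return "".join(exons)

genomic = ("CTACCATCCATGCTAACCATTCTAC" "GTAAGTCATCTATATCAATATTATTTCAG"
           "CATTTTATACTCATGCAACGGACCGT" "GTCAGTATTACAG"
           "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA")
introns = ["GTAAGTCATCTATATCAATATTATTTCAG", "GTCAGTATTACAG"]

mrna = splice(genomic, introns)
print(mrna)
```

The result is the three exons joined end to end, matching the mRNA line shown on the slide.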

Gene Predictions

Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT...AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence...

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...).

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.

exons 1 2 3 4

TheoreticalPredicted Sequences

genome

predicted gene modelexons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.

predicted transcript

predicted protein

Sequences for a model organism

ESTs - millions, £10 each
Cheap to sequence, so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels

cDNAs - tens of thousands, £1000 each
Expensive, but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences

Genomes - one, £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation

So What's in the Databases Now?

NCBI, July 2005

DNA: 15,000,000 ESTs; 3,300,000 cDNAs

Proteins: 2,700,000 proteins (nr); 950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence...

Gene sequence

Evolution by sequence mutation
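The series of sequences above can be checked mechanically: each step should differ from the previous one at exactly one position. A minimal sketch using the Hamming distance (the helper name is illustrative), with the slide's rate of one fixed mutation per 100,000 years:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))

lineage = [
    "ATGAAGGCTGCCTACGACTGCCGTG",
    "ATGCAGGCTGCCTACGACTGCCGTG",
    "ATGCAGGCTGCCAACGACTGCCGTG",
    "ATGCATGCTGCCAACGACTGCCGTG",
    "ATGCATGCTGCCAACGACTGCCCTG",
    "ATGCATGCTGCCAACGGCTGCCCTG",
    "ATGCATGCTGCCAACGGATGCCCTG",
    "ATGCATGCCGCCAACGGATGCCCTG",
    "ATGCATGCCGCCAACGGATGTCCTG",
]

steps = [hamming(a, b) for a, b in zip(lineage, lineage[1:])]
total = hamming(lineage[0], lineage[-1])
print(steps, total, total * 100_000)
```

Each step is a single substitution, and because no position changes twice the first and last sequences differ at 8 of 25 positions, corresponding to 800,000 years at the assumed rate.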

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

Gene duplication through speciation.

The two copies of Gene A will now evolve independently, but will continue to have the ~same function.

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS.

A

A

A

A' A

The Essential Paradigm

1 any group of modern species can be traced back to some extinct common ancestor

A

A

2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor

3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

Codons                Code  Amino acid
GCA, GCC, GCG, GCT    A     alanine
TGC, TGT              C     cysteine
GAC, GAT              D     aspartate
GGA, GGC, GGG, GGT    G     glycine

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this...

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67% (DNA)    100% (protein)

44% (DNA)    80% (protein)

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
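The comparison above can be reproduced by translating both DNA sequences with the standard genetic code and counting identities at each level. This is a self-contained sketch; the compact codon table is the standard code written out in TCAG order, and the helper names are illustrative:

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate a DNA sequence in frame 1, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identities(a: str, b: str) -> tuple[int, int]:
    """Identical positions and alignment length for two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)), len(a)

dna1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
prot1, prot2 = translate(dna1), translate(dna2)

print(prot1, prot2)
print(identities(dna1, dna2), identities(prot1, prot2))
```

The DNA sequences match at 28 of 39 positions (about 72%), while the proteins match at 11 of 13 (about 85%), with one of the two differences being the conservative M/I substitution marked '+' above.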

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment    data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity

orthologs

same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine, but it is probably SHAPE, not SEQUENCE, that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein

database of human proteins

best match human protein

database of frog proteins
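The reciprocal best hit idea reduces to a simple check once the best hit in each direction is known. In the sketch below the protein names and the two best-hit tables are hypothetical, standing in for results parsed from two BLASTp runs (frog against human, and human against frog); the function name is illustrative:

```python
def reciprocal_best_hits(a_vs_b: dict[str, str], b_vs_a: dict[str, str]) -> list[tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is also b's best hit."""
    return [(a, b) for a, b in a_vs_b.items() if b_vs_a.get(b) == a]

# Hypothetical best-hit tables, one entry per query protein.
frog_vs_human = {"frog_cyclinB1": "human_CCNB1", "frog_geneX": "human_geneY"}
human_vs_frog = {"human_CCNB1": "frog_cyclinB1", "human_geneY": "frog_geneZ"}

pairs = reciprocal_best_hits(frog_vs_human, human_vs_frog)
print(pairs)
```

Only the cyclin B1 pair survives, because the best frog hit for the hypothetical human_geneY is a different frog protein. This is the situation marked with a cross in the diagram, and it is why the method needs complete protein sets for both organisms.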

x

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gapshellip

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
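The effect of a gap cost can be checked with a small scoring function. This is a sketch only: the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not BLAST's defaults, and the two rows are the overlapping columns of the 'cost free' alignment of the two random sequences above.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-2):
    """Score two equal-length gapped rows column by column ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap          # a gap in either row is charged the gap cost
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# Overlapping columns of the 'cost free' alignment of the two random sequences.
row1 = "AC--ACT----AGGTCGATGC---GTGG---TG"
row2 = "ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG"

print(alignment_score(row1, row2, gap=0))    # gaps are free
print(alignment_score(row1, row2, gap=-2))   # every gap column costs 2
```

With free gaps the alignment looks excellent; once each gap column is charged, the same alignment scores below zero, which is the behaviour the gap penalty is meant to produce.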

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Figure: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

program   query sequence   other operation                              database sequences   speed
BLASTn    nucleotide       (none)                                       nucleotide           FAST
BLASTp    protein          (none)                                       protein              FAST
BLASTx    nucleotide       6 frame translation of the query             protein              SLOW
tBLASTn   protein          6 frame translation of the database          nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database    nucleotide           HORRIBLY SLOW

How does it work

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

query
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

1st database sequence
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

[Figure: the query is slid along the database sequence and the matches are marked at each offset; the best alignment puts one gap (=) in the query]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...
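To make the cost of the naive approach concrete, here is a minimal sketch that tries every ungapped offset of a query against one target and counts the matching bases. The short sequences are made up for illustration; gaps are not handled at all.

```python
def ungapped_matches(query, target):
    """Count matching bases for every offset of the query relative to the target."""
    results = {}
    for offset in range(-(len(query) - 1), len(target)):
        count = 0
        for i, base in enumerate(query):
            j = i + offset                      # position in the target
            if 0 <= j < len(target) and target[j] == base:
                count += 1
        results[offset] = count
    return results


# Made-up example sequences.
query = "ACGTAC"
target = "TTACGTACGG"

scores = ungapped_matches(query, target)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Every offset costs one pass over the query, so the work grows with (query length x target length) for each database sequence, before gaps are even considered.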

BLAST achieves its speed through two strategies

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
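The indexing idea can be sketched as follows. This is an illustration of the principle rather than BLAST's actual data structure: words are stored as plain strings instead of word numbers, and the word size of 8 simply follows the 8-letter words in the example above.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map each word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))  # 1-based
    return index


def seed_hits(query, index, word_size):
    """Look up every query word; return (query pos, sequence id, db pos)."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            hits.append((qpos + 1, seq_id, dpos))
    return hits


# The three database sequences and the query from the example above.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)   # built once, reused for every query
hits = seed_hits(query, index, word_size=8)
print(hits[:3])
```

The index is built once per database; each query then needs only one dictionary lookup per word, and the hits already say which sequence and which relative position to examine.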

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database - so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
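A quick way to see why this pair could be missed is to measure the longest run of identical bases in the alignment shown. The sketch below compares the two sequences position by position, without gaps, exactly as they are laid out above.

```python
def longest_identical_run(seq1, seq2):
    """Length of the longest run of identical bases at the same positions."""
    longest = current = 0
    for a, b in zip(seq1, seq2):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest


seq1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_identical_run(seq1, seq2)
print(run, run >= 11)   # is there a seed of at least the word size?
```

If the longest shared run is shorter than the word size, no seed is found along this alignment and it is never extended, however similar the sequences are overall.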

BLAST - Typical Output

INPUT

>partial cDNA sequence, Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: the chance occurrences of each query are marked along an example database sequence]

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
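The same first-principles arithmetic can be written as a one-line function. This is only the simplified model used on the slide (every database position is an independent chance of an exact match, all four bases equally likely), not the statistic BLAST itself reports.

```python
def expected_exact_matches(db_letters, query_length, alphabet_size=4):
    """Expected chance occurrences of one exact query in a random database."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 503_566_580   # total letters quoted for RefSeq mRNA above

for k in (1, 2, 3, 60):
    print(k, f"{expected_exact_matches(REFSEQ_MRNA_LETTERS, k):.2g}")
```

Note that 4^60 is about 1.3 x 10^36, so the unrounded figure for the 60-mer comes out slightly below the 5.0 x 10^-28 obtained by rounding the denominator to 10^36.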

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG].TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10 kb by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value. To allow either means we expect twice as many
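The exercise answer can be reproduced with a short calculation. The sketch assumes all four bases are equally likely and independent; the motif is written as a list of allowed bases per position, and the "any base" position in the second motif is an assumption about what the optional extra base means.

```python
def motif_probability(motif_positions, alphabet_size=4):
    """Chance that a random position starts a match to the motif."""
    p = 1.0
    for allowed in motif_positions:
        p *= len(allowed) / alphabet_size
    return p


motif = ["A", "C", "C", "TG", "T", "A"]                 # ACC[TG]TA
with_extra = motif[:4] + ["ACGT"] + motif[4:]            # any base after the 4th position

p = motif_probability(motif)
print(1 / p)                        # spacing between chance matches, in nt
print(10_000 * p)                   # expected matches in 10 kb
print(motif_probability(with_extra) == p)
```

A position that accepts any base multiplies the probability by 1, so the longer motif is just as likely as the shorter one; allowing either form roughly doubles the expected count.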

E-values Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: the chance occurrences of each query are marked along an example database sequence]

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each:

database   size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values different

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger
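As a cross-check on the reported numbers: the standard BLAST statistics relate the bit score S' shown in the output to the E-value as E = m x n x 2^(-S'), where m is the query length and n the database length. The sketch below plugs in the raw lengths quoted earlier; BLAST itself uses shorter 'effective' lengths, so this is expected to land in the same order of magnitude as the reported Expect values rather than reproduce them.

```python
def evalue_from_bit_score(bit_score, query_length, db_letters):
    """E = m * n * 2**(-S'), using raw (not effective) lengths."""
    return query_length * db_letters * 2.0 ** (-bit_score)


REFSEQ_MRNA_LETTERS = 503_566_580
NR_LETTERS = 14_601_814_750

e_refseq = evalue_from_bit_score(119, 60, REFSEQ_MRNA_LETTERS)
e_nr = evalue_from_bit_score(119, 60, NR_LETTERS)
print(f"{e_refseq:.1e}", f"{e_nr:.1e}", f"{e_nr / e_refseq:.0f}x")
```

The ratio between the two results depends only on the database sizes, which matches the roughly 30-fold difference seen between the two searches.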

E-values Effect of Query Length

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length

BLAST 500 nt sequence against a database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Why not just use identity

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database...

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse  ------------------------------------------------->  smaller/better

E-values  10^-5     10^-10      10^-40       10^-100      0.0
          fantasy   borderline  encouraging  pretty good  can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

residue  can be substituted by
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (about 3 acceptable residues out of 20) rather than 1/20th

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices
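The two protein estimates can be compared side by side. As before this is only the simplified chance model from the slides (independent positions, fixed per-position chance of a 'match'); the 1/7 figure is the rough average quoted above, not a value taken from a BLOSUM or PAM matrix.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532   # total letters quoted for RefSeq proteins


def expected_matches(db_letters, length, chance_per_position):
    """Expected chance matches for a query of the given length."""
    return db_letters * chance_per_position ** length


exact = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 20)
with_substitutions = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 7)
print(f"{exact:.1e}", f"{with_substitutions:.1e}")
```

Moving from a 1/20 to a 1/7 chance per position changes the estimate by roughly nine orders of magnitude over 20 residues, which is why substitutability dominates the discrepancy.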

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in and tick boxes etc, it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
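When many searches need the same settings, the command line can be assembled in a script. The sketch below only builds and prints the command; it uses the legacy blastall options listed above (-p, -i, -d, -e, -W), and the file name query.fa and database name refseq_rna are placeholders, not real resources.

```python
import shlex


def blastall_command(program, query_file, database, evalue=None, word_size=None):
    """Build a blastall argument list from the options described above."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        cmd += ["-e", str(evalue)]       # max expected value
    if word_size is not None:
        cmd += ["-W", str(word_size)]    # word size
    return cmd


# Hypothetical file and database names, for illustration only.
cmd = blastall_command("blastn", "query.fa", "refseq_rna", evalue=100, word_size=7)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run on a machine where blastall and a formatted database are actually installed.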

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home Page / Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx. Paste your sequence into the box

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx, and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page / SPECIAL - Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section / BLASTp and paste the sequence

Run the search with the default values

Now re-run the search, setting the maximum E-value in the box

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section / BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search, setting the following parameters

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Derivative Sequences

mRNA

capture by cloning into cDNA library

3' EST

5' EST

cDNA sequence

EST: single pass sequence from each end of the clone

cDNA: multiple pass sequencing over whole length of the clone

5' 3'

Gene Models

gene model

exons

Sequences and Genes (Accession Numbers and Names)

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

proteins

S43105.1

mRNAs/cDNAs

'similar to Cyclin B1 [mus musculus]'

gene

BT006437.1 'Cyclin B1 isoform 1 [mus musculus]'

X58708.1

NM_111985.3 'CCNB1 Cyclin B1 [mus musculus]'

'Cyclin B1 isoform 2 [mus musculus]'

Gene Symbols Names Etc

Gene Symbol CCNB1

Gene Name cyclin B1 [Homo sapiens]

Description G2/mitotic-specific cyclin B1

Aliases CCNB CYCB1

A Gene-Centric View

Entrez Gene: http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1 protein translated from mRNA

XM_001102567.1 predicted mRNA

XP_001089765.1 predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene Predictions

Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT...AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
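The first rule (ATG to an in-frame STOP) is easy to sketch in code. This is a deliberately minimal scan of the three forward reading frames on a made-up sequence; it ignores introns, the reverse strand and all the statistical evidence mentioned above, so it is nowhere near a real gene predictor.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) of ATG...STOP stretches in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs


seq = "CCATGGCTTCGTAGTT"   # made-up example
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

Positions are 0-based Python slice indices, so seq[start:end] is the open reading frame including its stop codon.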

scan genomic sequence ...

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

candidate spliced sequences

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...)

So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences

exons 1 2 3 4

TheoreticalPredicted Sequences

genome

predicted gene model

exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist - especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs - millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs - tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes - one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So What's in the Databases Now

NCBI, July 2005

DNA:       15,000,000 ESTs
           3,300,000 cDNAs

Proteins:  2,700,000 proteins (nr)
           950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence...

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation. The two copies of Gene A will now evolve independently but will continue to have ~the same function.

They are ORTHOLOGS.

Paralogs

A

Gene duplication through internal genome duplication.

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function.

They are PARALOGS.

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS.

A

A

A

A' A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

A

A

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is little or no evolutionary pressure against them...

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

First pair: 67% DNA identity, 100% protein identity

Second pair: 44% DNA identity, 80% protein identity

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
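The point can be checked numerically with a minimal sketch. It takes the first pair of sequences from the "Protein Similarity Persists Longer" slide, translates both in frame 1 with the standard genetic code, and compares percent identity at the DNA and protein levels. The function names `translate` and `percent_identity` are illustrative choices, not part of any BLAST tool.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of identical positions in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


dna_1 = "CTATCACGAGAACCTGTG"
dna_2 = "CTGTCCCGTGAGCCAGTT"

print(translate(dna_1), translate(dna_2))
print("DNA identity (%):", round(percent_identity(dna_1, dna_2)))
print("protein identity (%):", round(percent_identity(translate(dna_1), translate(dna_2))))
```

Every third base differs between the two DNA sequences, yet both translate to the same six residues, which is the situation the slide describes.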

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...
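For readers who want to see what "identify longest ORF" involves, here is a simplified sketch. It assumes an ORF runs from an ATG to the next in-frame stop codon and searches all six reading frames; the real NCBI ORF Finder has additional options (alternative start codons, minimum length) that are not modelled here. The test sequence is made up for illustration.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna):
    """Yield the protein for each ATG...stop ORF in frame 1 of dna."""
    protein = None
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if protein is None:
            if amino_acid == "M":
                protein = "M"          # start a new ORF at the first ATG
        elif amino_acid == "*":
            yield protein              # stop codon closes the ORF
            protein = None
        else:
            protein += amino_acid


def longest_orf(dna):
    """Return the longest ORF protein over all six reading frames ('' if none)."""
    dna = dna.upper()
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for protein in orfs_in_frame(strand[offset:]):
                if len(protein) > len(best):
                    best = protein
    return best


# Hypothetical test sequence: one short ORF (ATG AAA TTT TAG) in frame 3.
print(longest_orf("CCATGAAATTTTAGGG"))
```

The resulting protein string is what would be pasted into the two-sequence blastp comparison.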

Answers Exercise 1

The Essential Task

experiment
data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A
FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity
orthologs

same function

But we know that function is largely determined by shape.

similar shape

Which in general we cannot determine, but it is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein
database of human proteins

best match human protein

database of frog proteins

x
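The reciprocal best BLAST idea can be sketched as a simple table lookup. The sketch below assumes the two BLAST searches have already been run and reduced to "best hit" dictionaries; the protein identifiers are made up for illustration and `reciprocal_best_hits` is a hypothetical helper name.

```python
def reciprocal_best_hits(frog_to_human, human_to_frog):
    """Return (frog, human) pairs where each protein is the other's best hit."""
    pairs = []
    for frog_protein, human_protein in frog_to_human.items():
        # Keep the pair only if searching back from the human protein
        # returns the frog protein we started from.
        if human_to_frog.get(human_protein) == frog_protein:
            pairs.append((frog_protein, human_protein))
    return pairs


# Hypothetical best-hit tables from the two BLAST searches.
frog_to_human = {"frog_A": "human_A", "frog_B": "human_B", "frog_C": "human_A"}
human_to_frog = {"human_A": "frog_A", "human_B": "frog_X"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_C also hits human_A, but the reverse search prefers frog_A, so only the frog_A/human_A pair is kept as a candidate ortholog pair.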

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| |||||||||||   |    |     | |    | | |     |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
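The trade-off between matches and gap costs can be sketched as a scoring function over an already-gapped alignment. The parameter values below (match +1, mismatch -3, gap opening -5, gap extension -2) are illustrative assumptions rather than the defaults of any particular BLAST version, and `alignment_score` is a hypothetical name. For simplicity a run of gap columns is treated as one gap regardless of which sequence it is in.

```python
def alignment_score(top, bottom, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must be the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open      # pay once when a gap starts
            score += gap_extend        # pay for every gap position
            in_gap = True
        else:
            in_gap = False
            score += match if a == b else mismatch
    return score


# A short illustrative pair: 12 matching bases and one single-base gap.
print(alignment_score("ACGTACAGTACGT", "ACGTAC-GTACGT"))
# With 'cost free' gaps the same alignment scores as if it were perfect.
print(alignment_score("ACGTACAGTACGT", "ACGTAC-GTACGT", gap_open=0, gap_extend=0))
```

With free gaps any two sequences can be made to look perfect, which is why a non-zero gap cost is needed before the score means anything.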

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

program   query sequence   other operation                       database sequences   speed
BLASTn    nucleotide       none                                  nucleotide           FAST
BLASTp    protein          none                                  protein              FAST
BLASTx    nucleotide       6-frame translation of the query      protein              SLOW
tBLASTn   protein          6-frame translation of the database   nucleotide           SLOWER
tBLASTx   nucleotide       6-frame translation of both           nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.
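The word-and-index strategy can be sketched in a few lines. The sketch below uses the three database sequences and the query from the slides with 8-letter words, stores the words themselves as dictionary keys rather than numbering them, and only finds the seed matches; the extension and scoring steps are not shown. The names `build_index` and `find_seeds` are illustrative, and this is not BLAST's actual implementation.

```python
from collections import defaultdict

WORD_SIZE = 8


def words(seq, k=WORD_SIZE):
    """Return (0-based position, word) for every overlapping k-letter word."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(database, k=WORD_SIZE):
    """Map each word to the (sequence number, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_no, seq in enumerate(database, start=1):
        for pos, word in words(seq, k):
            index[word].append((seq_no, pos + 1))
    return index


def find_seeds(query, index, k=WORD_SIZE):
    """Return (query position, sequence number, database position) for each shared word."""
    seeds = []
    for q_pos, word in words(query, k):
        for seq_no, db_pos in index.get(word, []):
            seeds.append((q_pos + 1, seq_no, db_pos))
    return seeds


database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database)      # done once, before any query arrives
seeds = find_seeds(query, index)   # fast dictionary lookups per query word
print(seeds[:3])
```

The index is built once per database, so each query costs only a series of lookups; only sequences with at least one seed need to go on to the slower extension step.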

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
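The risk can be checked directly for the pair of sequences on this slide. The sketch below measures the longest run of identical bases in the alignment as shown (same positions in both sequences, no gaps) and compares it with a word size of 11; `longest_exact_run` is an illustrative helper name.

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of identical letters at matching positions."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


seq_1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"
WORD_SIZE = 11

run = longest_exact_run(seq_1, seq_2)
print("longest exact run:", run)
print("long enough to seed a word match:", run >= WORD_SIZE)
```

The two sequences are identical at 50 of 56 positions, but the mismatches are spaced so that no exact stretch reaches the word size, so a word-based search would find no seed in this alignment.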

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

(Figure: chance occurrences of each query marked along an example database sequence.)

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
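The same first-principles estimate can be written as a one-line function. This is only the simple model from the slide (exact, ungapped matches, all four bases equally likely), not the statistics BLAST actually uses; `expected_chance_matches` is an illustrative name.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact matches to a query of the given length by chance."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, from the slide

for length in (1, 2, 3, 60):
    estimate = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: {estimate:.1e}")
```

Note that 4^60 is about 1.3 x 10^36, so the exact arithmetic for the 60-mer gives roughly 3.8 x 10^-28; the slide's figure of 5.0 x 10^-28 comes from rounding 4^60 down to 10^36.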

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC49019 IMAGE6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC49019 IMAGE6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.

If ACC[TG]TAA is also allowed:

The two motifs independently have the same E-value.
To allow either means we expect twice as many.
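The arithmetic in the answer can be sketched as a small function that multiplies per-position probabilities. It assumes all four bases are equally likely, ignores edge effects and the reverse strand, and understands only plain letters and bracketed alternatives such as [TG]; the names `match_probability` and `expected_hits` are illustrative.

```python
import re


def match_probability(motif, alphabet_size=4):
    """Probability that a random position matches the motif, e.g. 'ACC[TG]TA'."""
    probability = 1.0
    # Each token is either a bracketed set like [TG] or a single letter.
    for token in re.findall(r"\[[^\]]+\]|.", motif):
        allowed = len(token) - 2 if token.startswith("[") else 1
        probability *= allowed / alphabet_size
    return probability


def expected_hits(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of the given length."""
    return sequence_length * match_probability(motif)


print("one site expected every", round(1 / match_probability("ACC[TG]TA")), "nt")
print("expected in 10k:", round(expected_hits("ACC[TG]TA", 10_000), 1))
```

If a second motif with the same per-site probability is also allowed, the expected count is simply the sum of the two, which is the "twice as many" in the answer.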

E-values: Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

(Figure: chance occurrences of each query marked along an example database sequence.)

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30x bigger)

BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database...

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <--------------------------------------------------> smaller/better

E-values   10^-5     10^-10       10^-40        10^-100       0.0
           fantasy   borderline   encouraging   pretty good   can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

residue  substitutes
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
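The two protein estimates can be reproduced with the same simple chance-match model. The sketch treats "number of distinguishable states per position" as a parameter: 20 for exact amino acid matches, and about 7 when each residue can stand in for roughly two others. This is the slide's back-of-envelope model, not the statistics BLAST actually computes, and the function name is illustrative.

```python
def expected_chance_matches(database_letters, query_length, states_per_position):
    """Expected chance matches when each position matches with probability 1/states."""
    return database_letters / states_per_position ** query_length


REFSEQ_PROTEIN_LETTERS = 347_895_532  # total letters, from the BLAST output above

exact = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matching:          {exact:.1e}")
print(f"allowing substitutions:  {with_substitutions:.1e}")
```

Allowing substitutions raises the estimate by about nine orders of magnitude, which accounts for most of the gap between the naive calculation and the reported Expect value.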

These substitutabilities are dealt with by the BLOSUM and PAM matrices.

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
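When many searches need to be run, the command line can be assembled in a script. The sketch below only builds the argument list, using flags that appear in the list above (-p, -i, -d, -e, -W); the file and database names are hypothetical, and `build_blastall_command` is an illustrative helper, not part of BLAST. Running the command would additionally require a local blastall installation and a pre-indexed database.

```python
def build_blastall_command(program, query_file, database, evalue=None, word_size=None):
    """Return the blastall argument list; options left as None use BLAST's defaults."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        command += ["-e", str(evalue)]      # max expected value
    if word_size is not None:
        command += ["-W", str(word_size)]   # word size
    return command


# Hypothetical file and database names, for illustration only.
command = build_blastall_command("blastn", "query.fa", "refseq_mrna", evalue=100, word_size=7)
print(" ".join(command))
```

Keeping the command as a list makes it straightforward to hand to a process launcher later, and leaving unset options out means the flavour-specific defaults still apply.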

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html.
Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises 5: E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises 6: Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

- Low complexity: OFF
- Expect: 100
- Word Size: 7
- Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Gene Models

gene model

exons

Sequences and Genes (Accession Numbers and Names)

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

proteins

S43105.1

mRNAs/cDNAs

'similar to Cyclin B1 [Mus musculus]'

gene

BT006437.1 'Cyclin B1 isoform 1 [Mus musculus]'

X58708.1

NM_111985.3 'CCNB1 Cyclin B1 [Mus musculus]'

'Cyclin B1 isoform 2 [Mus musculus]'

Gene Symbols, Names, Etc.

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases: CCNB, CYCB1

A Gene-Centric View

Entrez Gene: http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1: protein translated from mRNA

XM_001102567.1: predicted mRNA

XP_001089765.1: predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites
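The relationship between the genome sequence, the exons and the GT…AG introns on this slide can be checked with a short sketch. The genome and exon strings are copied from the slide; the function name `find_introns` is only illustrative, and real splice-site prediction is far more involved than this string lookup.

```python
# Genome and exons copied from the slide; find_introns is a hypothetical helper.
GENOME = (
    "CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAG"
    "CATTTTATACTCATGCAACGGACCGTGTCAGTATTACAG"
    "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA"
)
EXONS = [
    "CTACCATCCATGCTAACCATTCTAC",
    "CATTTTATACTCATGCAACGGACCGT",
    "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA",
]


def find_introns(genome, exons):
    """Locate each exon in order and return the genomic text between them."""
    introns = []
    cursor = 0
    for n, exon in enumerate(exons):
        start = genome.find(exon, cursor)
        if start == -1:
            raise ValueError(f"exon {n + 1} not found in genome")
        if n > 0:
            introns.append(genome[cursor:start])
        cursor = start + len(exon)
    return introns


introns = find_introns(GENOME, EXONS)
mrna = "".join(EXONS)  # the spliced transcript is just the exons joined

for intron in introns:
    # Each intron should start with the GT donor and end with the AG acceptor.
    print(intron[:5], "...", intron[-6:], len(intron), "nt")
```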

Gene Predictions
Given:
- coding sequence must run from ATG to a STOP codon, in-frame
- introns (GT…AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence hellip

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model
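The first rule above (coding sequence runs from ATG to an in-frame STOP) can be sketched as a simple scan of the three forward reading frames. This is only an illustration under simplifying assumptions: it ignores introns, the reverse strand and the statistical scoring that real gene predictors use. The function name and the `min_codons` parameter are hypothetical; the test sequence is the one on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for ATG...STOP runs in the 3 forward frames.

    Indices are 0-based and end is exclusive, so seq[start:end] is the ORF
    including its stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it at the first in-frame stop
    return sorted(orfs)


SLIDE_SEQ = (
    "CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
    "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA"
)

for start, end in find_orfs(SLIDE_SEQ):
    print(start, end, SLIDE_SEQ[start:end])
```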

Supporting Evidence

EST evidence

genome

gene model

We note that, in the absence of EST evidence, it is only really possible to predict coding sequence with any confidence (and even then…)

So predicted genes based on computational gene models alone will usually lack UTRs, which has some important consequences

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model

exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So Whatrsquos in the Databases Now

NCBI, July 2005

DNA

15,000,000 ESTs

3,300,000 cDNAs

Proteins

2,700,000 proteins (nr)

950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed to produce fertile offspring, then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin

Computers Can Detect Homology

In fact, computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation

The two copies of Gene A will now evolve independently but will continue to have approximately the same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A Arsquo

A

'Other'-logs
What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS

A

A

A

Arsquo A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor

A

A

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

alanine (A): GCA, GCC, GCG, GCT

cysteine (C): TGC, TGT

aspartate (D): GAC, GAT

glycine (G): GGA, GGC, GGG, GGT

'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is little evolutionary pressure against them…

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

DNA identity 67%, protein identity 100%

DNA identity 44%, protein identity 80%

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
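As a rough check of this point, the two sequences from the slide can be translated and compared at both levels. The sketch below builds the standard genetic code from the conventional TCAG ordering; the helper names `translate` and `count_matches` are illustrative only, and the comparison is a plain ungapped position-by-position count rather than a BLAST alignment.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate frame 1 of a DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_matches(a, b):
    """Count identical positions in two equal-length, ungapped sequences."""
    return sum(x == y for x, y in zip(a, b))


DNA1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
DNA2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
PROT1, PROT2 = translate(DNA1), translate(DNA2)

dna_matches = count_matches(DNA1, DNA2)
prot_matches = count_matches(PROT1, PROT2)
print(f"DNA:     {dna_matches}/{len(DNA1)} identical")
print(f"Protein: {prot_matches}/{len(PROT1)} identical")
```

Most of the DNA differences fall in third codon positions, which is why the protein-level identity comes out higher than the DNA-level identity.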

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder: paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment, data mining

gene sequence what is its function

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

function known, annotation 'Gravin' available

Human gene

Xenopus gene: function unknown

sequence similarity: orthologs

same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine, but it is probably SHAPE, not SEQUENCE, that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence

Finding Orthologs
So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in

frog protein

database of human proteins

best match human protein

database of frog proteins

x
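The reciprocal-best-hit logic in the diagram can be sketched as follows, assuming the best BLAST hit for every protein has already been collected in each direction. The function name, the dictionaries and the identifiers in them are made up for illustration; they are not real search results.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return (a, b) pairs where b is a's best hit AND a is b's best hit."""
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Hypothetical best hits, frog proteins searched against human proteins...
frog_to_human = {
    "frog_gravin_like": "human_gravin",
    "frog_cyclin_A1": "human_cyclin_A",
    "frog_cyclin_A2": "human_cyclin_A",
}
# ...and human proteins searched back against frog proteins.
human_to_frog = {
    "human_gravin": "frog_gravin_like",
    "human_cyclin_A": "frog_cyclin_A2",
}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
for frog, human in sorted(pairs):
    print(frog, "<->", human)
```

In this toy data `frog_cyclin_A1` is dropped because the return search from its human best hit lands on a different frog protein, which corresponds to the failing case in the diagram.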

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The breaks between these sections of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as the best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact, we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps
Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
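The trade-off between matches and gap costs can be illustrated with a small global-alignment scorer. This is a generic Needleman-Wunsch-style dynamic programme with a simple linear gap cost, used here only as an illustration (it is not how BLAST itself searches); the scoring values are arbitrary assumptions. It is applied to the two random sequences from the slide.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a linear gap cost."""
    # prev[j] holds the best score for a[:i-1] aligned against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


RANDOM_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
RANDOM_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free = alignment_score(RANDOM_1, RANDOM_2, gap=0)        # 'cost free' gaps
penalised = alignment_score(RANDOM_1, RANDOM_2, gap=-2)  # gaps cost something
print("score with free gaps:     ", free)
print("score with gap penalty -2:", penalised)
```

With free gaps the score is just the number of matches that can be strung together, so unrelated sequences still score well; once gaps carry a cost the score for the same pair drops.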

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program: query sequence, other operation, database sequences (speed)

BLASTn: nucleotide query vs nucleotide database (FAST)

BLASTp: protein query vs protein database (FAST)

BLASTx: nucleotide query, 6-frame translation, vs protein database (SLOW)

tBLASTn: protein query vs nucleotide database, 6-frame translation (SLOWER)

tBLASTx: nucleotide query vs nucleotide database, 6-frame translation of both (HORRIBLY SLOW)

How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

query:
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

1st database sequence:
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

[Figure: the query is slid along the database sequence, testing each offset in turn]

Best alignment, with one gap (=) in the query:

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Database of sequences (above)

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
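The word-and-index idea on these two slides can be sketched in a few lines. This is a simplified illustration only: it stores the words themselves as dictionary keys rather than word numbers, it uses exact word matches, and the function names are hypothetical. The word size of 8 follows the slide's 8-letter words; the sequences are the slide's database sequences and query.

```python
from collections import defaultdict

WORD_SIZE = 8  # the slide's example words are 8 letters long


def build_index(database, word_size):
    """Map every word to the (sequence id, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


def word_hits(query, index, word_size):
    """Return (sequence id, database position, query position) for shared words."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for seq_id, dpos in index.get(word, []):
            hits.append((seq_id, dpos, qpos + 1))
    return hits


DATABASE = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
QUERY = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(DATABASE, WORD_SIZE)  # built once, reused for every query
hits = word_hits(QUERY, index, WORD_SIZE)
print(hits[:3])
```

The index is built once for the whole database, so each query only needs dictionary lookups for its own words; the hits then tell the program which sequences, and which relative offsets, are worth extending into full alignments.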

Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp

In fact, the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
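The arithmetic on this slide is simply the database size divided by the number of equally likely sequences of the query's length. A minimal sketch, assuming equal base frequencies and exact ungapped matches (the function name is illustrative, and this is the slide's first-principles estimate, not the statistic BLAST actually reports):

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Naive expectation: db size / number of possible words of that length."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size used on the slide

for length in (1, 2, 3, 60):
    e = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: expected chance matches = {e:.2g}")
```

Note that 4^60 is about 1.3 x 10^36, so the exact figure for the 60-mer is a little below the slide's rounded 5.0 x 10^-28; the slide rounds 4^60 to 10^36 to keep the sums simple.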

E-values In Practice
So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA, OR ACC[TG]TA with one extra base (any nucleotide) inserted between positions 4 and 5

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10 kb by chance

If the motif with the optional extra base is also allowed:

The two motifs independently have the same E-value
To allow either means we expect twice as many
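The answer above multiplies one probability per motif position. A small sketch of the same calculation, assuming equal base frequencies and treating every start position in the region as a possible site (the function name and motif encoding are illustrative):

```python
def expected_motif_hits(motif_positions, region_length):
    """Expected chance occurrences of a motif in a region of the given length.

    motif_positions lists, for each position, the bases allowed there,
    e.g. "TG" for the [TG] position.
    """
    probability = 1.0
    for allowed in motif_positions:
        probability *= len(allowed) / 4  # chance a random base is acceptable
    return region_length * probability


MOTIF = ["A", "C", "C", "TG", "T", "A"]  # ACC[TG]TA
expected = expected_motif_hits(MOTIF, 10_000)
print(f"ACC[TG]TA in 10 kb: about {expected:.1f} chance sites")
print(f"either of two such motifs: about {2 * expected:.1f}")
```

A position where any base is allowed contributes a factor of 4/4 = 1, which is why the variant with an extra unconstrained base has the same expectation as the original motif.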

E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (it was RefSeq mRNA, with 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger

Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of

2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length

Why not just use identity?
At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what's the real problem?
Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?


BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?

larger/worse … smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.3 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

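Amino Acid Substitutions

Each residue is followed by the residues that can substitute for it (- means none):

A: S
C: -
F: L, W, Y
G: -
I: L, M, V
L: I, M, F, V
M: I, L, V
P: -
V: I, L, M
W: F, Y

N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W

H: N, Q, Y
K: R, Q, E
R: Q, K

D: N, E
E: D, Q, K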

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (about 3 acceptable residues out of 20) rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
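The two back-of-envelope estimates for the 20 amino acid query can be reproduced in a few lines. This is only a sketch of the simplified argument used on these slides (database letters divided by the number of choices per position, raised to the query length), not BLAST's actual statistics; the function name `naive_evalue` is invented for illustration.

```python
DB_LETTERS = 347_895_532  # RefSeq protein letters, as quoted in the text
QUERY_LENGTH = 20         # length of the exact-match query

def naive_evalue(db_letters, choices_per_position, length):
    """Expected chance matches under the simplified first-principles argument."""
    return db_letters / choices_per_position ** length

strict = naive_evalue(DB_LETTERS, 20, QUERY_LENGTH)   # only identical residues count
relaxed = naive_evalue(DB_LETTERS, 7, QUERY_LENGTH)   # ~3 of 20 residues acceptable

print(f"1/20 chance per position: {strict:.1e}")
print(f"1/7 chance per position:  {relaxed:.1e}")
```

Allowing substitutions moves the estimate by many orders of magnitude towards the Expect value that BLAST reports.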

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
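The command line above maps naturally onto an argument list. As a minimal sketch (it only assembles the command and does not run BLAST), the helper below uses just the options named on these slides. The function name `build_blastall_command`, the query file `query.fa` and the database name `refseq_rna` are placeholders invented for this example.

```python
import shlex

def build_blastall_command(program, query_file, database, extra_options=None):
    """Assemble a blastall argument list from the three basic options.

    extra_options is an optional dict such as {"-e": 100, "-W": 7}.
    """
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command

# Hypothetical file and database names, for illustration only
command = build_blastall_command("blastn", "query.fa", "refseq_rna", {"-W": 7})
print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later handed to something like `subprocess.run`.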

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity OFF
Expect 100
Word Size 7
Other advanced -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Sequences and Genes (Accession Numbers and Names)

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

proteins

S43105.1

mRNAs/cDNAs

'similar to Cyclin B1 [Mus musculus]'

gene

BT006437.1 'Cyclin B1 isoform 1 [Mus musculus]'

X58708.1

NM_111985.3 'CCNB1 Cyclin B1 [Mus musculus]'

'Cyclin B1 isoform 2 [Mus musculus]'

Gene Symbols, Names, Etc.

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases: CCNB, CYCB1

A Gene-Centric View

Entrez Gene: http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1: protein translated from mRNA

XM_001102567.1: predicted mRNA

XP_001089765.1: predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene Predictions

Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT ... AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
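The first rule above (coding sequence runs from ATG to an in-frame STOP codon) is easy to sketch in code. The snippet below scans the three forward reading frames of the example genomic sequence from this slide for such stretches. It is an illustration only: it ignores introns, the reverse strand and the statistical evidence a real gene predictor would use, and `find_orfs` is a name invented here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=2):
    """Return (start, end, orf) for ATG...STOP stretches in the 3 forward frames.

    start and end are 0-based, end-exclusive; the ORF includes the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame ATG opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, sequence[start:i + 3]))
                start = None                   # look for the next ATG in this frame
    return orfs

# Genomic example sequence from the slide
GENOMIC = (
    "CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
    "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA"
)

for start, end, orf in find_orfs(GENOMIC):
    print(start, end, orf)
```

On unspliced genomic sequence like this the ORFs found are short, which is one reason predictors must also model GT...AG introns.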

scan genomic sequence ...

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

candidate models (a GT ... AG intron removed)

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...)

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model

exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.

predicted transcript

predicted protein

Sequences for a model organism

ESTs - millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs - tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes - one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So Whatrsquos in the Databases Now

NCBI, July 2005

DNA
15,000,000 ESTs
3,300,000 cDNAs

Proteins
2,700,000 proteins (nr)
950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence...

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A

ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation

The two copies of Gene A will now evolve independently, but will continue to have the ~same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function

They are PARALOGS

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

A' A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor

A

A

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this...

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67% 100%

44% 80%

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
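The comparison on this slide can be checked by translating both DNA sequences with the standard genetic code and measuring percent identity at each level. This is a minimal sketch; the helper names `translate` and `percent_identity` are invented here, and the identity measure simply counts matching positions in two equal-length, ungapped sequences.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA sequence in frame 1, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def percent_identity(a, b):
    """Percentage of identical positions in two equal-length sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(protein_1, protein_2)
print(f"DNA identity:     {percent_identity(dna_1, dna_2):.0f}%")
print(f"protein identity: {percent_identity(protein_1, protein_2):.0f}%")
```

Most of the DNA differences fall in third codon positions, so they vanish on translation and the protein identity comes out higher than the DNA identity.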

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison:

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment, data mining

gene sequence what is its function

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity: orthologs

same function

similar shape

But we know that function is largely determined by shape, which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest approach is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein

database of human proteins

best match human protein

database of frog proteins

x
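The Reciprocal Best BLAST idea can be sketched without running BLAST at all, given the best hit of every protein in each direction. The tables below are hypothetical stand-ins for parsed BLAST results, and `reciprocal_best_hits` is a name made up for this illustration; in practice the tables would come from two all-against-all searches.

```python
def reciprocal_best_hits(frog_to_human, human_to_frog):
    """Return (frog, human) pairs that are each other's best hit."""
    return [
        (frog, human)
        for frog, human in frog_to_human.items()
        if human_to_frog.get(human) == frog
    ]

# Hypothetical best-hit tables (protein names invented for illustration)
frog_to_human = {"frog_cyclinB1": "human_CCNB1", "frog_x": "human_y"}
human_to_frog = {"human_CCNB1": "frog_cyclinB1", "human_y": "frog_z"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_x is not reported: its best human match points back to a different frog protein, so the pair fails the reciprocal test.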

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
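The effect of gap costs can be seen with a small global alignment scorer. The sketch below uses the standard dynamic programming approach (Needleman-Wunsch style scoring with a simple linear gap cost) as a stand-in; it is not BLAST's algorithm, and the scoring values are arbitrary choices for illustration. It scores the two random sequences from this slide twice: once with free gaps and free mismatches, and once with penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score for a and b with a linear gap cost."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two letters, gap in b, gap in a
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

seq_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free = global_alignment_score(seq_1, seq_2, mismatch=0, gap=0)  # 'cost free' gaps
costly = global_alignment_score(seq_1, seq_2)                   # gaps and mismatches penalised

print("score with cost-free gaps:", free)
print("score with gap penalties: ", costly)
```

With nothing to pay for gaps, unrelated sequences pick up a high score from scattered matches; once gaps cost something, the score falls to a level that better reflects the lack of real similarity.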

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program   query sequence                               database sequences                           speed
BLASTn    nucleotide                                   nucleotide                                   FAST
BLASTp    protein                                      protein                                      FAST
BLASTx    nucleotide (6 frame translation)             protein                                      SLOW
tBLASTn   protein                                      nucleotide (6 frame translation)             SLOWER
tBLASTx   nucleotide (6 frame translation)             nucleotide (6 frame translation)             HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT
Example protein query: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
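The indexing and query analysis steps can be sketched as a dictionary from each word to the places it occurs in the database. This is a simplified illustration, not BLAST's actual data structure: it stores the 8-letter words themselves rather than word numbers, reports 1-based positions as on the slides, and the function names `build_index` and `find_seeds` are invented here.

```python
from collections import defaultdict

WORD_SIZE = 8  # word length used in the slide example

DATABASE = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
QUERY = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

def build_index(database, word_size=WORD_SIZE):
    """Map every word to a list of (sequence number, position), both 1-based."""
    index = defaultdict(list)
    for seq_number, sequence in enumerate(database, start=1):
        for start in range(len(sequence) - word_size + 1):
            index[sequence[start:start + word_size]].append((seq_number, start + 1))
    return index

def find_seeds(query, index, word_size=WORD_SIZE):
    """Return (query position, sequence number, database position) for shared words."""
    seeds = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for seq_number, db_position in index.get(word, []):
            seeds.append((start + 1, seq_number, db_position))
    return seeds

index = build_index(DATABASE)   # done once, ahead of any search
seeds = find_seeds(QUERY, index)
print(seeds[:3])
```

The index is built once per database, so each query only needs one dictionary lookup per word instead of a scan through every database sequence.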

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
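The reason this alignment could be missed is that no stretch of identical letters in it is as long as the word size. A short check of the alignment as drawn (position by position, no gaps) makes this concrete; `longest_exact_run` is a helper name invented for this sketch.

```python
def longest_exact_run(a, b):
    """Length of the longest run of identical letters at the same positions."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest

seq_1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

WORD_SIZE = 11  # BLASTn default quoted on the slide
run = longest_exact_run(seq_1, seq_2)
identity = 100.0 * sum(x == y for x, y in zip(seq_1, seq_2)) / len(seq_1)

print(f"identity: {identity:.0f}%, longest exact run: {run}, word size: {WORD_SIZE}")
```

Despite the high overall identity, the longest exact run in this alignment is shorter than 11, so it contains no seed word; reducing the word size would let it be found.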

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
(In this stretch 'A' occurs 14 times, 'AC' 4 times and 'ACG' once.)

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
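The same back-of-envelope calculation can be sketched in a few lines of Python. This is only an illustration of the first-principles argument, assuming equal and independent base frequencies and exact, ungapped matches; the function name `expected_exact_matches` is made up for this sketch and is not how BLAST itself computes E-values. Note that 4^60 is really ~1.3 x 10^36, so the exact figure comes out a little below the rounded 5.0 x 10^-28 used on the slide.

```python
def expected_exact_matches(db_letters: float, query_len: int, alphabet_size: int = 4) -> float:
    """Expected number of chance exact matches of a query in a random database.

    Assumes every letter is equally likely and independent, and ignores
    edge effects, gaps and mismatches (the simplifications used above).
    """
    return db_letters / (alphabet_size ** query_len)


REFSEQ_MRNA_LETTERS = 5.0e8  # 503,566,580 letters, rounded as in the text

for query in ("A", "AC", "ACG"):
    e = expected_exact_matches(REFSEQ_MRNA_LETTERS, len(query))
    print(f"Query = {query!r}: expected matches ~ {e:.1e}")

# A 60-mer: each extra base divides the expectation by 4
e60 = expected_exact_matches(REFSEQ_MRNA_LETTERS, 60)
print(f"60-mer: E-value ~ {e60:.1e}")
```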

E-values In Practice
So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. ACC[TG]TA OR ACC[TG]NTA (N = any base)?

E-value Exercise Answer
ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value. To allow either means we expect twice as many.
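The exercise answer can be checked with a small Python sketch. It assumes equal, independent base frequencies (a simplification, as real promoters are not random), and the helper name `expected_motif_hits` is invented for this illustration. Each motif position is described by how many of the four bases are allowed there.

```python
def expected_motif_hits(motif_choices, sequence_length):
    """Expected chance occurrences of a motif in random sequence.

    motif_choices lists, per motif position, how many of the 4 bases are
    allowed there (1 for a fixed base, 2 for [TG], 4 for 'any base').
    """
    probability = 1.0
    for allowed in motif_choices:
        probability *= allowed / 4
    return sequence_length * probability


# ACC[TG]TA: five fixed positions and one two-way choice
core = expected_motif_hits([1, 1, 1, 2, 1, 1], 10_000)

# Variant with one extra unconstrained base between positions 4 and 5
with_extra = expected_motif_hits([1, 1, 1, 2, 4, 1, 1], 10_000)

print(f"core motif: ~{core:.1f} per 10k")
print(f"either form allowed: ~{core + with_extra:.1f} per 10k")
```

An unconstrained position multiplies the probability by 4/4 = 1, which is why the longer variant has the same expectation as the core motif and allowing both doubles the total.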

E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq mRNA with 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
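As a quick sanity check, the ratio of the two database sizes can be compared with the ratio of the two E-values that BLAST actually reported for the same 60-mer (2e-26 against RefSeq mRNA, 6e-25 against nr). The numbers below are the ones quoted in this workshop; the variable names are just for this sketch.

```python
refseq_letters = 5.0e8   # RefSeq mRNA, rounded
nr_letters = 1.4e10      # nr, rounded

refseq_evalue = 2e-26    # Expect reported by BLAST against RefSeq mRNA
nr_evalue = 6e-25        # Expect reported by BLAST against nr

size_ratio = nr_letters / refseq_letters
evalue_ratio = nr_evalue / refseq_evalue

print(f"database size ratio ~ {size_ratio:.0f}x")
print(f"reported E-value ratio ~ {evalue_ratio:.0f}x")
```

Both ratios come out close to 30, which is what we would expect if the E-value scales linearly with the number of letters searched.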

Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?
At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~5.0 x 10^-19

And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
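To see why identity alone is not enough, here is a minimal Python sketch that computes percent identity for the two alignments above. The function `percent_identity` is written for this illustration only; it treats the two strings as already aligned, column by column.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns where both sequences carry the same letter."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Both alignments score the same percent identity, yet one is 60 nt long and the other only 16 nt. The identity figure carries no information about length, which is exactly what the E-value adds.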

So what's the real problem?
Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

[Figure: BLAST (similarity + probability) versus ORTHOLOGY; factors shown: biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database...]

Are there any useful guidelines, though, at least for biological meaningfulness?

Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?

[Scale, from larger/worse to smaller/better. E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0. Labels along the scale: fantasy, borderline, encouraging, pretty good, can't get better.]

But note that in some gene families with closely related members, you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query 3    SSSSFRAYRAALSEVEPPCI 62
           SSSSFRAYRAALSEVEPPCI
Sbjct 972  SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

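Residue: common substitutes
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K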

In fact, we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average, any residue can be substituted for by about 2 others (so about 3 acceptable residues out of 20), so each position has about a 1/7 chance of 'matching' rather than 1/20.

So now we get:
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
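The two protein estimates can be reproduced with a short Python sketch. It uses the database size quoted above and treats substitutability crudely as an 'effective alphabet' of 7 instead of 20, which is the approximation made in the text and not how BLAST scores substitutions. The function name `chance_matches` is hypothetical.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted in the text


def chance_matches(db_letters, query_len, effective_alphabet):
    """Expected chance matches if each position matches with probability 1/effective_alphabet."""
    return db_letters / effective_alphabet ** query_len


# Exact matching only: 1/20 chance per position
exact = chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)

# Allowing ~2 substitutes per residue: roughly 1/7 chance per position
with_substitutions = chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matching:       ~{exact:.1e}")
print(f"with substitutions:   ~{with_substitutions:.1e}")
print("BLAST reported:        8e-04")
```

The slide rounds 20^20 to 10^26, so its exact-match figure is marginally higher than the one printed here. Either way, the substitution-aware estimate moves many orders of magnitude towards the reported value, and allowing gaps would push it further in the same direction.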

These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries, or pick from an organism list, to modify your search.

The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
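If you want to script such searches, the command line can be assembled programmatically. The sketch below only builds the argument list for the legacy `blastall` program using the -p, -i, -d, -e, -W and -F options listed above; it does not run anything. The function name `build_blastall_command`, the file name `query.fasta` and the database name `refseq_rna` are placeholders invented for this example.

```python
import shlex


def build_blastall_command(program, query_file, database,
                           evalue=None, word_size=None, low_complexity_filter=None):
    """Assemble a legacy blastall command line as a list of arguments."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        cmd += ["-e", str(evalue)]          # maximum expected value to report
    if word_size is not None:
        cmd += ["-W", str(word_size)]       # word size
    if low_complexity_filter is not None:
        cmd += ["-F", "T" if low_complexity_filter else "F"]  # low complexity filter
    return cmd


# Hypothetical file and database names, purely for illustration
cmd = build_blastall_command("blastn", "query.fasta", "refseq_rna",
                             evalue=100, word_size=7, low_complexity_filter=False)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd)` on a machine where blastall and a formatted database are installed; any option left as `None` simply falls back to the program's default.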

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL section, Align two sequences.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Gene Symbols Names Etc

Gene Symbol: CCNB1

Gene Name: cyclin B1 [Homo sapiens]

Description: G2/mitotic-specific cyclin B1

Aliases: CCNB, CYCB1

A Gene-Centric View

Entrez Gene: http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1: protein translated from mRNA

XM_001102567.1: predicted mRNA

XP_001089765.1: predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model: exon, intron, exon, intron, exon

genome:
CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

exons:
CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

mRNA:
CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

splice sites: GTAAG (donor), TTTCAG (acceptor)
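The splicing step can be written out as a small Python sketch. The exon and intron strings are cut from the genomic sequence on this slide; the only rule checked is the simple one used here, that each intron starts with GT and ends with AG. Real splice site recognition is considerably more involved.

```python
# Exons and introns cut from the genomic sequence on the slide, in order
segments = [
    ("exon", "CTACCATCCATGCTAACCATTCTAC"),
    ("intron", "GTAAGTCATCTATATCAATATTATTTCAG"),
    ("exon", "CATTTTATACTCATGCAACGGACCGT"),
    ("intron", "GTCAGTATTACAG"),
    ("exon", "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA"),
]

# The genome is every segment joined; the mRNA keeps only the exons
genome = "".join(seq for _, seq in segments)
mrna = "".join(seq for kind, seq in segments if kind == "exon")

# Simplified splice signal check: introns begin with GT and end with AG
introns_ok = all(
    seq.startswith("GT") and seq.endswith("AG")
    for kind, seq in segments if kind == "intron"
)

print(len(genome), len(mrna), introns_ok)
```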

Gene Predictions
Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT ... AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
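The first rule (an in-frame run from ATG to a STOP codon) is easy to sketch in Python. This toy `find_orfs` function is written for illustration only: it scans the forward strand, ignores introns and the statistical evidence mentioned above, and reports every ATG that reaches an in-frame stop. Real gene predictors combine all of these signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for each ATG followed by an in-frame stop codon.

    Coordinates are 0-based, end-exclusive; the stop codon is included.
    """
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the ATG until the first stop
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                break
    return orfs


genomic = ("CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
           "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA")

for start, end, orf in find_orfs(genomic):
    print(start, end, orf)
```

Run on the genomic sequence from the slide, the unspliced reading from the first ATG hits a stop codon almost immediately, which is one reason a gene model has to consider splicing out GT ... AG introns.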

scan genomic sequence ...

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

candidate spliced sequences:

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

[Figure: most likely gene model]

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...).

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model
exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.

predicted transcript

predicted protein

Sequences for a model organism

ESTs - millions, £10 each
Cheap to sequence, so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels

cDNAs - tens of thousands, £1000 each
Expensive, but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences

Genomes - one, £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation

So What's in the Databases Now

[Figure: database contents, NCBI July 2005. DNA: 15,000,000 ESTs; 3,300,000 cDNAs. Proteins: 2,700,000 proteins; 950,000 proteins. Column labels: nr, RefSeq.]

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence...
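A short Python sketch makes the bookkeeping explicit: count the differences between each sequence and the next, and between the first and last. The sequences are the nine shown above; the 100,000 year interval is the figure imagined in the text, not a measured rate.

```python
lineage = [
    "ATGAAGGCTGCCTACGACTGCCGTG",
    "ATGCAGGCTGCCTACGACTGCCGTG",
    "ATGCAGGCTGCCAACGACTGCCGTG",
    "ATGCATGCTGCCAACGACTGCCGTG",
    "ATGCATGCTGCCAACGACTGCCCTG",
    "ATGCATGCTGCCAACGGCTGCCCTG",
    "ATGCATGCTGCCAACGGATGCCCTG",
    "ATGCATGCCGCCAACGGATGCCCTG",
    "ATGCATGCCGCCAACGGATGTCCTG",
]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Differences between each sequence and its successor
steps = [hamming(a, b) for a, b in zip(lineage, lineage[1:])]

# Accumulated divergence between the oldest and the newest sequence
total = hamming(lineage[0], lineage[-1])

YEARS_PER_FIXED_MUTATION = 100_000  # the rate imagined in the text
print(steps, total, total * YEARS_PER_FIXED_MUTATION, "years")
```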

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A: ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed to produce fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

[Figure: Gene A in an ancestor species and in the two species descended from it]

Gene duplication through speciation. The two copies of Gene A will now evolve independently, but will continue to have the ~same function.

They are ORTHOLOGS

Paralogs

Gene duplication through internal genome duplication.

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS

[Figure: Gene A duplicated within one genome, giving A and A']

'Other'-logs
What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

[Figure: gene A in the ancestor and the green frog; A and A' in the orange frog]

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

[Figure labels: gene A in each species; example: cyclin b1]

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA GCC GCG GCT: A (alanine)

TGC TGT: C (cysteine)

GAC GAT: D (aspartate)

GGA GGC GGG GGT: G (glycine)

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
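The redundancy is easy to see by grouping codons by the amino acid they encode. The Python sketch below builds the standard genetic code from the usual compact 64-letter string (bases in T, C, A, G order) and lists the synonymous codons for the four amino acids shown above. It covers the standard code only; the variable names are just for this example.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode
codons_for = defaultdict(list)
for codon, aa in sorted(CODON_TABLE.items()):
    codons_for[aa].append(codon)

for aa in "ACDG":
    print(aa, codons_for[aa])
```

For each of these amino acids the synonymous codons differ only in the third position, which is why a third-position change is so often silent.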

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

DNA:
CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT
67% identity

Protein:
LSREPV
||||||
LSREPV
100% identity

DNA:
CTATCACGAGAACCTGTG
TTGTCCCGGTCGCCAGTT
~44% identity

Protein:
LSREPV
||| ||
LSRFPV
~80% identity

Always Compare Protein Sequences

DNA comparison:
ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:
MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
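The comparison on this slide can be reproduced with a few lines of Python: translate both DNA sequences with the standard genetic code, then count identical positions at each level. This is a sketch for ungapped, equal-length sequences only, and the helper names `translate` and `count_matches` are invented for the example.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1 using the standard genetic code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_matches(a, b):
    """Number of identical positions in two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b))


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

dna_identity = 100 * count_matches(dna_1, dna_2) / len(dna_1)
protein_identity = 100 * count_matches(protein_1, protein_2) / len(protein_1)

print(protein_1, protein_2)
print(f"DNA identity {dna_identity:.0f}%, protein identity {protein_identity:.0f}%")
```

The protein identity comes out clearly higher than the DNA identity, and the one conservative change (M to I, shown as '+' above) would count as a positive in a BLAST protein alignment as well.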

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid protein sequences, then repeat the two sequences comparison.

Go to NCBI ORF Finder, paste sequence, hit OrfFind, identify longest ORF, click on it, next screen hit Accept, change View to Fasta protein, hit View, copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task
experiment / data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-AFoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity, orthologs, same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs
So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
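The reciprocal best hit idea can be sketched in Python once the two BLAST searches have been run and reduced to 'best hit' tables. Everything below is a toy: the gene names are made up, and the dictionaries stand in for the best matches you would extract from real BLAST output in each direction.

```python
def reciprocal_best_hits(best_in_b, best_in_a):
    """Pairs (a, b) where b is a's best hit in B and a is b's best hit in A."""
    return {a: b for a, b in best_in_b.items() if best_in_a.get(b) == a}


# Hypothetical best hits: frog protein -> best matching human protein
frog_to_human = {
    "frog_gravin": "human_gravin",
    "frog_x": "human_y",
}

# Hypothetical best hits: human protein -> best matching frog protein
human_to_frog = {
    "human_gravin": "frog_gravin",
    "human_y": "frog_z",   # points somewhere else, so frog_x fails the test
}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Only pairs that pick each other in both directions survive. If either proteome is incomplete, the true partner may simply be missing from one table, which is the limitation noted above.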

frog protein

database of human proteins

best match human protein

database of frog proteins

x

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop

The Downside of Gaps
Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | | | | | ||| |||| ||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this, we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
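The trade-off between matches and gap costs can be made concrete with a small scoring function. This is a sketch only: the default values (match +1, mismatch -3, gap open 5, gap extend 2) are illustrative choices, a gap of length k is charged as open + k x extend, and consecutive gap columns are treated as one gap regardless of which sequence they are in.

```python
def score_alignment(a, b, match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Score two already-aligned strings ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # First column of a gap pays open + extend, later columns only extend
            score -= gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


gapped = score_alignment("ACGTACAGTACGT", "ACGTAC-GTACGT")
free_gaps = score_alignment("ACGTACAGTACGT", "ACGTAC-GTACGT", gap_open=0, gap_extend=0)
ungapped = score_alignment("ACGTACAGTACG", "ACGTACGTACGT")

print(gapped, free_gaps, ungapped)
```

With a gap cost, the single well-placed gap still beats the ungapped alignment, but it no longer scores as if it were a perfect match. With cost-free gaps there is nothing to stop an alignment program from inserting gaps until almost any two sequences look identical.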

BLAST

gtqueryAGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

Program   Query sequence   Database sequences   Other operation                    Speed
BLASTn    nucleotide       nucleotide           none                               FAST
BLASTp    protein          protein              none                               FAST
BLASTx    nucleotide       protein              6 frame translation of query       SLOW
tBLASTn   protein          nucleotide           6 frame translation of database    SLOWER
tBLASTx   nucleotide       nucleotide           6 frame translation of both        HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker.

The highest scoring alignments are reported
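The word-and-index idea can be sketched in a few lines of Python. This is a simplified illustration, not BLAST's actual implementation: it stores the words themselves as dictionary keys instead of word numbers, and the word size of 8 is taken from the 8-letter words on the slide.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map each word to the (sequence id, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for start in range(len(seq) - word_size + 1):
            index[seq[start:start + word_size]].append((seq_id, start + 1))
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence id, database position) for every shared word."""
    seeds = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for seq_id, db_pos in index.get(word, []):
            seeds.append((start + 1, seq_id, db_pos))
    return seeds


# The three database sequences and the query from the slides.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, 8)   # done once, before any search
seeds = find_seeds(query, index, 8)     # a quick lookup per query word
print(seeds[0])
```

Each seed says where a query word sits in a database sequence; only those sequences need the slower expansion step.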

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
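A quick way to check whether a pair like this could seed a word match is to measure the longest run of identical positions. The sketch below assumes the two sequences are compared in register with no gaps, as drawn on the slide; `longest_exact_run` is a helper name made up for this example.

```python
def longest_exact_run(seq_a, seq_b):
    """Length of the longest run of identical positions in an ungapped, in-register comparison."""
    best = current = 0
    for a, b in zip(seq_a, seq_b):
        current = current + 1 if a == b else 0
        best = max(best, current)
    return best


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_exact_run(top, bottom)
word_size = 11
print(run, "- can seed a word match:", run >= word_size)
```

If the longest run is shorter than the word size, no word in the index is shared at that offset, so the alignment is never expanded even though the overall identity is high.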

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (matches at every A in the sequence above):
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC':
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG':
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer:
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
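The same back-of-envelope calculation as a short Python sketch. It assumes, as the slides do, equal base frequencies, exact ungapped matches only, and no edge effects; `expected_chance_matches` is a name invented for this example.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches of a query in a random database."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # ~5.0 x 10^8 letters in RefSeq mRNA (July 2005)

for length in (1, 2, 3, 60):
    print(length, f"{expected_chance_matches(refseq_letters, length):.1e}")
```

The slide rounds 4^60 to 10^36; the exact value is about 1.3 x 10^36, so the 60-mer figure printed here comes out slightly below 5.0 x 10^-28.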

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value.
To allow either means we expect twice as many.
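The answer can be checked with a few lines of Python. The sketch assumes equal base frequencies and ignores edge effects at the ends of the sequence; `expected_motif_hits` is a helper name made up for this example.

```python
def expected_motif_hits(allowed_bases_per_position, sequence_length):
    """Expected chance occurrences of a motif in random sequence.

    allowed_bases_per_position: how many of the 4 bases are accepted
    at each motif position (1 for a fixed base, 2 for [TG], ...).
    """
    probability = 1.0
    for allowed in allowed_bases_per_position:
        probability *= allowed / 4
    return sequence_length * probability


# ACC[TG]TA: every position is fixed except the [TG] position.
hits = expected_motif_hits([1, 1, 1, 2, 1, 1], 10_000)
print(round(hits, 2))

# Allowing a second, equally likely motif doubles the expectation.
print(round(2 * hits, 2))
```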

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A':
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC':
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG':
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer:
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

[Diagram: BLAST gives similarity + probability (match length, PI, size of database...); ORTHOLOGY also involves biological knowledge, the nature of the query sequence and the phylogenetic relationship]

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse  <---------------------------------------------------->  smaller/better

E-values:  10^-5     10^-10      10^-40       10^-100      0.0
           fantasy   borderline  encouraging  pretty good  can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue  Can be substituted by
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices
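To make 'Identities' versus 'Positives' concrete, here is a small Python sketch that uses the substitution groups from the Amino Acid Substitutions slide as a simple yes/no lookup. This is a deliberate simplification: as noted above, BLAST itself scores substitutions with BLOSUM or PAM matrices, not with a yes/no table, and the function names are made up for this example.

```python
# Substitution groups as read from the Amino Acid Substitutions slide.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}


def is_positive(a, b):
    """True if the residues are identical or listed as substitutable."""
    return a == b or b in SUBSTITUTES.get(a, "")


def count_identities_and_positives(row_a, row_b):
    """Count identical and 'positive' positions in two aligned, ungapped rows."""
    identities = sum(a == b for a, b in zip(row_a, row_b))
    positives = sum(is_positive(a, b) for a, b in zip(row_a, row_b))
    return identities, positives


# Average number of substitutes per residue ("about 2 others").
average_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)
print(average_substitutes)

print(count_identities_and_positives("MNAAYDCRARMLR", "MKAAYDCRARILR"))
```

With roughly 2 substitutes plus the residue itself, about 3 of the 20 amino acids 'match' at each position, which is where the ~1/7th figure comes from.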

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
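When the command is built from a script, it helps to assemble the options as a list. The Python sketch below only builds and prints the command; it does not run BLAST. It uses the -p, -i, -d and -e options named in this workshop; the file name query.fasta and the database name nr are placeholders for illustration. Note that blastall is the legacy NCBI command; the newer BLAST+ suite uses separate programs with different option names.

```python
import shlex


def blastall_command(program, query_file, database, max_expect=None):
    """Build a legacy 'blastall' command line as a list of arguments."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_expect is not None:
        args += ["-e", str(max_expect)]  # -e: max expected value to report
    return args


# Hypothetical query file and database name, for illustration only.
cmd = blastall_command("blastx", "query.fasta", "nr", max_expect=1e-5)
print(shlex.join(cmd))
```

Any option left out of the list falls back to the default for the chosen blast flavour.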

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


A Gene-Centric View

Entrez Gene: http://www.ncbi.nlm.nih.gov

Cyclin B1

S43105.1

BT006437.1

X58708.1

NM_111985.3

AAB22970.1

AAP21245.1

CAA41545.1

NP_187759.2

Exercise 1

Go to Entrez Gene and look for your favourite gene or genes

genomic location

expression data

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1 - protein translated from mRNA

XM_001102567.1 - predicted mRNA

XP_001089765.1 - predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAGdonor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene Predictions

Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT...AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
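The first rule (ATG to STOP, in-frame) can be sketched as a simple open reading frame scan. This is only an illustration of that one rule: it looks at the forward strand only, and ignores splicing and the statistical evidence that real gene predictors use. The test sequence is the start of the genomic example on this slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Find ATG...STOP stretches in each forward reading frame.

    Returns (start, end, orf) tuples with 0-based start and exclusive end;
    the stop codon is included in the reported ORF.
    """
    orfs = []
    for frame in range(3):
        position = frame
        while position + 3 <= len(sequence):
            if sequence[position:position + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop.
                for end in range(position + 3, len(sequence) - 2, 3):
                    if sequence[end:end + 3] in STOP_CODONS:
                        if (end + 3 - position) // 3 >= min_codons:
                            orfs.append((position, end + 3, sequence[position:end + 3]))
                        position = end  # resume scanning after this ORF
                        break
            position += 3
    return orfs


genomic = "CGTCGTATGGCTTCGATGTAGTACATC"
print(find_orfs(genomic))
```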

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence hellip

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...).

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.

exons 1 2 3 4

TheoreticalPredicted Sequences

genome

predicted gene modelexons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist - especially where supporting EST evidence is weak or non-existent.

predicted transcript

predicted protein

Sequences for a model organism

ESTs - millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs - tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes - one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So Whatrsquos in the Databases Now

NCBI, July 2005:

DNA:       15,000,000 ESTs; 3,300,000 cDNAs
Proteins:  2,700,000 proteins (nr); 950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence...

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences; this is what we call homology - detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task - the two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

Gene duplication through speciation.

The two copies of Gene A will now evolve independently, but will continue to have ~the same function.

They are ORTHOLOGS.

Paralogs

Gene duplication through internal genome duplication.

The two copies of Gene A (A and A') will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS.

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

[Diagram labels: gene A in each species; cyclin b1 / cyclin b1]

Function Conserved Longer than Detectable Similarity

[Figure labels: start from first self-replicating sequence; living organisms; same function; detectable similarity; whole genome duplication; local duplication]

Redundancy in the Genetic Code

Codons              Letter  Amino acid
GCA GCC GCG GCT     A       alanine
TGC TGT             C       cysteine
GAC GAT             D       aspartate
GGA GGC GGG GGT     G       glycine

'Synonymous' or 'silent' mutations, such as those in the third position of the codon triplets above, have no effect on the amino acid coded for, so there is no evolutionary pressure against them…
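The redundancy in the table above can be checked with a short Python sketch. It assumes the standard genetic code; the compact table layout and the helper name `synonymous_codons` are illustrative choices, not part of the workshop material.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# at each position ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(amino_acid):
    """Return every codon that encodes the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


# The four amino acids shown on the slide.
for aa in "ACDG":
    print(aa, synonymous_codons(aa))
```

For these four amino acids, every synonymous codon differs only in the third base, which is why third-position changes are so often silent.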

Protein Similarity Persists Longer

Successive point mutations:

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

Comparison 1:

DNA      CTATCACGAGAACCTGTG
         CTGTCCCGTGAGCCAGTT    identity 67%

Protein  LSREPV
         LSREPV                identity 100%

Comparison 2:

DNA      CTATCACGAGAACCTGTG
         TTGTCCCGGTCGCCAGTT    identity 44%

Protein  LSREPV
         LSRFPV                identity 80%

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
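As a rough illustration of this point, the Python sketch below translates the two DNA sequences from this slide and compares percent identity at both levels. It assumes the standard genetic code and reading frame 1; the function names `translate` and `percent_identity` are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of identical positions in two equal-length, ungapped sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"

print("DNA identity:     %.0f%%" % percent_identity(dna_1, dna_2))
print("Protein identity: %.0f%%" % percent_identity(translate(dna_1), translate(dna_2)))
```

Note that the protein figure counts only identical residues; the conservative M/I substitution (shown as '+' in the slide) is counted as a difference here.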

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder: paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment; data mining

gene sequence: what is its function?

[Figure: the gene sequence is compared with a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3); highlighted entry: Gravin-like]

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make the assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

[Figure: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
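The reciprocal best hit idea can be sketched in a few lines of Python. The scores below are hypothetical numbers standing in for BLAST bit scores, and the protein names and the function names `best_hit` and `reciprocal_best_hits` are invented for illustration; a real analysis would parse actual BLAST results for both directions.

```python
def best_hit(query, scores):
    """Return the highest-scoring target for this query, or None if it has no hits."""
    hits = scores.get(query, {})
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is also b's best hit."""
    pairs = []
    for a in a_vs_b:
        b = best_hit(a, a_vs_b)
        if b is not None and best_hit(b, b_vs_a) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical scores: query -> {target: score}.
frog_vs_human = {
    "frog_A": {"human_A": 310, "human_B": 120},
    "frog_B": {"human_A": 150, "human_B": 95},
}
human_vs_frog = {
    "human_A": {"frog_A": 305, "frog_B": 140},
    "human_B": {"frog_A": 118, "frog_B": 90},
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))
```

In this toy data frog_B's best human match is human_A, but human_A's best frog match is frog_A, so frog_B is not reported as an ortholog candidate.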

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These, of course, represent chromosomal re-arrangements since these organisms diverged.

[Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact, we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is not biologically meaningful.

To prevent this, we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
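The trade-off between matches and gap costs can be sketched with a small dynamic-programming scorer in Python. This is a textbook global alignment (Needleman-Wunsch style) with a simple linear gap cost, shown only to illustrate the idea; it is not how BLAST itself works, and the scoring values are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear per-position gap cost."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# The two random sequences from the slide.
seq_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(seq_1, seq_2, gap=0)
costly_gaps = global_alignment_score(seq_1, seq_2, gap=-2)
print("score with cost-free gaps:", free_gaps)
print("score with gap cost of -2:", costly_gaps)
```

With cost-free gaps the score simply counts how many letters can be lined up in order, so even unrelated sequences score well; charging for gaps removes that artificial advantage.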

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Figure: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

Program   Query sequence   Other operation                          Database sequences   Speed
BLASTn    nucleotide       (none)                                   nucleotide           FAST
BLASTp    protein          (none)                                   protein              FAST
BLASTx    nucleotide       6 frame translation of the query         protein              SLOW
tBLASTn   protein          6 frame translation of the database      nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both              nucleotide           HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT
Example protein query: MQWCGYRWTYQGYRW
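The '6 frame translation' step used by the translated BLAST flavours can be sketched in Python as follows. This assumes the standard genetic code; the helper names `reverse_complement` and `six_frame_translation` and the frame labels (+1 to -3) are illustrative conventions, not BLAST's internal implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translation(dna):
    """Translate all three frames of both strands; keys are '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = "".join(
                CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3)
            )
            frames[strand + str(offset + 1)] = protein
    return frames


# Example nucleotide query from the slide.
for frame, protein in six_frame_translation("ACGATAGATCCCATCCATAAAT").items():
    print(frame, protein)
```

Each nucleotide sequence yields six candidate protein sequences, which is one reason the translated searches are slower than BLASTn or BLASTp.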

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is tried against the 1st database sequence at successive offsets; the best alignment, which needs one gap, is:]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| ||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

word      number
AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Build a position index of all words in the database

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse the QUERY SEQUENCE against the same numbered list of all possible 'words':

position  word
1         14236
2         33568
3         07967

Then look up each query word in the index of the database built above.
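The word index described above can be sketched in Python as a dictionary from each word to the places it occurs. This follows the slide's description rather than the actual BLAST source code; the word size of 8, the function names and the use of the word itself as the key (instead of a word number) are simplifying assumptions.

```python
from collections import defaultdict

WORD_SIZE = 8  # matches the 8-letter words shown on the slide


def build_word_index(database, word_size=WORD_SIZE):
    """Map every word to a list of (sequence id, 1-based position) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


def find_seeds(query, index, word_size=WORD_SIZE):
    """Return (query position, sequence id, database position) for each shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((qpos + 1, seq_id, dpos))
    return seeds


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database)
seeds = find_seeds(query, index)
print(seeds[:3])
```

The index is built once for the database, so each query only needs dictionary lookups to find candidate sequences and their relative positions.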

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
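A quick Python check makes the point concrete for the two sequences in this example: the longest run of consecutive identical positions is shorter than the word size, so a word-based search has nothing to seed from. The function name `longest_identical_run` is invented for this sketch, and it only looks at the ungapped alignment shown on the slide.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical aligned positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

WORD_SIZE = 11  # BLASTn default quoted on the slide
run = longest_identical_run(seq_a, seq_b)
print("longest exact run:", run)
print("long enough to seed a match:", run >= WORD_SIZE)
```

Lowering the word size (as in the Word Size exercise later on) is the usual way to recover alignments like this one.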

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also called the "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: occurrences of each query marked along the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG]

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
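The same back-of-the-envelope calculation can be written as a small Python function. It assumes, as the slide does, that every letter is equally likely and that only exact ungapped matches count; the function name `expected_chance_matches` is illustrative, and this is not the statistical model BLAST actually uses.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact matches to a query of the given length by chance."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # approximate size quoted on the slide

for length in (1, 2, 3, 60):
    value = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print("query length %2d: %.2g expected matches" % (length, value))
```

Computed exactly, 4^60 is about 1.3 x 10^36, so the 60-mer figure comes out a little below the slide's rounded 5.0 x 10^-28; the order of magnitude is what matters here.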

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR the same motif with one extra base inserted between ACC[TG] and TA.

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If ACC[TG]TAA is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
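The exercise answer can be reproduced with a short Python sketch. It assumes equal base frequencies and ignores edge effects at the ends of the sequence; the function name `expected_motif_count` is hypothetical, and the use of '.' to stand for the optional extra base (any nucleotide) is an assumption made for this example.

```python
import re


def expected_motif_count(motif, sequence_length, alphabet_size=4):
    """Expected chance occurrences of a motif written with letters, [..] classes and '.'."""
    tokens = re.findall(r"\[[ACGT]+\]|[ACGT.]", motif)
    probability = 1.0
    for token in tokens:
        if token == ".":
            allowed = alphabet_size      # any base matches
        elif token.startswith("["):
            allowed = len(token) - 2     # number of bases inside the brackets
        else:
            allowed = 1                  # one specific base
        probability *= allowed / alphabet_size
    return sequence_length * probability


short_form = expected_motif_count("ACC[TG]TA", 10_000)
long_form = expected_motif_count("ACC[TG].TA", 10_000)
print("ACC[TG]TA:     %.1f" % short_form)
print("ACC[TG].TA:    %.1f" % long_form)
print("either motif:  %.1f" % (short_form + long_form))
```

Because the extra position accepts any base, it does not change the probability, which is why the two forms have the same expectation and allowing either doubles the count.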

E-values: Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: occurrences of each query marked along the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG]

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each database:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

The nr database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
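To see why the same percent identity means very different things at different lengths, the Python sketch below computes the identity of both alignments and then a simple binomial estimate of how likely such an identity is at one fixed alignment position by chance. The random model (each position matches with probability 1/4, no gaps) and the function names are assumptions for illustration; this is not BLAST's E-value calculation.

```python
from math import comb


def percent_identity(a, b):
    """Percentage of identical positions in two equal-length, ungapped sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def chance_of_identity(length, min_matches, p_match=0.25):
    """Probability of at least min_matches identical positions out of length by chance."""
    return sum(
        comb(length, k) * p_match ** k * (1 - p_match) ** (length - k)
        for k in range(min_matches, length + 1)
    )


query1_top = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
query1_hit = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2_top = "ACGTACGTACGTACGT"
query2_hit = "ACGCACCTTCGTAGGT"

print("Query1 identity: %.0f%%" % percent_identity(query1_top, query1_hit))
print("Query2 identity: %.0f%%" % percent_identity(query2_top, query2_hit))
print("chance of >= 45/60 matches: %.1e" % chance_of_identity(60, 45))
print("chance of >= 12/16 matches: %.1e" % chance_of_identity(16, 12))
```

Both alignments are 75% identical, but the short one is many orders of magnitude more likely to arise by chance, which is the information identity alone throws away.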

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

BLAST: Similarity + Probability (match length, PI, size of database…)

ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value scale, from larger/worse to smaller/better: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

Labels along the scale, in the same direction: fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Can be substituted by
A        S
C        (none)
F        L W Y
G        (none)
I        L M V
L        I M F V
M        I L V
P        (none)
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
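The two rough estimates above can be reproduced in Python. The database size is the one quoted earlier in these slides; treating substitutability as an 'effective alphabet' of 7 is the slide's own approximation, and the function name is illustrative.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted in the BLAST output slide


def expected_chance_matches(database_letters, query_length, effective_alphabet):
    """Expected chance matches if each position matches with probability 1/effective_alphabet."""
    return database_letters / effective_alphabet ** query_length


strict = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
tolerant = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print("exact residues only (1/20 per position):   %.1e" % strict)
print("allowing substitutions (1/7 per position): %.1e" % tolerant)
print("ratio: %.1e" % (tolerant / strict))
```

Allowing about two substitutes per residue raises the expected number of chance matches by roughly nine orders of magnitude for a 20-residue query, which accounts for much of the gap between the naive estimate and the reported value.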

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
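For scripted use, the command line above can be assembled programmatically. The Python sketch below only builds and prints the argument list, using the -p, -i, -d, -e and -W options listed on these slides; the file name and database name are hypothetical placeholders, and it does not run anything, since the legacy blastall program may not be installed.

```python
import shlex


def blastall_command(program, query_file, database, evalue=None, word_size=None):
    """Build a blastall argument list from the options described above."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        args += ["-e", str(evalue)]      # max expected value
    if word_size is not None:
        args += ["-W", str(word_size)]   # word size
    return args


# Hypothetical file and database names, for illustration only.
command = blastall_command("blastn", "query.fasta", "refseq_rna", evalue=100, word_size=7)
print(shlex.join(command))
```

Keeping the arguments as a list (rather than one long string) makes it straightforward to pass them to `subprocess.run` later without worrying about shell quoting.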

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Figures: BLASTn results; BLASTx results]

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Figure: BLASTn, low complexity filtering OFF]

Results for Exercise 2A (ON)

[Figure: BLASTn, low complexity filtering ON]

Results for Exercise 2B

[Figures: filtering ON; filtering OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Figures: filtering ON; filtering OFF]

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Figures: BLASTn; tBLASTx]

Results for Exercise 4 (ii)

[Figures: Mismatch penalty = -2 (default); Mismatch penalty = -1]

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Sequences and Accession Numbers

NM_001015922.1 gi=62860271

GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

BC009638.1 gi=16307106

GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA

NM_001015922.2 gi=62860589

GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA

NP_001015922.1 protein translated from mRNA

XM_001102567.1 predicted mRNA

XP_001089765.1 predicted protein translated from predicted mRNA

mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene Predictions

Given:
- coding sequence must run in-frame from ATG to a STOP codon
- introns (GT...AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
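The first rule above (coding sequence runs in-frame from ATG to a STOP codon) can be sketched in a few lines. This is only an illustration of the idea, not how a real gene predictor works: it ignores introns, the reverse strand and all the statistical evidence, and the function name `find_orfs` and its `min_codons` parameter are made up for this example.

```python
# Sketch only: find forward-strand ORFs (ATG ... first in-frame stop).
# Real gene predictors also model introns and sequence composition.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs, 0-based and end-exclusive, stop codon included."""
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from this ATG until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                # Count the codons before the stop (the ATG included).
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


# First 50 nt of the genomic example sequence used on this slide.
example = "CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

An ATG with no in-frame stop before the end of the sequence is simply skipped here; a real tool would have to decide what to do with such partial ORFs.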

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

scan genomic sequence …

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…)

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model

exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So Whatrsquos in the Databases Now

NCBI July 2005

DNA: 15,000,000 ESTs; 3,300,000 cDNAs

Proteins: 2,700,000 proteins (nr); 950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100000 years in this gene sequencehellip

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed to produce fertile offspring, then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation

The two copies of Gene A will now evolve independently, but will continue to have ~the same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

A' A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor

A

A

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against them…

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67% 100%

44% 80%

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
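To make the point concrete, here is a small sketch that translates the two DNA sequences from this slide with the standard genetic code and compares percent identity at both levels. It assumes ungapped sequences of equal length read in frame 1; the helper names `translate` and `percent_identity` are just for illustration.

```python
# Standard genetic code, built from the usual compact TCAG-ordered string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate frame 1 of an upper-case DNA string ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percent of identical positions between two equal-length, ungapped strings."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


dna1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
prot1, prot2 = translate(dna1), translate(dna2)

print(prot1, prot2)
print("DNA identity:     %.0f%%" % percent_identity(dna1, dna2))
print("protein identity: %.0f%%" % percent_identity(prot1, prot2))
```

Note that identity counts only exact matches, so the conservative M/I substitution (shown as '+' in the alignment) is scored as a mismatch here.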

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison

Go to NCBI ORF Finder, paste sequence, hit OrfFind, identify longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, copy sequence to Blast2Seqs. Do the same with the other sequence

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment data mining

gene sequence what is its function

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

function known, annotation 'Gravin' available

Human gene

Xenopus gene

function unknown

sequence similarity

orthologs

same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine, but it is probably SHAPE not SEQUENCE that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in

frog protein

database of human proteins

best match human protein

database of frog proteins

x
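The reciprocal best hit logic itself is simple once the two BLAST searches have been run. A minimal sketch, assuming you have already reduced each search to a 'best hit' table (a plain dictionary here); the function name and the protein identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b AND b's best hit is a.

    Each argument maps a query protein to its single best-scoring hit
    in the other organism's protein set.
    """
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical identifiers, for illustration only.
frog_to_human = {"frog_1": "human_A", "frog_2": "human_B", "frog_3": "human_B"}
human_to_frog = {"human_A": "frog_1", "human_B": "frog_3"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_2 finds human_B, but human_B finds frog_3 on the way back, so frog_2 is not reported: this is the 'x' case in the diagram.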

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections

These of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| ||| | ||||||

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
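The trade-off between matches and gap costs can be sketched with a textbook global alignment score (Needleman-Wunsch style dynamic programming, linear gap cost). This is not what BLAST does internally, and the scoring values below are arbitrary choices for illustration, but it shows why 'cost free' gaps make random sequences look similar.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear per-gap-position cost."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two residues, or put a gap in one sequence.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# The two random sequences from this slide.
seq1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

print("free gaps:      ", alignment_score(seq1, seq2, gap=0))
print("penalised gaps: ", alignment_score(seq1, seq2, gap=-2))
```

With `gap=0` the score is just the largest number of matches that can be lined up in order, however many gaps that takes; with a gap cost the same pair scores much lower.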

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

flavour   query sequence   other operation                   database sequences   speed
BLASTn    nucleotide       none                              nucleotide           FAST
BLASTp    protein          none                              protein              FAST
BLASTx    nucleotide       6 frame translation of query      protein              SLOW
tBLASTn   protein          6 frame translation of database   nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both       nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible 'words'

Build a position index of all words in the database

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible 'words'

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database
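The word/index idea can be sketched as follows. This is a toy version under several assumptions: it indexes the words themselves in a Python dictionary rather than numbering all possible words, it uses 0-based positions (the slides count from 1), and the word size of 8 simply follows the 8-letter words in the slide's list. The function names are made up for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word in the database to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence id, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((qpos, seq_id, dpos))
    return seeds


# The three database sequences and the query from the slides.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)
seeds = find_seeds(query, index, word_size=8)
print(len(seeds), "word matches; first few:", seeds[:3])
```

The index is built once per database, so each new query only costs a series of dictionary lookups; only the sequences that produce seeds need to be looked at in detail.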

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC
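In the alignment above the longest run of consecutive matches is 10, one short of the word size. A quick way to check whether two sequences share any exact word of a given size at all is sketched below; `shared_words` is a name made up for this example and works on plain strings.

```python
def shared_words(a, b, word_size):
    """Return the set of exact words of the given size that occur in both sequences."""
    words_a = {a[i:i + word_size] for i in range(len(a) - word_size + 1)}
    words_b = {b[i:i + word_size] for i in range(len(b) - word_size + 1)}
    return words_a & words_b


# The two sequences from the alignment on this slide.
seq1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

for size in (11, 10, 7):
    print("word size", size, "->", len(shared_words(seq1, seq2, size)), "shared words")
```

If no word is shared at the chosen word size there is nothing for the expansion step to start from, which is why a smaller word size is sometimes worth the extra search time.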

Care is sometimes needed…

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT
Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
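The same back-of-envelope arithmetic as a short sketch. This is the naive model used on this slide (exact, ungapped matches; all letters equally likely), not the statistics BLAST actually uses, and the function name is made up for the example.

```python
def naive_expected_matches(database_letters, alphabet_size, query_length):
    """Expected number of exact chance matches to a query of the given length."""
    return database_letters / alphabet_size ** query_length


refseq_mrna_letters = 5.0e8  # ~ size of RefSeq mRNA in July 2005, from the slide

for length in (1, 2, 3, 60):
    expected = naive_expected_matches(refseq_mrna_letters, 4, length)
    print("%2d-mer: %.1e" % (length, expected))
```

Computed exactly, 4 multiplied 60 times is about 1.3 x 10^36, so the 60-mer figure comes out slightly below the rounded 5.0 x 10^-28, but at the same order of magnitude.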

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

theoretical value was 5.0e-28

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions?

I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
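The answer for the basic motif can be checked with a few lines. This sketch assumes equal base frequencies, handles only plain letters and bracketed alternatives such as `[TG]`, and ignores edge effects at the ends of the sequence; the function names are made up for the example.

```python
def motif_probability(motif, alphabet_size=4):
    """Chance that a motif such as 'ACC[TG]TA' matches at one given position."""
    prob = 1.0
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            # A bracketed class: any one of the listed letters is acceptable.
            close = motif.index("]", i)
            prob *= (close - i - 1) / alphabet_size
            i = close + 1
        else:
            prob *= 1 / alphabet_size
            i += 1
    return prob


def expected_occurrences(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of the given length."""
    return sequence_length * motif_probability(motif)


print("one site every %.0f nt" % (1 / motif_probability("ACC[TG]TA")))
print("expected in 10k: %.1f" % expected_occurrences("ACC[TG]TA", 10000))
```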

E-values Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

1.4 x 10^10 letters

5.0 x 10^8 letters

30 x bigger

BLAST the same sequence against each

E-value = 1.4e-26

E-value = 5.0e-28

The database was ~30 times bigger, and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28

But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values Effect of Query Length

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn

Get a full length match with sequence XYZ at an E-value = 5.0e-160

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Why not just use identity?

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse ... smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence

Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

residue: possible substitutes
A: S C
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc, it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL section, 'Align two sequences'.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


mRNA Splicing Signals

gene model

genome

CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTAC CATTTTATACTCATGCAACGGACCGT AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA

GTAAG donor

TTTCAG acceptor

mRNA

exon intron exon intron exon

splice sites

Gene Predictions

Given:
- coding sequence must run from ATG to a STOP codon, in-frame
- introns (GT … AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
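The first rule (ATG to STOP, in-frame) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it scans the three forward reading frames of an unspliced sequence, ignores introns and the reverse strand, and the function name and `min_codons` threshold are hypothetical choices, not part of any real gene-prediction program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs, ATG through stop codon."""
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen in this frame since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


genomic = ("CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
           "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA")
for start, end in find_orfs(genomic):
    print(start, end, genomic[start:end])
```

A real gene finder would additionally try the possible GT … AG introns and weigh each candidate model statistically, as the second list describes.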

scan genomic sequence …

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

candidate models with one intron spliced out:

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

most likely gene model

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)

So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences

exons 1 2 3 4

TheoreticalPredicted Sequences

genome

predicted gene modelexons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent

predicted transcript

predicted protein

Sequences for a model organism

ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So Whatrsquos in the Databases Now

NCBI, July 2005

DNA:      15,000,000 ESTs
          3,300,000 cDNAs

Proteins: 2,700,000 proteins (nr)
          950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTGATGCAGGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCGTGATGCATGCTGCCAACGACTGCCCTGATGCATGCTGCCAACGGCTGCCCTGATGCATGCTGCCAACGGATGCCCTG

Gene AATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTGATGAAGGCCGCCAACGACTGTCGTGATGAAAGCCGCCAACGACTGTCGTGATGAAAGCCGCCAACGACAGTCGTGATGAAAGCCGCCTACGACAGTCGTGATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

ATGCATGCTGCCAACGGATGCCCTG

ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

GCTGACTCGTAGCGCTTAGCTAGCT

CCAACATCTAGCCAGATTAGTTAGT | || | | | |

Orthologs

A A

A

Gene duplication through speciation

The two copies of Gene A will now evolve independently but will continue to have the ~same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A Arsquo

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

Arsquo A

The Essential Paradigm

1 any group of modern species can be traced back to some extinct common ancestor

A

A

2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor

3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

'Synonymous' or 'silent' mutations, most often in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against them…

Protein Similarity Persists Longer

CTATCACGAGAACCTGTGCTATCCCGAGAACCTGTGCTATCCCGAGAACCAGTGCTATCCCGTGAACCAGTGCTATCCCGTGAGCCAGTGCTATCCCGTGAGCCAGTTCTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG

CTGTCCCGTGAGCCAGTT|| || || || || ||

LSREPV

LSREPV||||||

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67 100

44 80

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
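The comparison on this slide can be reproduced with a short Python sketch. It assumes the standard genetic code (built here from the usual compact TCAG-ordered table) and a simple position-by-position identity count with no gaps; the function names are illustrative only.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA sequence in frame 1, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(seq_a, seq_b):
    """Return (identical positions, positions compared) for an ungapped comparison."""
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return matches, min(len(seq_a), len(seq_b))


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
print("DNA identity:    ", identity(dna_1, dna_2))
print("protein identity:", identity(translate(dna_1), translate(dna_2)))
```

For the two sequences on the slide this gives 28 of 39 identical bases but 11 of 13 identical amino acids, which is the point of comparing at the protein level.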

Exercise 1
nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Taskexperiment data mining

gene sequence what is its function

database of proteins in other species

Cyclin-AFoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x
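The Reciprocal Best BLAST idea can be written down as a small Python sketch. It assumes the two BLAST searches have already been run and reduced to 'best hit' dictionaries; the protein identifiers below are hypothetical, and the function name is illustrative only.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b AND b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical best hits: frog protein -> best human match, and the reverse search.
frog_to_human = {"frog_gravin": "human_gravin", "frog_cyclinA1": "human_cyclinA2"}
human_to_frog = {"human_gravin": "frog_gravin", "human_cyclinA2": "frog_cyclinA2"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

In this made-up example only the gravin pair is reciprocal; the frog cyclin A1 protein finds human cyclin A2, but the reverse search prefers a different frog protein, so that pair is not reported as a candidate ortholog.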

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || ||| | | | |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
| || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow lsquocost freersquo gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly although the alignment has no mismatches it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
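The trade-off between matches and gap costs can be made concrete with a small scoring sketch in Python. The scoring values used here (match +1, mismatch -1, gap open -5, gap extend -2) are hypothetical choices for illustration, not BLAST's actual defaults, and the function only scores an alignment that has already been laid out with '-' characters.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score two equal-length aligned rows; '-' marks a gap position."""
    score = 0
    previous_gap_row = None  # which row held a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # The first column of a gap pays the opening cost, later columns the extension cost.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


print(score_alignment("ACGTACGT", "ACGTACGT"))            # ungapped, all matches
print(score_alignment("AC--GT", "ACTTGT"))                # one gap of length 2
print(score_alignment("AC--GT", "ACTTGT", gap_open=0, gap_extend=0))  # 'cost free' gaps
```

With cost-free gaps every gapped layout scores as well as a perfect match, which is exactly the problem illustrated above; charging for gaps means a gap is only worth inserting when it buys enough extra matches.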

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

program   query sequence   other operation                    database sequences   speed

BLASTn    nucleotide       (none)                             nucleotide           FAST
BLASTp    protein          (none)                             protein              FAST
BLASTx    nucleotide       6 frame translation of query       protein              SLOW
tBLASTn   protein          6 frame translation of database    nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both        nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like thishellip

BLAST achieves its speed through two strategies

- it takes a WORD based approach- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible lsquowordsrsquo

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position
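The indexing and look-up steps can be sketched in Python as follows. This is a simplified illustration, not BLAST's actual data structure: words are stored as strings in a dictionary rather than as numbers, the word size of 8 simply matches the 8-letter words shown on the slides, and the function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequences, word_size):
    """Map each word to a list of (sequence number, 1-based position) in the database."""
    index = defaultdict(list)
    for seq_number, seq in enumerate(sequences, start=1):
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_number, pos + 1))
    return index


def word_hits(query, index, word_size):
    """Return (query position, sequence number, database position) for every shared word."""
    hits = []
    for pos in range(len(query) - word_size + 1):
        for seq_number, db_pos in index.get(query[pos:pos + word_size], []):
            hits.append((pos + 1, seq_number, db_pos))
    return hits


database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)   # done once, in advance
for hit in word_hits(query, index, word_size=8)[:5]:
    print(hit)
```

The index is built once for the whole database, so each query only costs one dictionary look-up per word; only the sequences that produce hits need to be examined further.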

Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip
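A quick way to see why this pair would be missed is to measure the longest run of identical bases in the alignment shown. The Python sketch below does that for an ungapped, same-offset comparison; the function name is illustrative, and the check is a simplification of how BLAST actually seeds alignments.

```python
def longest_exact_run(seq_a, seq_b):
    """Length of the longest run of identical bases in an ungapped, same-offset alignment."""
    longest = current = 0
    for a, b in zip(seq_a, seq_b):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

word_size = 11  # default BLASTn word size quoted above
run = longest_exact_run(top, bottom)
print(run, "seed found" if run >= word_size else "no seed of this word size at this offset")
```

Lowering the word size (exercise 6 uses 7) makes such alignments findable again, at the cost of more chance word hits to expand.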

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
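The arithmetic on this slide can be checked with a few lines of Python. This is a sketch of the slide's simple model only (exact matches, four equally likely bases, no gaps), not of how BLAST itself computes E-values; the function name is illustrative.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches under the simple first-principles model."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # approximate size of the RefSeq mRNA database quoted above
for length in (1, 2, 3, 60):
    print(length, expected_chance_matches(refseq_letters, length))
```

Note that 4^60 is about 1.3 x 10^36, so the exact figure for the 60-mer comes out slightly below the slide's value, which rounds the divisor to 10^36; the order of magnitude is the same.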

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e.
ACC[TG]TA OR ACC[TG]NTA (N = any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-valueTo allow either means we expect twice as many
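The answer can be verified with a short Python sketch. It assumes equally likely bases, counts one strand only, and handles only simple motifs made of plain letters and bracketed sets; writing the optional 'any base' position as N is an assumption of this sketch, as is the function name.

```python
import re


def motif_spacing(motif):
    """Average spacing (in nt) between chance matches to a simple motif.

    A plain letter matches one base in four, a bracketed set such as [TG]
    matches len(set) bases in four, and N matches any base.
    """
    spacing = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGTN]", motif):
        allowed = 4 if token == "N" else len(token.strip("[]"))
        spacing *= 4 / allowed
    return spacing


promoter_length = 10000
for motif in ("ACC[TG]TA", "ACC[TG]NTA"):
    spacing = motif_spacing(motif)
    print(motif, spacing, promoter_length / spacing)
```

Both forms of the motif give the same spacing, because a position that accepts any base does not make a chance match any less likely; allowing either form therefore doubles the expected count.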

E-values Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

BLASTn: get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value of ~5.0 x 10^-19.

And Query2 was only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value of ~3.0.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
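The point that identity alone hides match length can be checked directly. The sketch below is only an illustration (the function name `percent_identity` is made up for this example); it counts identical positions in the two ungapped alignments shown on the slide.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# The two alignments from the slide
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
hit1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
hit2 = "ACGCACCTTCGTAGGT"

pid1 = percent_identity(query1, hit1)
pid2 = percent_identity(query2, hit2)
print(f"60-mer: {pid1:.0f}% identity over {len(query1)} nt")
print(f"16-mer: {pid2:.0f}% identity over {len(query2)} nt")
```

Both alignments score the same percentage, yet one spans 60 positions and the other 16, which is why the E-values quoted on the slide differ so much.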

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because ORTHOLOGY is not the same thing as a BLAST result:

- BLAST gives similarity + probability (match length, PI, size of database...)
- ORTHOLOGY also needs biological knowledge (nature of query sequence, phylogenetic relationship)

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value    Verdict
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

(larger = worse, smaller = better)

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
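The two back-of-envelope protein estimates above can be reproduced in a few lines. This is only a sketch of the slide's simplified argument (database letters divided by the number of equally likely choices per position, raised to the query length); the function name is invented here, and real BLAST statistics are more involved.

```python
DB_LETTERS = 347_895_532  # RefSeq protein database size quoted in the slides
QUERY_LEN = 20            # the 20 amino acid query

def naive_expect(db_letters: float, query_len: int, choices_per_position: int) -> float:
    """Chance matches expected if every position has the same number of equally likely choices."""
    return db_letters / choices_per_position ** query_len

exact_only = naive_expect(DB_LETTERS, QUERY_LEN, 20)         # exact residue required
with_substitutions = naive_expect(DB_LETTERS, QUERY_LEN, 7)  # ~3 acceptable residues out of 20

print(f"exact matching only:     {exact_only:.1e}")
print(f"allowing substitutions:  {with_substitutions:.1e}")
```

The second figure lands within a few orders of magnitude of the reported Expect value, whereas the first is far too small.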

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
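If the command is driven from a script rather than typed by hand, it helps to build the argument list explicitly. The sketch below only assembles and prints the command; it does not run it. The file name `query.fasta` and database name `refseq_rna` are placeholders invented for this example, and the optional `-e` (maximum expect value) flag is the one listed among the parameters in this part of the workshop.

```python
import shlex

def blastall_command(program, query_file, database, max_expect=None):
    """Assemble a legacy 'blastall' command line as a list of arguments."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_expect is not None:
        cmd += ["-e", str(max_expect)]  # -e: maximum expected value to report
    return cmd

# Placeholder query file and database name, for illustration only
cmd = blastall_command("blastn", "query.fasta", "refseq_rna", max_expect=1e-5)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would execute it on a machine where blastall and a formatted database are actually installed.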

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1: BLASTn and BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B: filter ON vs OFF

Results for Exercise 2C: filter ON vs OFF

There is a sequence error, an extra G at position 117 in the sequence:

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i): BLASTn vs tBLASTx

Results for Exercise 4 (ii): mismatch penalty = -2 (default) vs mismatch penalty = -1

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • END

Gene Predictions

Given:
- coding sequence must run from ATG to a STOP codon, in-frame
- introns (GT ... AG) can be spliced out

Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others

Scan genomic sequence ...

CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

Candidate spliced sequences:

CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA

... and pick the most likely gene model
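The two rules above (an in-frame ATG to STOP reading frame, and GT...AG introns that can be spliced out) are simple enough to try on the slide's own sequence. The following is a toy sketch, not a real gene finder: the function names and the minimum intron length of 10 are assumptions made for this example, and it ignores the statistical scoring the slide mentions.

```python
import re

GENOME = ("CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
          "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA")
STOPS = {"TAA", "TAG", "TGA"}

def candidate_introns(seq, min_len=10):
    """Yield (start, end) of every span that begins with GT and ends with AG."""
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    for d in donors:
        for a in acceptors:
            if a - d >= min_len:  # min_len is an arbitrary cut-off for this sketch
                yield d, a

def splice(seq, start, end):
    """Remove the intron seq[start:end]."""
    return seq[:start] + seq[end:]

def first_orf(seq):
    """Return the reading frame from the first ATG to the first in-frame stop, if any."""
    start = seq.find("ATG")
    if start < 0:
        return None
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return seq[start:i + 3]
    return None

models = {splice(GENOME, s, e) for s, e in candidate_introns(GENOME)}
print(len(models), "candidate spliced sequences")
print("unspliced ORF:", first_orf(GENOME))
```

Both spliced sequences shown on the slide appear among the candidates; a real predictor would then rank the candidates using composition and splice-site statistics.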

Supporting Evidence

[Figure: EST evidence aligned against the genome, supporting a gene model with exons 1, 2, 3, 4]

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...).

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.

Theoretical/Predicted Sequences

[Figure: genome -> predicted gene model (exons 1, 2, 3, 4) -> predicted transcript -> predicted protein]

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.

Sequences for a model organism

ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So What's in the Databases Now

NCBI, July 2005

DNA:
- 15,000,000 ESTs
- 3,300,000 cDNAs

Proteins:
- 2,700,000 proteins (nr)
- 950,000 proteins (RefSeq)

Part 2 Comparative Genomics

Evolution by sequence mutation

Gene sequence:

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence...

Speciation

Gene A in the ancestor:

ATGAAGGCTGCCTACGACTGCCGTG

In one population:

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

In the other population:

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

Comparing the two end points:

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

Biologically related:

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

Chance match:

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

Gene duplication through speciation.

The two copies of Gene A will now evolve independently but will continue to have the ~same function.

They are ORTHOLOGS.

Paralogs

Gene duplication through internal genome duplication.

The two copies of Gene A (A and A') will now evolve independently but will probably not continue to have exactly the same function.

They are PARALOGS.

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS.

Bear in mind that understanding gene function is more important than semantics...

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

[Figure: gene A (cyclin b1) in two modern species descended from a common ancestor]

Function Conserved Longer than Detectable Similarity

[Figure: timeline from the first self-replicating sequence to living organisms, with whole genome duplication and local duplication events marked, comparing the span of 'same function' with the span of 'detectable similarity']

Redundancy in the Genetic Code

Codons             Amino acid
GCA GCC GCG GCT    A  alanine
TGC TGT            C  cysteine
GAC GAT            D  aspartate
GGA GGC GGG GGT    G  glycine

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them...

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

First comparison (DNA identity 67%, protein identity 100%):

CTATCACGAGAACCTGTG    LSREPV
|| || || || || ||     ||||||
CTGTCCCGTGAGCCAGTT    LSREPV

Second comparison, after more mutations (DNA identity 44%, protein identity 80%):

CTATCACGAGAACCTGTG    LSREPV
TTGTCCCGGTCGCCAGTT    LSRFPV

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
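The comparison above can be reproduced by translating both DNA sequences with the standard genetic code and counting identical positions at each level. This is a small sketch (the helper names are invented here); the codon table is built from the conventional TCAG ordering.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon)
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate a DNA sequence in frame 1, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identical_positions(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))

dna1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
prot1, prot2 = translate(dna1), translate(dna2)

dna_same = identical_positions(dna1, dna2)
prot_same = identical_positions(prot1, prot2)
print(f"DNA:     {dna_same}/{len(dna1)} identical")
print(f"Protein: {prot_same}/{len(prot1)} identical ({prot1} vs {prot2})")
```

Every DNA difference here falls in a codon's third position, and only two of them change the amino acid.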

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers: Exercise 1

The Essential Task

experiment -> gene sequence -> what is its function?

data mining -> database of proteins in other species:

Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3

Best match: Gravin-like

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available.
Xenopus gene: function unknown.

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein -> BLAST against database of human proteins -> best match human protein

best match human protein -> BLAST against database of frog proteins -> is the best match the original frog protein?
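Once the best hit in each direction is known, the reciprocal test itself is a simple lookup. The sketch below assumes the two BLAST searches have already been run and summarised as dictionaries; every identifier in it (`frog_geneX`, `human_CCNB1` and so on) is invented for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}

# Hypothetical best hits: frog proteins vs human database, and the reverse
frog_to_human = {"frog_cyclinB1": "human_CCNB1", "frog_geneX": "human_geneY"}
human_to_frog = {"human_CCNB1": "frog_cyclinB1", "human_geneY": "frog_geneZ"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here `frog_geneX` is dropped because its best human match points back to a different frog protein, which is exactly the situation the reciprocal test is meant to catch.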

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

[Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

After a single-base insertion in the first sequence:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
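The trade-off between matches and gap costs can be made concrete with a small dynamic-programming scorer. This is a sketch of a global (Needleman-Wunsch style) alignment score, not BLAST's own method, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary choices for illustration.

```python
def best_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a fixed cost per gap position."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

# The two random sequences from the slide
seq1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free = best_alignment_score(seq1, seq2, gap=0)     # 'cost free' gaps
costly = best_alignment_score(seq1, seq2, gap=-2)  # each gap position is penalised
print("score with free gaps:   ", free)
print("score with costly gaps: ", costly)
```

With free gaps even unrelated sequences reach a respectable score, because every mismatch can be dodged by opening a gap; penalising gaps removes that loophole.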

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

query:

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

database:

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query + database -> COMPARE -> LIST MATCHES

Flavours of BLAST

Program   Query sequence                    Database sequences                Speed
BLASTn    nucleotide                        nucleotide                        FAST
BLASTp    protein                           protein                           FAST
BLASTx    nucleotide, 6-frame translation   protein                           SLOW
tBLASTn   protein                           nucleotide, 6-frame translation   SLOWER
tBLASTx   nucleotide, 6-frame translation   nucleotide, 6-frame translation   HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence, counting matches at each offset, until the best (gapped) alignment is found]

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE:

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
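The word-index idea can be sketched in a few lines: index every word of the database once, then look up each word of the query to find candidate starting points. This is only an illustration of the principle with 8-letter words as on the slide; the function names are invented, and BLAST's real index and word numbering differ in detail.

```python
from collections import defaultdict

WORD_SIZE = 8  # the slide's example uses 8-letter words

def words(seq, size=WORD_SIZE):
    """Yield (position, word) for every overlapping word in seq (0-based positions)."""
    for pos in range(len(seq) - size + 1):
        yield pos, seq[pos:pos + size]

def build_index(database):
    """Map each word to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos, word in words(seq):
            index[word].append((seq_id, pos))
    return index

def seed_hits(query, index):
    """Return (sequence id, query position, database position) for every shared word."""
    return [(seq_id, q_pos, db_pos)
            for q_pos, word in words(query)
            for seq_id, db_pos in index.get(word, [])]

database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database)
hits = seed_hits(query, index)
print(hits[:3])
```

The index is built once per database, so each query costs only a dictionary lookup per word; the hits against sequence 1 all lie on one diagonal (query position = database position + 1), which is where extension would begin.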

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
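Whether a word-based search can seed this alignment depends on the longest run of consecutive identical positions. The short sketch below (function name invented for this example) measures that run for the two sequences shown on the slide and compares it with the word size of 11.

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

WORD_SIZE = 11
run = longest_exact_run(seq_a, seq_b)
print(f"longest exact run: {run}; seed possible with word size {WORD_SIZE}: {run >= WORD_SIZE}")
```

The sequences are highly similar overall, yet no stretch is long enough to share an 11-letter word, so a search at that word size has nothing to extend from.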

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance. Also called the "expect value" or "expectation".

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
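The arithmetic on this slide can be reproduced directly. The sketch below assumes the same simplified model as the text (every base equally likely, exact ungapped matches only); the function name `expected_chance_matches` is hypothetical. Note that the slide rounds 4^60 to about 10^36, whereas exact arithmetic gives about 1.3 x 10^36, so the 60-mer figure computed here is slightly smaller than 5.0 x 10^-28.

```python
def expected_chance_matches(database_letters: float, query_length: int,
                            alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches of a query of the given length.

    Simplified model: each database position matches each query letter
    independently with probability 1 / alphabet_size.
    """
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # approximate size quoted in the text

for length in (1, 2, 3, 60):
    e = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: expected chance matches = {e:.2g}")
```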

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC49019 IMAGE6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC49019 IMAGE6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG].TA (where '.' is any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.

If ACC[TG].TA is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
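The exercise answer can be turned into a small calculation. This is a sketch under stated assumptions: the motif is written with plain letters, bracketed classes such as [TG], and '.' for "any base" (the '.' notation is an assumption made here for the optional extra base); bases are taken as equally likely, only one strand is counted, and edge effects are ignored. The function names are hypothetical.

```python
import re


def motif_probability(motif: str, alphabet_size: int = 4) -> float:
    """Chance that a random position matches a motif made of letters, [..] classes and '.'."""
    prob = 1.0
    for token in re.findall(r"\[[A-Z]+\]|[A-Z.]", motif):
        if token == ".":
            choices = alphabet_size      # any base is accepted
        elif token.startswith("["):
            choices = len(token) - 2     # number of letters inside the brackets
        else:
            choices = 1                  # one specific base
        prob *= choices / alphabet_size
    return prob


def expected_hits(motif: str, sequence_length: int) -> float:
    """Expected number of chance matches in a sequence of the given length."""
    return sequence_length * motif_probability(motif)


single = expected_hits("ACC[TG]TA", 10_000)
either = single + expected_hits("ACC[TG].TA", 10_000)
print(f"ACC[TG]TA alone: {single:.1f} per 10k; allowing either motif: {either:.1f} per 10k")
```

The extra '.' position accepts all four bases, so it does not change the probability, which is why allowing either form simply doubles the expectation.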

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (the RefSeq mRNA database had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size. BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because BLAST gives similarity + probability, whereas ORTHOLOGY also depends on:

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database...

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

(larger = worse, smaller = better)

E-value   rough guide
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its previous value (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue: residues that can substitute for it
A: S C
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
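The two protein estimates above differ only in the "effective alphabet size" used per position. The sketch below makes that comparison explicit; it is the slide's back-of-envelope model, not how BLAST actually computes E-values, and the function name `expected_protein_matches` is hypothetical. Exact arithmetic for the 1/20 case gives about 3.3 x 10^-18, close to the rounded figure quoted in the text.

```python
def expected_protein_matches(database_letters: float, query_length: int,
                             effective_alphabet: float) -> float:
    """Expected chance matches if each position 'matches' with probability 1/effective_alphabet."""
    return database_letters / effective_alphabet ** query_length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # ~347895532 letters, as quoted in the text
QUERY_LENGTH = 20

# Strict identity: only the same amino acid counts as a match (1 in 20).
strict = expected_protein_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 20)
# Substitutions allowed: about 3 acceptable residues per position, i.e. roughly 1 in 7.
relaxed = expected_protein_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 7)

print(f"identity only:       {strict:.1e}")
print(f"with substitutions:  {relaxed:.1e}")
print(f"ratio:               {relaxed / strict:.1e}")
```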

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
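As an illustration of how these options combine, the sketch below assembles a `blastall` command line from the flags listed above. It only builds and prints the command and does not run it; the helper `blastall_command` and the file and database names (`query.fasta`, `nr`) are hypothetical placeholders, and `-F F` is assumed here to be the way to switch the low-complexity filter off.

```python
import shlex


def blastall_command(program: str, query_file: str, database: str, **options) -> list:
    """Assemble a legacy `blastall` argument list (nothing is executed)."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        # Each keyword becomes a single-letter option followed by its value.
        args += [f"-{flag}", str(value)]
    return args


# Hypothetical example: blastn with a relaxed E-value cut-off, a shorter word
# size and low-complexity filtering turned off.
cmd = blastall_command("blastn", "query.fasta", "nr", e=100, W=7, F="F")
print(shlex.join(cmd))
```

Keeping the command as a list of arguments makes it straightforward to hand to a process launcher later, should you have a local BLAST installation.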

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Supporting Evidence

EST evidence

genome

gene model

We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...).

So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.

exons 1 2 3 4

Theoretical/Predicted Sequences

genome

predicted gene model: exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.

predicted transcript

predicted protein

Sequences for a model organism

ESTs - millions, £10 each
Cheap to sequence, so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels

cDNAs - tens of thousands, £1000 each
Expensive, but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences

Genomes - one, £30000000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation

So What's in the Databases Now?

NCBI, July 2005

DNA: 15000000 ESTs; 3300000 cDNAs (nr)

Proteins: 2700000 proteins (nr); 950000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100000 years in this gene sequence...

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

Gene duplication through speciation. The two copies of Gene A will now evolve independently, but will continue to have the ~same function.

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication.

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS

A

A Arsquo

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS.

A

A

A

Arsquo A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

Codon  Amino acid
GCA    A (alanine)
GCC    A
GCG    A
GCT    A

TGC    C (cysteine)
TGT    C

GAC    D (aspartate)
GAT    D

GGA    G (glycine)
GGC    G
GGG    G
GGT    G

'Synonymous' or 'silent' mutations in the third position of codon triplets like these have no effect on the amino acid coded for, so there is no evolutionary pressure against them...

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

DNA (67% identity):
CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

Protein (100% identity):
LSREPV
||||||
LSREPV

After further change, DNA (50% identity):
CTATCACGAGAACCTGTG
 | || ||    || ||
TTGTCCCGGTCGCCAGTT

Protein (83% identity):
LSREPV
||| ||
LSRSPV
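The comparison on this slide can be reproduced by translating the DNA and measuring identity at both levels. The sketch below uses the standard genetic code (built from the usual TCAG-ordered table) and the first pair of sequences from the slide; the function names `translate` and `percent_identity` are hypothetical helpers written for this illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate DNA in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical positions in two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return 100.0 * sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


ancestral = "CTATCACGAGAACCTGTG"
diverged = "CTGTCCCGTGAGCCAGTT"

print("DNA identity:     %.0f%%" % percent_identity(ancestral, diverged))
print("protein identity: %.0f%%" % percent_identity(translate(ancestral), translate(diverged)))
```

Every difference between these two DNA sequences falls in a third codon position, which is why the protein is unchanged.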

Always Compare Protein Sequences

DNA comparison:
ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:
MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment, data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity (orthologs), same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog proteindatabase of human proteins

best match human protein

database of frog proteins

x
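The reciprocal best BLAST idea can be sketched as a simple check over two tables of best hits. This is a minimal illustration, assuming the BLAST searches have already been run and summarised as "best hit" dictionaries; the function name and the protein identifiers are hypothetical examples, not real search results.

```python
def reciprocal_best_hits(best_a_to_b: dict, best_b_to_a: dict) -> dict:
    """Pairs (a, b) where b is a's best hit in species B and a is b's best hit in species A."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


# Hypothetical best hits: frog proteins searched against human proteins ...
frog_to_human = {"frog_cyclinB1": "human_CCNB1",
                 "frog_gravin_like": "human_AKAP12",
                 "frog_orphan": "human_CCNB1"}
# ... and human proteins searched back against frog proteins.
human_to_frog = {"human_CCNB1": "frog_cyclinB1",
                 "human_AKAP12": "frog_gravin_like"}

for frog, human in reciprocal_best_hits(frog_to_human, human_to_frog).items():
    print(frog, "<->", human)
```

In this made-up example `frog_orphan` is dropped: its best human match points back to a different frog protein, so the pair is not reciprocal.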

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
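The trade-off described above can be made concrete with a small scoring function. This is a sketch only: the reward and penalty values are illustrative assumptions rather than BLAST defaults, the function name `alignment_score` is hypothetical, and for simplicity any run of adjacent gap columns is treated as a single gap regardless of which sequence it falls in.

```python
def alignment_score(row_a: str, row_b: str, match: int = 1, mismatch: int = -3,
                    gap_open: int = -5, gap_extend: int = -2) -> int:
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must be the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            # The first column of a gap costs gap_open, each further column gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# With free gaps, inserting them never hurts, so any two sequences can be 'aligned'.
print(alignment_score("AC--GT", "ACAAGT", gap_open=0, gap_extend=0))
# With gap costs, the same alignment is penalised.
print(alignment_score("AC--GT", "ACAAGT"))
```

With costs in place, a gap is only worth opening when it buys enough additional matches to pay for itself.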

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Flavour   Query sequence   Other operation                           Database sequences   Speed
BLASTn    nucleotide       none                                      nucleotide           FAST
BLASTp    protein          none                                      protein              FAST
BLASTx    nucleotide       6-frame translation of the query          protein              SLOW
tBLASTn   protein          6-frame translation of the database       nucleotide           SLOWER
tBLASTx   nucleotide       6-frame translation of query and database nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is at each one.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this…
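To make the cost of the brute-force approach concrete, here is a minimal sketch (not how BLAST is implemented) that slides a query along one target at every possible offset and counts matching positions, without gaps. The function name and the toy sequences are assumptions made for illustration.

```python
def best_ungapped_offset(query, target):
    """Try every offset of query against target; return (matches, offset) for the best one."""
    best = (0, 0)
    # Negative offsets let the query overhang the start of the target.
    for offset in range(-(len(query) - 1), len(target)):
        matches = 0
        for i, base in enumerate(query):
            j = offset + i
            if 0 <= j < len(target) and target[j] == base:
                matches += 1
        if matches > best[0]:
            best = (matches, offset)
    return best


if __name__ == "__main__":
    print(best_ungapped_offset("CATT", "GGCATTGG"))
```

The work grows with (query length x target length) for every database sequence, which is why doing this against millions of sequences is impractical.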

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Database of sequences

Numbered list of all possible 'words'

Build a position index of all words in the database

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible 'words'

position  word
1         14236
2         33658
3         07967

Analyse QUERY SEQUENCE

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Index of database

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
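The word-index idea can be sketched as follows, using the three database sequences and the query from the slides with a word size of 8. This is an illustration only: the function names are assumptions, words are stored as strings rather than as numbers, and positions are 0-based here whereas the slides count from 1.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word in the database to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def find_seeds(query, index, word_size):
    """Look up each query word in the index; return (sequence id, query pos, database pos)."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((seq_id, qpos, dpos))
    return seeds


DATABASE = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
QUERY = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

if __name__ == "__main__":
    index = build_word_index(DATABASE, 8)
    for seed in find_seeds(QUERY, index, 8)[:5]:
        print(seed)
```

The index is built once per database; each query then costs only one dictionary lookup per query word, rather than a comparison against every database position.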

Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported
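The expansion step can be sketched as an ungapped extension to the right of a word match, keeping a running score and giving up once the score has fallen too far below the best seen so far. The scoring values (+1 match, -3 mismatch) and the drop-off threshold are assumptions chosen for illustration, and gaps are not handled in this sketch.

```python
def extend_right(query, target, qpos, tpos, match=1, mismatch=-3, x_drop=5):
    """Extend an alignment rightwards from (qpos, tpos); return (best score, length at best)."""
    score = 0
    best = 0
    best_len = 0
    i = 0
    while qpos + i < len(query) and tpos + i < len(target):
        score += match if query[qpos + i] == target[tpos + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # the score has dropped too far: stop extending
    return best, best_len


if __name__ == "__main__":
    print(extend_right("ACGTACGT", "ACGTACGT", 0, 0))
    print(extend_right("ACGTAAAA", "ACGTCCCC", 0, 0))
```

A full implementation would extend in both directions from the seed and then allow gaps; only the highest scoring extensions go on to be reported.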

But we can potentially miss alignments with no word-sized stretch in common: consider BLASTn with a default word size of 11.

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC
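A quick check of this example: the sketch below measures the longest run of consecutive identical positions in the alignment shown. The function name is an assumption for illustration.

```python
def longest_exact_run(a, b):
    """Length of the longest run of consecutive identical positions in two aligned strings."""
    longest = 0
    current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


TOP = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
BOTTOM = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

if __name__ == "__main__":
    identities = sum(x == y for x, y in zip(TOP, BOTTOM))
    print(identities, len(TOP), longest_exact_run(TOP, BOTTOM))
```

Although 50 of the 56 positions are identical, the longest exact run in this alignment is 10 bases, one short of the word size of 11, so a word-based search seeded at 11 would not pick this alignment up.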

Care is sometimes needed…

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences, 954,378 sequences, 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also called the "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
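The first-principles calculation above can be written as a one-line function. This is a sketch of the simplified model on the slide (exact matches only, all four bases equally likely), not of the statistics BLAST actually uses; the function name is an assumption.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches of a query in a random database."""
    return database_letters / alphabet_size ** query_length


if __name__ == "__main__":
    refseq_letters = 5.0e8  # ~RefSeq mRNA size quoted on the slide
    for length in (1, 2, 3, 60):
        print(length, expected_chance_matches(refseq_letters, length))
```

Note that 4^60 is about 1.3 x 10^36; the slide rounds this to 10^36, so the function returns roughly 3.8 x 10^-28 for the 60-mer where the slide quotes 5.0 x 10^-28.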

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for RefSeq was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. either ACC[TG]TA, or ACC[TG]TA with one extra base (any nucleotide) inserted after the [TG].

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value. To allow either means we expect twice as many.
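The exercise arithmetic can be checked with a small sketch that multiplies the per-position probabilities of a motif. The pattern syntax is an assumption made for this example: square brackets list the allowed bases at one position, and 'N' is used here to stand for "any base" (the slide does not use that letter).

```python
def motif_probability(pattern):
    """Probability that a random position starts a match, assuming equally likely bases."""
    prob = 1.0
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            close = pattern.index("]", i)
            allowed = len(set(pattern[i + 1:close]))
            i = close + 1
        elif pattern[i] == "N":  # assumption: N means any of the four bases
            allowed = 4
            i += 1
        else:
            allowed = 1
            i += 1
        prob *= allowed / 4
    return prob


if __name__ == "__main__":
    promoter_length = 10_000
    for motif in ("ACC[TG]TA", "ACC[TG]NTA"):
        p = motif_probability(motif)
        print(motif, 1 / p, promoter_length * p)
```

Each motif is expected about once every 2048 nt, i.e. just under 5 times in 10 kb; an extra position that may be any base does not change the expectation, so allowing either form roughly doubles the expected count.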

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:

RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
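To see why identity alone is not enough, here is a minimal sketch that computes percent identity for an ungapped alignment, applied to the 16 nt example above. The function name is an assumption for illustration.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    print(percent_identity("ACGTACGTACGTACGT", "ACGCACCTTCGTAGGT"))
```

The value carries no information about alignment length or database size, which is exactly what the E-value adds: the same percentage over 60 nt and over 16 nt means very different things.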

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

Are there any useful guidelines though at least for biological meaningfulness


BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse                                                  smaller/better

E-values:  10^-5     10^-10      10^-40       10^-100      0.0
           fantasy   borderline  encouraging  pretty good  can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y
N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W
H         N Q Y
K         R Q E
R         Q K
D         N E
E         D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids count as a 'match' at each position: each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.
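The two protein estimates can be compared side by side with the sketch below. It uses the same simplified model as the slides (independent positions, a fixed chance of a 'match' per position); the function name and the variable names are assumptions.

```python
def protein_chance_matches(database_letters, peptide_length, choices_per_position):
    """Expected chance matches when each position matches with probability 1/choices."""
    return database_letters / choices_per_position ** peptide_length


if __name__ == "__main__":
    refseq_protein_letters = 3.5e8  # ~347,895,532 letters quoted on the slide
    strict = protein_chance_matches(refseq_protein_letters, 20, 20)   # identical residues only
    tolerant = protein_chance_matches(refseq_protein_letters, 20, 7)  # ~3 acceptable residues of 20
    print(strict, tolerant, tolerant / strict)
```

Allowing about three acceptable residues per position raises the estimate by roughly nine orders of magnitude for a 20-residue match, which accounts for most of the gap between the naive value and the E-value BLAST reports.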

These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
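As a sketch of how such a command line might be assembled from a script, the function below builds a blastall argument list using only options listed above (-p, -i, -d, -e, -W). The function name, the file name and the database name are hypothetical, and the command is only printed, not run.

```python
def build_blastall_command(program, query_file, database, max_evalue=None, word_size=None):
    """Assemble a blastall command line as a list of arguments."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_evalue is not None:
        command += ["-e", str(max_evalue)]
    if word_size is not None:
        command += ["-W", str(word_size)]
    return command


if __name__ == "__main__":
    # query.fasta and my_database are placeholder names for illustration.
    print(" ".join(build_blastall_command("blastn", "query.fasta", "my_database",
                                          max_evalue=100, word_size=7)))
```

Options that are left out are simply not passed, so the program falls back to the defaults for the chosen flavour, as described above.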

BLAST Parameters Exercises
1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5 E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises
6 Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Theoretical/Predicted Sequences

genome

predicted gene model: exons 1 2 3 4

We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.

predicted transcript

predicted protein

Sequences for a model organism

ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels

cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences

Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation

So What's in the Databases Now?

NCBI, July 2005

DNA: 15,000,000 ESTs; 3,300,000 cDNAs (nr)

Proteins: 2,700,000 proteins (nr); 950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A

ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

Gene duplication through speciation

The two copies of Gene A will now evolve independently but will continue to have the ~same function.

They are ORTHOLOGS

Paralogs

Gene duplication through internal genome duplication (A becomes A and A')

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function.

They are PARALOGS

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

A

A

A

Arsquo A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

Codon  Letter  Amino acid
GCA    A       alanine
GCC    A
GCG    A
GCT    A
TGC    C       cysteine
TGT    C
GAC    D       aspartate
GAT    D
GGA    G       glycine
GGC    G
GGG    G
GGT    G

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this…

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

First pair: 67% DNA identity, 100% amino acid identity

Second pair: 44% DNA identity, 80% amino acid identity

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
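The effect can be reproduced with a short sketch that translates the two DNA sequences from this slide with the standard genetic code and counts identical positions at each level. The function names are assumptions; the codon table is the standard one, built from the conventional T, C, A, G ordering.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with each base in T, C, A, G order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA sequence in frame 1, ignoring any incomplete final codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_identities(a, b):
    """Number of identical positions between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b))


DNA_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
DNA_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"

if __name__ == "__main__":
    protein_1, protein_2 = translate(DNA_1), translate(DNA_2)
    print(protein_1, protein_2)
    print(count_identities(DNA_1, DNA_2), "of", len(DNA_1), "nucleotides identical")
    print(count_identities(protein_1, protein_2), "of", len(protein_1), "amino acids identical")
```

For this pair, 28 of 39 nucleotides are identical (about 72%) but 11 of 13 amino acids are (about 85%), because most of the DNA differences fall in third codon positions.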

Exercise 1
nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF and click on it; on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Taskexperiment data mining

gene sequence what is its function

database of proteins in other species

Cyclin-AFoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

function known annotation lsquoGravinrsquo available

Human gene

Xenopus gene: function unknown

sequence similarityorthologs

same function But we know that function is largely determined by shape

similar shape

Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein

database of human proteins

best match human protein

database of frog proteins

x
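The reciprocal best BLAST logic can be sketched as below, assuming the best hit in each direction has already been extracted from two BLAST runs (frog proteins against human proteins, and human proteins against frog proteins). The function name and the protein identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit in return."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


if __name__ == "__main__":
    # Hypothetical best-hit tables, one entry per query protein.
    frog_to_human = {"frog_1": "human_7", "frog_2": "human_9"}
    human_to_frog = {"human_7": "frog_1", "human_9": "frog_5"}
    print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_1 and human_7 pick each other, so they are candidate orthologs; frog_2 is dropped because its best human hit prefers a different frog protein.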

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as the best (top) hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

With a single base inserted in the top sequence, the same simple alignment falls apart:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
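The trade-off between matches and gap costs can be made concrete with a small scoring sketch. The code below is a plain Needleman-Wunsch global alignment score with a simple per-position gap cost. The scoring values are illustrative assumptions, and BLAST itself uses a different (local, heuristic) procedure with separate gap opening and extension costs.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed part of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# The two random sequences from the slide
seq1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(seq1, seq2, gap=0)     # 'cost free' gaps
costly_gaps = global_alignment_score(seq1, seq2, gap=-2)  # gaps are penalised
print("score with free gaps:", free_gaps)
print("score with gap cost:", costly_gaps)
```

With free gaps the score of the random pair is inflated by as many matches as can be threaded together; once each gap costs something, the score can only stay the same or fall.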

BLAST

gtqueryAGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively more inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

gttarget1AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGgttarget2CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG gttarget3GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT GGAAAAAAAAATACTGGCTTCC gttarget4CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

[Figure: the flavours of BLAST, showing for each the query sequence, any other operation (6 frame translation), and the database sequences.]

BLASTn: nucleotide query against nucleotide database. FAST
BLASTp: protein query against protein database. FAST
BLASTx: nucleotide query, 6 frame translation, against protein database. SLOW
tBLASTn: protein query against nucleotide database, 6 frame translation. SLOWER
tBLASTx: nucleotide query against nucleotide database, 6 frame translation of both. HORRIBLY SLOW

How does it work

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query: CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC
1st database sequence: CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

[Figure: the database sequence is slid along the query and the matching positions are marked at each offset.]

The best alignment, with a one-base gap in the query:

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible lsquowordsrsquo

Build a position index of all words in the database

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible lsquowordsrsquo

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
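The word and index idea from the last few slides can be sketched with a dictionary. This is a toy version under stated assumptions: it follows the slides in indexing the database, uses the 8-letter words shown in the example, stores the words themselves rather than word numbers, and the function names are invented for illustration. The real BLAST implementation is considerably more elaborate.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map each word to the (sequence number, position) pairs where it occurs (1-based)."""
    index = defaultdict(list)
    for seq_no, seq in enumerate(database, start=1):
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_no, pos + 1))
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence number, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for seq_no, dpos in index.get(word, []):
            seeds.append((qpos + 1, seq_no, dpos))
    return seeds


# The three database sequences and the query from the slides
database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)   # done once, before any search
seeds = find_seeds(query, index, word_size=8)     # one dictionary lookup per query word
print(seeds[:3])
```

Each seed records a relative position (query position minus database position), and only sequences with at least one seed go forward to the slower extension step.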

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size stretches in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
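A quick way to check whether a word-based search could ever seed an alignment like the one above is to measure the longest run of consecutive identical positions. The helper below is a small sketch written for this example.

```python
def longest_identical_run(a, b):
    """Longest stretch of consecutive identical positions in an ungapped alignment."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


# The two sequences from the slide, aligned without gaps
top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_identical_run(top, bottom)
print("longest identical run:", run)
print("seeded at word size 11?", run >= 11)
print("seeded at word size 7?", run >= 7)
```

The longest run here is 10, one short of the default word size of 11, so this obviously similar pair would not be seeded on this alignment unless the word size is reduced.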

BLAST - Typical Output

INPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also called "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: chance occurrences of each query marked along the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG]

Query = 'A': expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC': expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG': expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' (60-mer): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
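The first-principles calculation above is a single division, which the sketch below reproduces. It models exact, ungapped matches with all four letters equally likely, and the function name is made up for this example; it is not how BLAST computes its reported Expect value.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches of a query in a random database."""
    return database_letters / alphabet_size ** query_length


refseq_mrna_letters = 5.0e8   # ~503,566,580 letters, rounded as on the slide
nr_letters = 1.4e10           # ~14,601,814,750 letters

for length in (1, 2, 3, 60):
    e = expected_chance_matches(refseq_mrna_letters, length)
    print(f"{length:>2} nt query against RefSeq mRNA: {e:.1e}")

# Same 60-mer against the larger database: the ratio equals the ratio of database sizes
ratio = (expected_chance_matches(nr_letters, 60)
         / expected_chance_matches(refseq_mrna_letters, 60))
print(f"nr / RefSeq: {ratio:.0f}x")
```

Note that 4^60 is about 1.3 x 10^36, which the slide rounds to 10^36, so the exact figure for the 60-mer comes out a little below 5.0 x 10^-28 (roughly 3.8 x 10^-28). The order of magnitude is what matters here.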

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA, where N is any base.

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.

Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 = 128 nt.

Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).
We would expect ~5 of these promoter sites every 10k by chance.

If the variant with the additional base is also allowed: the two motifs independently have the same E-value. To allow either means we expect twice as many.
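The same arithmetic can be written down directly, and the motif can be counted in a real sequence with a regular expression. The sketch below assumes equal base frequencies and a single strand; the test sequence is invented for the example.

```python
import re


def expected_motif_count(sequence_length, choices_per_position, alphabet_size=4):
    """Expected chance occurrences of a motif, given the allowed bases at each position."""
    probability = 1.0
    for choices in choices_per_position:
        probability *= choices / alphabet_size
    return sequence_length * probability


def count_motif(sequence, pattern):
    """Count occurrences of a regular expression motif, including overlapping ones."""
    return len(re.findall(f"(?={pattern})", sequence))


# ACC[TG]TA: one allowed base at each position except the 4th, which allows two
expected = expected_motif_count(10_000, [1, 1, 1, 2, 1, 1])
print(f"expected by chance in 10 kb: {expected:.1f}")

# Counting in a made-up sequence that contains ACCTTA and ACCGTA
observed = count_motif("TTACCTTAGGACCGTATT", "ACC[TG]TA")
print("observed:", observed)
```

A position where any base is allowed contributes a factor of 4/4 = 1, which is why a variant with one extra free base has the same expectation as the original motif, and allowing either doubles the expected count.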

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (the RefSeq mRNA database had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: chance occurrences of each query marked along the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG]

Query = 'A': expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC': expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG': expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' (60-mer): expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

database

BLAST a 500 nt sequence against a database.

BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Why not just use identity

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what's the real problem

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?


The difficulty is because:

BLAST gives similarity + probability.

ORTHOLOGY also depends on biological knowledge: nature of query sequence, phylogenetic relationship, match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value scale, from larger (worse) to smaller (better): 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale, in the same direction: fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue: acceptable substitutes

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.
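The two protein estimates above differ only in the effective number of equally likely outcomes per position (20 versus about 7). A small sketch of that comparison, using the same naive chance model as the nucleotide case and a function name made up for this example:

```python
def expected_chance_matches(database_letters, query_length, choices_per_position):
    """Expected chance matches when each position matches with probability 1/choices."""
    return database_letters / choices_per_position ** query_length


refseq_protein_letters = 3.5e8   # 347,895,532 letters, rounded as on the slide

exact = expected_chance_matches(refseq_protein_letters, 20, 20)
with_substitutions = expected_chance_matches(refseq_protein_letters, 20, 7)

print(f"exact residues only:        {exact:.1e}")
print(f"allowing ~2 substitutes:    {with_substitutions:.1e}")
print(f"reported by BLAST (slide):  {8e-04:.1e}")
```

Allowing substitutions moves the estimate by about nine orders of magnitude towards the value BLAST reported. The BLOSUM and PAM matrices replace the crude "1 in 7" with a separate score for every pair of residues.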

These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
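If you run many searches, it can help to assemble the command line in a script. The sketch below only builds the argument list for the 'blastall' program described above, using the -p, -i, -d, -e and -W options from these slides; it does not run anything. The file name and database name are placeholders, and the helper function is invented for this example.

```python
import shlex


def blastall_command(program, query_file, database, max_evalue=None, word_size=None):
    """Build a blastall argument list from the options described on the slides."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_evalue is not None:
        command += ["-e", str(max_evalue)]      # -e: max expected value
    if word_size is not None:
        command += ["-W", str(word_size)]       # -W: word size
    return command


# Placeholder file and database names, for illustration only
command = blastall_command("blastn", "query.fasta", "my_database",
                           max_evalue=100, word_size=7)
print(shlex.join(command))
```

Any option left out of the list falls back to the default for the chosen flavour, exactly as on the web page. To actually run the search you would pass the list to something like subprocess.run, on a machine where blastall and a pre-indexed database are installed.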

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Sequences for a model organism

ESTs - millions, £10 each
Cheap to sequence, so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels

cDNAs - tens of thousands, £1000 each
Expensive, but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences

Genomes - one, £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation

So What's in the Databases Now

NCBI, July 2005

DNA: 15,000,000 ESTs; 3,300,000 cDNAs
Proteins: 2,700,000 proteins in nr; 950,000 proteins in RefSeq

Part 2 Comparative Genomics

Evolution by sequence mutation

Gene sequence:

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence…

Speciation

Gene A: ATGAAGGCTGCCTACGACTGCCGTG

Lineage 1:

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Lineage 2:

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

Comparing the two present-day sequences:

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence…

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG

ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||

GCTGACTCGTAGCGCTTAGCTAGCT

CCAACATCTAGCCAGATTAGTTAGT | || | | | |

Orthologs

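[Figure: Gene A in an ancestral species, and a copy of Gene A in each of the two species descended from it.]

Gene duplication through speciation.

The two copies of Gene A will now evolve independently, but will continue to have ~the same function.

They are ORTHOLOGS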

Paralogs

[Figure: Gene A duplicated within one genome to give A and A'.]

Gene duplication through internal genome duplication.

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

A

A

A

Arsquo A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

[Figure: gene A in two modern species, labelled cyclin b1 in both]

Function Conserved Longer than Detectable Similarity

[Figure: timeline starting from the first self-replicating sequence and ending at living organisms, with whole genome duplication and local duplication events marked, and the spans labelled "same function" and "detectable similarity"]

Redundancy in the Genetic Code

Codon  Code  Amino acid
GCA    A     alanine
GCC    A
GCG    A
GCT    A

TGC    C     cysteine
TGT    C

GAC    D     aspartate
GAT    D

GGA    G     glycine
GGC    G
GGG    G
GGT    G

'Synonymous' or 'silent' mutations in the third position of these codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this...
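To see this redundancy across the whole code, here is a small Python sketch that builds the standard genetic code and groups codons by amino acid. The compact 64-letter string is the usual way of writing the standard table with bases in T, C, A, G order; the helper names are invented for this example.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_codons() -> dict:
    """Map each one-letter amino acid code to the sorted list of codons encoding it."""
    groups = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        groups[amino_acid].append(codon)
    return {aa: sorted(codons) for aa, codons in groups.items()}


def is_silent(codon: str, position: int, new_base: str) -> bool:
    """True if changing one base (0-based position) leaves the amino acid unchanged."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[mutated] == CODON_TABLE[codon]


groups = synonymous_codons()
for code in "ACDG":
    print(code, groups[code])
print("GCA -> GCT silent?", is_silent("GCA", 2, "T"))
print("TGC -> TGA silent?", is_silent("TGC", 2, "A"))
```

The last line is a reminder that not every third-position change is silent: TGC (cysteine) to TGA gives a stop codon.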

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG    LSREPV
|| || || || || ||     ||||||
CTGTCCCGTGAGCCAGTT    LSREPV

DNA identity 67%, protein identity 100%

CTATCACGAGAACCTGTG    LSREPV
 | || | || ||         ||| ||
TTGTCCCGGTCGCCAGTT    LSRFPV

DNA identity 44%, protein identity 80%

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
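The comparison above can be checked with a short Python sketch that translates both DNA sequences with the standard genetic code and counts identical positions at each level. It is a simple illustration only: it assumes the reading frame starts at the first base, and the function names are invented here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate in frame 1, ignoring any incomplete codon at the end."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> tuple:
    """Return (identical positions, compared positions) for equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)), len(a)


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(protein_1, protein_2)
print("DNA identity:     %d/%d" % identity(dna_1, dna_2))
print("protein identity: %d/%d" % identity(protein_1, protein_2))
```

For the slide's example this gives 28/39 identical bases (about 72%) but 11/13 identical amino acids (about 85%), and one of the two amino acid differences (M/I) is a conservative substitution.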

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder: paste sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align' change the 'Program' (top left) to blastp...
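The idea behind the ORF-finding step can be sketched in Python. This is a deliberately simplified illustration, not the algorithm NCBI ORF Finder uses: it assumes an ORF runs from the first ATG after a stop to the next stop codon, looks at all six reading frames, and expects upper-case A, C, G, T only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def longest_orf(dna: str) -> str:
    """Longest M...stop protein found in any of the six reading frames."""
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            # Every piece before the last '*' is terminated by a stop codon.
            for piece in translate(strand[frame:]).split("*")[:-1]:
                start = piece.find("M")
                if start != -1 and len(piece) - start > len(best):
                    best = piece[start:]
    return best


toy = "CCATGAAATAGGG"  # made-up sequence containing ATG AAA TAG in frame 3
print(longest_orf(toy))
print(longest_orf(reverse_complement(toy)))
```

Both calls find the same short ORF, because the six-frame search also reads the opposite strand.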

Answers Exercise 1

The Essential Task

[Figure labels: experiment, data mining]

Gene sequence: what is its function?

[Figure: the gene sequence is compared with a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3); result shown: Gravin-like]

We can only do this because of implied function based on orthology.

Functional Orthologs

[Figure: a human gene (function known, annotation 'Gravin' available) and a Xenopus gene (function unknown) are linked by sequence similarity as orthologs, implying the same function]

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

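Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.

[Figure: a frog protein is searched against a database of human proteins; the best match human protein is then searched against a database of frog proteins]

If the best match coming back is the original frog protein, the two are reciprocal best hits and are taken as probable orthologs.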
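The reciprocal best hit logic can be written down in a few lines. The sketch below assumes you have already run both searches and collected a score for each query/subject pair; the protein identifiers and scores are made up purely for illustration.

```python
def best_hits(scores: dict) -> dict:
    """Map each query to its highest-scoring subject.

    `scores` maps (query_id, subject_id) -> score, higher meaning better.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> list:
    """Pairs (a, b) where b is a's best hit AND a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores: frog proteins searched against human, and the reverse.
frog_vs_human = {
    ("frog_1", "human_1"): 310.0,
    ("frog_1", "human_2"): 95.0,
    ("frog_2", "human_2"): 120.0,
}
human_vs_frog = {
    ("human_1", "frog_1"): 305.0,
    ("human_2", "frog_3"): 200.0,
    ("human_2", "frog_2"): 110.0,
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))
```

Here frog_2's best human hit is human_2, but human_2's best frog hit is frog_3, so only the frog_1/human_1 pair is reported.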

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

[Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || ||| | | | |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
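The trade-off between matches and gap costs can be made concrete with a small dynamic programming sketch. This is the classic global alignment (Needleman-Wunsch) score, used here only to illustrate the effect of the gap penalty; it is not how BLAST itself works, and the scoring values (+1 match, -1 mismatch, -2 per gap position) are arbitrary choices for the example.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best achievable score for aligning all of a against all of b."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,            # align a[i-1] with b[j-1]
                               previous[j] + gap,   # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]


# The two random sequences from the slide.
x = "GACACTAGGTCGATGCGTGGTGGCGAGA"
y = "ACGCATCCGGATGTGCACCGTGGAACTG"

print("cost-free gaps:", global_alignment_score(x, y, gap=0))
print("gaps cost 2:   ", global_alignment_score(x, y, gap=-2))
```

With free gaps the score is at least 16, because the slide's alignment with 16 matches and no mismatches is allowed; once each gap position costs something, that alignment is no longer worth building and the score drops well below it.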

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Figure: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

Program   Query sequence   Other operation                              Database sequences   Speed
BLASTn    nucleotide       none                                         nucleotide           FAST
BLASTp    protein          none                                         protein              FAST
BLASTx    nucleotide       6 frame translation of query                 protein              SLOW
tBLASTn   protein          6 frame translation of database              nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database    nucleotide           HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT
Example protein query: MQWCGYRWTYQGYRW
Example nucleotide database: ATGACGATAGATCCCATCAT, CGATAGGACCACCACA, GATAGACCAGGATACATAGGATAATTA, AGCTCGCTTGGCTCGATGGCT
Example protein database: MKJLSPWERSYTRGHYTWER, MGHTVNBZY, MKLPWRHGDBKJGMNDFD, MBKLRPIUHDFRTASGSLKWWRTVBN

How does it work

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence and the matches are marked at each offset; at any one offset only part of the query lines up]

With a single gap the whole overlap lines up:

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| ||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE:

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
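The word and index idea can be sketched in Python using the three database sequences and the query from these slides. This is an illustration of the principle only: it stores the words themselves in a dictionary rather than numbering them, uses a word size of 8 as on the slide, and the function names are invented for the example.

```python
from collections import defaultdict


def words(sequence: str, size: int):
    """Yield (1-based position, word) for every overlapping word in the sequence."""
    for i in range(len(sequence) - size + 1):
        yield i + 1, sequence[i:i + size]


def build_index(database: dict, size: int) -> dict:
    """Pre-index the database: word -> list of (sequence id, position)."""
    index = defaultdict(list)
    for seq_id, sequence in database.items():
        for position, word in words(sequence, size):
            index[word].append((seq_id, position))
    return index


def seed_hits(query: str, index: dict, size: int) -> list:
    """Look up every query word; return (query position, sequence id, db position)."""
    hits = []
    for query_pos, word in words(query, size):
        for seq_id, db_pos in index.get(word, []):
            hits.append((query_pos, seq_id, db_pos))
    return hits


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database, size=8)
hits = seed_hits(query, index, size=8)
print(hits[:3])
```

The index is built once per database, so each query only costs a series of dictionary look-ups. Here the second query word (GACAAATC) is found at position 1 of sequence 1, and the following words keep hitting sequence 1 at the same relative offset, which is the signal that an alignment is worth extending there.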

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
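A quick Python check shows why this pair would be missed. The sketch measures the lengths of the exactly matching stretches in the ungapped alignment above; a word-based search with word size 11 needs at least one stretch of 11 or more to get started. The function name is made up for this example.

```python
from itertools import groupby


def exact_run_lengths(a: str, b: str) -> list:
    """Lengths of consecutive identical stretches in an ungapped alignment."""
    same_flags = [x == y for x, y in zip(a, b)]
    return [sum(1 for _ in group) for same, group in groupby(same_flags) if same]


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

runs = exact_run_lengths(top, bottom)
print("exact stretches:", runs)
print("identity: %d/%d" % (sum(runs), len(top)))
print("longest stretch:", max(runs), "(word size 11 needs 11 or more)")
```

The two sequences are 50/56 identical (about 89%), yet the longest exact stretch is only 10, so no 11-letter word is shared along this alignment. Reducing the word size trades speed for sensitivity in cases like this.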

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: occurrences of each query marked along the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG]

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
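The same back-of-envelope calculation in Python, as a sketch of the reasoning on this slide only. It assumes exact matches, equally likely letters and no gaps, and the function name is made up; it is not the statistical model BLAST actually uses.

```python
def expected_chance_matches(database_letters: float, query_length: int,
                            alphabet_size: int = 4) -> float:
    """Database size divided by the number of possible words of this length."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8   # RefSeq mRNA, rounded as on the slide
nr_letters = 1.4e10      # nr, rounded as on the slide

for length in (1, 2, 3, 60):
    print(length, expected_chance_matches(refseq_letters, length))

ratio = (expected_chance_matches(nr_letters, 60)
         / expected_chance_matches(refseq_letters, 60))
print("nr / RefSeq:", ratio)
```

Using the exact value of 4^60 (about 1.3 x 10^36) rather than the rounded 10^36 gives roughly 3.8 x 10^-28 for the 60-mer, the same order of magnitude as the slide's figure. The last line shows that the expectation scales directly with database size.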

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060

Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060

Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for the RefSeq mRNA search was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
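The arithmetic in this answer can be checked with a short Python sketch. It assumes equal base frequencies, and it writes the optional extra base as N (any base) between the 4th and 5th positions; that spelling is an assumption made for this example, since the second motif is not legible in the slide text. The function name is also made up.

```python
import re


def motif_probability(motif: str) -> float:
    """Chance that a random position starts a match to the motif.

    Supports single bases, bracketed choices such as [TG], and N for any base.
    """
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGTN]", motif):
        if token == "N":
            continue  # any base matches, so the probability is unchanged
        choices = len(token) - 2 if token.startswith("[") else 1
        probability *= choices / 4
    return probability


promoter_length = 10_000
p_core = motif_probability("ACC[TG]TA")
p_with_extra = motif_probability("ACC[TG]NTA")  # assumed form of the second motif

print("one match every", 1 / p_core, "nt")
print("expected in 10k:", promoter_length * p_core)
print("expected if either form is allowed:", promoter_length * (p_core + p_with_extra))
```

The core motif is expected once every 2048 nt, so about 5 times in 10k, and allowing either form doubles that to about 10.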

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

[Figure: occurrences of each query marked along the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG]

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database  Size                 E-value
RefSeq    5.0 x 10^8 letters   5.0e-28
nr        1.4 x 10^10 letters  1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

BLASTn: get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

[Figure: BLAST gives Similarity + Probability; ORTHOLOGY also depends on biological knowledge: nature of query sequence, phylogenetic relationship, match length, PI, size of database...]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <-----> smaller/better

E-value   Rule of thumb
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue  Can be substituted by
A        S
C
F        L W Y
G
I        L M V
L        I M F V
M        I L V
P
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
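The two protein estimates can be reproduced in Python. The substitution table below is one reading of the slide's table (treat it as an assumption), and the calculation is the slide's rough argument, not the scoring that BLAST performs with BLOSUM or PAM matrices.

```python
# Residue -> residues it can commonly be substituted by (as read from the slide).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

average_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)


def expected_chance_matches(database_letters: float, length: int,
                            choices_per_position: float) -> float:
    """Database size divided by (choices per position) ** length."""
    return database_letters / choices_per_position ** length


database_letters = 3.5e8  # RefSeq protein, rounded as on the slide

print("average substitutes per residue:", average_substitutes)
print("exact matching (1/20):  ", expected_chance_matches(database_letters, 20, 20))
print("with substitutes (1/7): ", expected_chance_matches(database_letters, 20, 7))
```

The table averages 2.3 substitutes per residue, in line with "about 2 others". With exact powers, the 1/20 estimate comes out near 3.3 x 10^-18 and the 1/7 estimate near 4.4 x 10^-9, so allowing substitutions alone moves the expectation by about nine orders of magnitude.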

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
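If you want to drive the command line from a script, the command can be assembled as a list of arguments. The Python sketch below only builds and prints the command using the -p, -i and -d options described above; the file name and database name are placeholders, the helper function is invented for this example, and nothing is executed.

```python
import shlex


def blastall_command(program: str, query_file: str, database: str,
                     extra: dict = None) -> list:
    """Build the argument list for a blastall run (does not run anything)."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra or {}).items():
        command += [f"-{flag}", str(value)]  # e.g. {"W": 7} becomes -W 7
    return command


# Placeholder file and database names; -W (word size) and -e (max expected value).
command = blastall_command("blastn", "query.fa", "refseq_rna", {"W": 7, "e": 100})
print(shlex.join(command))
```

Keeping the command as a list (rather than one long string) avoids quoting problems when file names contain spaces, and the list can be handed to subprocess.run on a machine where the program and a pre-indexed database are installed.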

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Figures: BLASTn results; BLASTx results]

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

[Figures: results with filtering ON and OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Figures: results with filtering ON and OFF]

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

[Figures: BLASTn result; tBLASTx result]

Results for Exercise 4 (ii)

[Figures: result with mismatch penalty = -2 (default); result with mismatch penalty = -1]

BLAST Parameters Exercises

5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

So What's in the Databases Now

NCBI, July 2005

DNA: 15,000,000 ESTs; 3,300,000 cDNAs

Proteins: 2,700,000 proteins (nr); 950,000 proteins (RefSeq)

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100,000 years in this gene sequence...

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed to produce fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...
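Residual similarity can be made concrete by counting the positions at which two sequences agree. The following is a minimal Python sketch applied to the sequence pairs on this slide; the function name `percent_identity` is our own choice for illustration, and it assumes equal-length sequences with no gaps.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (gaps are not handled)")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


reference = "ATGCATGCTGCCAACGGATGCCCTG"
recent = "ATGAAAGCCGCCTACGACAGTCCTG"    # diverged for a shorter time
distant = "ATGGAAGGCGCTTAGGATAGTCCAG"   # diverged for a longer time

print(percent_identity(reference, recent))   # 17 of 25 positions match
print(percent_identity(reference, distant))  # 13 of 25 positions match
```

Note that two unrelated DNA sequences already agree at roughly a quarter of positions by chance, which is why the lower figure is much harder to interpret than the higher one.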

Computers Can Detect Homology

In fact, computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

A A

A

Gene duplication through speciation. The two copies of Gene A will now evolve independently but will continue to have ~the same function.

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication.

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function.

They are PARALOGS

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

A

A

A

A' A

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

A

A

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

A A

cyclin b1

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

Codons               Code  Amino acid
GCA, GCC, GCG, GCT   A     alanine
TGC, TGT             C     cysteine
GAC, GAT             D     aspartate
GGA, GGC, GGG, GGT   G     glycine

'Synonymous' or 'silent' mutations, most often in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
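The redundancy can be listed for every amino acid by grouping the 64 codons of the standard genetic code. This is an illustrative Python sketch; the compact table construction (bases in the order T, C, A, G with a 64-letter amino acid string) is a common convention and the variable names are our own.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode ('*' marks a stop codon).
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

# Show the four amino acids used as examples on the slide.
for aa in "ACDG":
    print(aa, sorted(codons_for[aa]))
```

For alanine and glycine all four third-position choices give the same amino acid; for cysteine and aspartate only two of the four do.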

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV
||| ||
LSRFPV

67% 100%

44% 80%

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
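The comparison on this slide can be reproduced by translating both DNA sequences and counting matches at each level. This is a rough sketch under simple assumptions: the sequences are in frame from the first base, the standard genetic code applies, and the helper names `translate` and `count_matches` are our own.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate in frame 1, ignoring any incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_matches(seq_a, seq_b):
    """Number of aligned positions (no gaps) at which the sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(protein_1, protein_2)
print(count_matches(dna_1, dna_2), "of", len(dna_1), "nucleotides match")
print(count_matches(protein_1, protein_2), "of", len(protein_1), "amino acids match")
```

Most of the nucleotide differences fall in third codon positions, so the protein-level identity is noticeably higher than the DNA-level identity.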

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment, data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity: orthologs

same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine - but it is probably SHAPE, not SEQUENCE, that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one in which you wish to find an ortholog.

frog protein

database of human proteins

best match human protein

database of frog proteins

x
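The reciprocal best BLAST check in the diagram above can be sketched in a few lines once the best hit in each direction is known. In this Python sketch the identifiers are hypothetical and the two dictionaries stand in for the top hit of each search; running the actual BLAST searches is outside the scope of the example.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit AND a is b's best hit."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical identifiers, for illustration only.
# Best human hit for each frog protein:
frog_to_human = {"frog_1": "human_7", "frog_2": "human_3", "frog_3": "human_3"}
# Best frog hit for each human protein:
human_to_frog = {"human_7": "frog_1", "human_3": "frog_3"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_2 is dropped: its best human match, human_3, points back to a different frog protein, which is the 'x' case in the diagram.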

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact, we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So, in general, we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this, we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
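The effect of a gap cost can be explored with a small global alignment scorer. The sketch below uses the classic dynamic-programming approach (Needleman-Wunsch) with a simple linear gap penalty; this is a swapped-in textbook method, not what BLAST itself does, and the scores (+1 match, -1 mismatch, -2 per gap position) are illustrative assumptions.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score with a linear gap penalty (Needleman-Wunsch)."""
    # previous[j] = best score aligning the processed part of seq_a with seq_b[:j]
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        current = [i * gap]
        for j in range(1, len(seq_b) + 1):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align the two residues
                previous[j] + gap,               # gap in seq_b
                current[j - 1] + gap,            # gap in seq_a
            ))
        previous = current
    return previous[-1]


random_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
random_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

# With free gaps the score is inflated; with a gap cost it drops sharply.
print("cost-free gaps:", global_alignment_score(random_1, random_2, gap=0))
print("gap cost of 2: ", global_alignment_score(random_1, random_2, gap=-2))
```

With `gap=0` every mismatch can be avoided by opening gaps, so even unrelated sequences score well; a gap cost removes that loophole.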

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program   Query sequence   Other operation                             Database sequences   Speed
BLASTn    nucleotide       (none)                                      nucleotide           FAST
BLASTp    protein          (none)                                      protein              FAST
BLASTx    nucleotide       6 frame translation of query                protein              SLOW
tBLASTn   protein          6 frame translation of database             nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database   nucleotide           HORRIBLY SLOW

How does it work

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the database sequence and the matches are counted at each possible alignment.]

     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
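The word-and-index idea from these two slides can be sketched as follows. This is a simplified Python illustration, not BLAST's actual implementation: it stores the words themselves in a dictionary rather than numbering them, uses a word size of 8 as in the slide's example, and the function names are our own.

```python
from collections import defaultdict

WORD_SIZE = 8  # as in the example word list above


def build_index(database, word_size=WORD_SIZE):
    """Map every word in the database to the (sequence number, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_number, seq in enumerate(database, start=1):
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_number, pos + 1))
    return index


def find_word_hits(query, index, word_size=WORD_SIZE):
    """Return (sequence number, database position, query position) for each shared word."""
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for seq_number, db_pos in index.get(word, []):
            hits.append((seq_number, db_pos, q_pos + 1))
    return hits


database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database)      # done once, in advance
hits = find_word_hits(query, index)
print(hits[:3])
```

The index is built once per database, so each query only costs one dictionary lookup per word; hits that share the same offset between query and database position are the candidates for extension.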

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
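The reason this alignment can be missed is visible from the lengths of its exact-match runs. The short Python sketch below measures them for the alignment shown above; the function name is our own, and it only looks at runs along this one ungapped alignment.

```python
def exact_match_runs(seq_a, seq_b):
    """Lengths of consecutive runs of identical positions in an ungapped alignment."""
    runs, current = [], 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

runs = exact_match_runs(top, bottom)
print(runs)
print("longest run:", max(runs), "(word size 11 needs a run of at least 11)")
```

Although most positions are identical, no run reaches 11, so a word size of 11 finds no seed along this alignment; a smaller word size would.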

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A    A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC             AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
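The first-principles arithmetic above can be checked with a few lines of Python. This is only the naive model used on this slide (exact matches, equal base frequencies, every database position a possible start), not the statistics BLAST actually uses, and the function name is our own.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Naive expectation: database size divided by alphabet_size ** query_length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, as quoted above

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: {expected:.2g} expected chance matches")
```

Note that 4^60 is about 1.3 x 10^36; the slide rounds this to 10^36, which is why the exact figure comes out slightly below 5.0 x 10^-28.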

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

theoretical value was 5.0e-28

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there were an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
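The same reasoning can be written as a small calculation in which each motif position contributes a factor of 4 divided by the number of bases allowed there. This Python sketch assumes equal base frequencies; the function name and the list encoding of the motif are our own, and the wildcard example assumes the optional extra position may be any base.

```python
from math import prod


def expected_spacing(allowed_bases_per_position):
    """Average number of nucleotides between chance occurrences of a motif."""
    return prod(4 / allowed for allowed in allowed_bases_per_position)


# ACC[TG]TA: one allowed base everywhere except the [TG] position.
motif = [1, 1, 1, 2, 1, 1]
spacing = expected_spacing(motif)
print("one site every", spacing, "nt")
print("expected sites in 10,000 nt:", 10_000 / spacing)

# A position where any base is allowed contributes a factor of 4/4 = 1,
# so a variant with one unconstrained extra position has the same spacing.
motif_with_wildcard = [1, 1, 1, 2, 4, 1, 1]
print("with an unconstrained extra position:", expected_spacing(motif_with_wildcard))
```

Allowing either of two equally likely variants roughly doubles the expected count, as stated in the answer above.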

E-values Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A    A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC             AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database...

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse ... smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue  Can be substituted by
A        S
C        (none)
F        L W Y
G        (none)
I        L M V
L        I M F V
M        I L V
P        (none)
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids count as a 'match' at each position: a chance of about 1/7th rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
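The two protein estimates above differ only in the per-position chance of a 'match', which makes the comparison easy to reproduce. This is a back-of-envelope Python sketch of the slide's naive model, not of BLAST's real statistics; the function name and the idea of an 'effective alphabet' of 7 are our own shorthand for the 1/7th figure.

```python
def expected_chance_matches(database_letters, query_length, effective_alphabet):
    """Naive expectation: database size divided by effective_alphabet ** query_length."""
    return database_letters / effective_alphabet ** query_length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # ~347,895,532 letters, as quoted above

exact_only = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matches only (1/20 per position): {exact_only:.1e}")
print(f"allowing substitutions (1/7 per position): {with_substitutions:.1e}")
```

Changing the per-position chance from 1/20 to 1/7 moves the estimate by roughly nine orders of magnitude, which accounts for most of the gap between the naive figure and the reported Expect value of 8e-04.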

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
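If you want to script such searches, the command line above can be assembled programmatically. The Python sketch below only builds and prints the command using the three options described here; the file name and database name are hypothetical, and actually running it would require the legacy `blastall` program and a formatted database to be installed locally.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the basic blastall command line from its three core options."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# Hypothetical file and database names, for illustration only.
command = blastall_command("blastn", "query.fasta", "refseq_rna")
print(shlex.join(command))

# To actually run it (assuming blastall is installed), one could pass the
# list to subprocess.run(command, capture_output=True, text=True).
```

Keeping the command as a list of separate arguments avoids quoting problems if a file name contains spaces.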

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful).

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section (BLASTx). Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section (BLASTx), and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Part 2 Comparative Genomics

ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG

Imagine one mutation gets fixed every 100000 years in this gene sequence...

Gene sequence

Evolution by sequence mutation

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A: ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

Gene duplication through speciation.

The two copies of Gene A will now evolve independently, but will continue to have approximately the same function.

They are ORTHOLOGS.

Paralogs

Gene duplication through internal genome duplication.

The two copies of Gene A (A and A') will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS.

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

Figure label (gene A in both organisms): cyclin b1

Function Conserved Longer than Detectable Similarity

Figure labels: start from first self-replicating sequence; living organisms; same function; detectable similarity; whole genome duplication; local duplication

Redundancy in the Genetic Code

GCA, GCC, GCG, GCT   A   alanine

TGC, TGT             C   cysteine

GAC, GAT             D   aspartate

GGA, GGC, GGG, GGT   G   glycine

'Synonymous' or 'silent' mutations in the third position of these codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them...

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

Identity (DNA / amino acid): first comparison 67% / 100%; second comparison 44% / 80%

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
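The point of the last few slides can be checked with a few lines of code. The following is a minimal sketch, not part of the workshop material: the function names are illustrative, and it uses the standard genetic code with the 18 nt example sequences from the "Protein Similarity Persists Longer" slide.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of identical positions in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# First and last sequences of the mutation series on the slide.
ancestor = "CTATCACGAGAACCTGTG"
descendant = "CTGTCCCGTGAGCCAGTT"

dna_identity = percent_identity(ancestor, descendant)
protein_identity = percent_identity(translate(ancestor), translate(descendant))

print(f"DNA identity:     {dna_identity:.0f}%")
print(f"Protein identity: {protein_identity:.0f}%")
```

Every mutation in this series falls on a third codon position, which is why the DNA identity drops to about 67% while the translated sequences remain identical.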

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment / data mining

gene sequence: what is its function?

database of proteins in other species:

Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3

Gravin-like

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x
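The reciprocal test itself is simple once the two BLAST searches have been run. A minimal sketch follows, assuming the best hit in each direction has already been extracted into a dictionary; the protein identifiers are hypothetical placeholders, not real search results.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where b is the best hit of a AND a is the best hit of b."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best-hit tables (illustrative identifiers only).
frog_to_human = {"frog_1": "human_A", "frog_2": "human_B", "frog_3": "human_B"}
human_to_frog = {"human_A": "frog_1", "human_B": "frog_3"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

In this toy data frog_2 finds human_B, but human_B points back to frog_3, so frog_2 is not reported as a candidate ortholog.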

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as the best top hit, and have a look at the Metazome alignment(s)...

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
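The trade-off between matches and gaps can be made concrete with a simple column-by-column score. This is a minimal sketch under stated assumptions: the scoring values (+1 match, -1 mismatch, -2 per gap column) are illustrative rather than BLAST defaults, and the short aligned fragment is modelled on the start of the 'cost free' alignment above, with its exact column layout assumed.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # every gap column costs something
        elif x == y:
            score += match      # identical residues are rewarded
        else:
            score += mismatch   # substitutions are penalised
    return score


# Illustrative fragment: 4 matches, 4 gap columns, no mismatches.
row_a = "GAC--ACT"
row_b = "-ACGCA-T"

free_gaps = score_alignment(row_a, row_b, gap=0)
penalised = score_alignment(row_a, row_b)  # gap = -2 per column

print("score with cost-free gaps:", free_gaps)
print("score with gap penalty:   ", penalised)
```

With free gaps the fragment looks like a perfect run of matches; once each gap column costs 2, the same alignment scores below zero, which is the behaviour we want for unrelated sequences.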

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program   Query sequence   Other operation                            Database sequences   Speed
BLASTn    nucleotide       none                                       nucleotide           FAST
BLASTp    protein          none                                       protein              FAST
BLASTx    nucleotide       6 frame translation of the query           protein              SLOW
tBLASTn   protein          6 frame translation of the database        nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database  nucleotide           HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT
Example protein query: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

Figure: the query is slid along the 1st database sequence and compared at successive offsets; the best alignment needs a single gap (=) in one of the sequences.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE:

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.
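The indexing and lookup steps described above can be sketched in a few lines. This is an illustration of the idea only, not how NCBI BLAST is implemented: it stores the words themselves rather than word numbers, uses 0-based positions, and takes the three database sequences and the query from the slides. The function names and the word size of 8 are choices made for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word (k-mer) to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (sequence id, database position, query position) for each shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for seq_id, dpos in index.get(word, []):
            seeds.append((seq_id, dpos, qpos))
    return seeds


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

WORD_SIZE = 8
index = build_word_index(database, WORD_SIZE)   # done once, in advance
seeds = find_seeds(query, index, WORD_SIZE)     # done per query: dictionary lookups only

sequences_with_seeds = sorted({seq_id for seq_id, _, _ in seeds})
print("database sequences with at least one word hit:", sequences_with_seeds)
print("first seed (sequence, database position, query position):", seeds[0])
```

Only the sequences that appear in the seed list need to go on to the slower extension step, and the difference between the query and database positions of a seed gives the relative offset at which to start extending.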

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
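Why this alignment is at risk can be checked directly: a word hit on this alignment needs a run of identical positions at least as long as the word size. The sketch below (function name illustrative) measures the longest such run for the two sequences shown above.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of identical positions in an ungapped alignment."""
    best = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_identical_run(top, bottom)
print("longest run of identical positions:", run)
for word_size in (11, 7):
    print(f"word size {word_size}: seed possible on this alignment = {run >= word_size}")
```

The two sequences are about 89% identical over 56 positions, yet no stretch is long enough to seed a word size of 11; dropping the word size to 7 (as in the Word Size exercise) would recover it.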

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA, 272619 sequences, 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr, 3329110 sequences, 14601814750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A': expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC': expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG': expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
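The same first-principles arithmetic, written as a short function. This is a sketch of the simplified model on this slide only (exact, ungapped matches; equal base frequencies); it is not the statistic that BLAST itself computes. The function name is illustrative.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches: database size / alphabet^length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, from the slide

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:>2}: expected matches = {expected:.2g}")
```

Note that 4^60 is about 1.3 x 10^36; the slide rounds it to 10^36, so the value computed here for the 60-mer comes out slightly below the 5.0 x 10^-28 quoted above.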

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.

Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt.

Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
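The answer can be reproduced with a small helper that multiplies the per-position probabilities. This is a minimal sketch under the same assumptions as the slide (all four bases equally likely, one strand only, edge effects ignored); the function names are illustrative and only plain bases and [..] classes are handled.

```python
import re


def motif_match_probability(motif):
    """Chance that a random position matches a motif such as 'ACC[TG]TA'."""
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        allowed_bases = len(set(token.strip("[]")))
        probability *= allowed_bases / 4
    return probability


def expected_occurrences(motif, sequence_length):
    """Expected number of chance matches in a random sequence of the given length."""
    return sequence_length * motif_match_probability(motif)


site = "ACC[TG]TA"
spacing = 1 / motif_match_probability(site)
expected = expected_occurrences(site, 10_000)

print(f"one chance match every {spacing:.0f} nt")
print(f"expected chance matches in 10k: {expected:.1f}")
```

This gives one match every 2048 nt, or a little under 5 per 10k, in line with the answer above.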

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A': expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC': expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG': expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:

RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

BLAST = Similarity + Probability (match length, PI, size of database...)

ORTHOLOGY = biological knowledge, nature of query sequence, phylogenetic relationship

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <-----> smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

Labels along the scale: fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue: common substitutes

A: S, C
F: L, W, Y
G: (none)
I: L, M, V
L: I, M, F, V
M: I, L, V
P: (none)
V: I, L, M
W: F, Y
N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W
H: N, Q, Y
K: R, Q, E
R: Q, K
D: N, E
E: D, Q, K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
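The two protein estimates can be compared side by side. As before this is only the slide's simplified model (independent positions, one fixed chance of a 'match' per position), not the Karlin-Altschul statistics that BLAST actually uses; the function name is illustrative and the database size is the RefSeq protein figure quoted earlier.

```python
def expected_chance_matches(database_letters, match_length, chance_per_position):
    """Expected chance matches when each position 'matches' with a fixed probability."""
    return database_letters * chance_per_position ** match_length


REFSEQ_PROTEIN_LETTERS = 347_895_532  # total letters, from the BLAST output slide

exact_only = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 7)

print(f"exact matches only (1/20 per position):   {exact_only:.1e}")
print(f"allowing substitutions (1/7 per position): {with_substitutions:.1e}")
print(f"ratio: {with_substitutions / exact_only:.1e}")
```

Allowing roughly three acceptable residues per position raises the estimate by about nine orders of magnitude, which accounts for most of the gap between the naive calculation and the E-value reported by BLAST.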

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises
1 BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default options
Then hit [format] to wait for the results in a new page
(hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2 Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database

What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time
Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C
C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT
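When a query has several terms it is easy to mistype, so it can help to build the string programmatically and paste the result into the box. This is a small illustrative sketch: the helper names are invented here, and only the `[ORGN]` field tag and the AND/OR/NOT operators come from the slide.

```python
def organism_term(name):
    """Tag a species name with the Entrez organism field."""
    return f"{name}[ORGN]"

def combine(operator, *terms):
    """Join Entrez terms with AND, OR or NOT, wrapped in parentheses."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(terms) + ")"

# Restrict a search to either fruit fly or human sequences.
fly_or_human = combine(
    "OR",
    organism_term("Drosophila melanogaster"),
    organism_term("Homo sapiens"),
)
print(fly_or_human)
```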

Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window

(i) Run the default comparison, which should be BLASTn. Note the alignment
Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page
Go to the PROTEIN BLAST section, BLASTp, and paste the sequence

Run the search with the default values

Now re-run the search, setting the maximum E-value in the box

Expect 100

What difference does this make?

BLAST Parameters Exercises
6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page
Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search, setting the following parameters

Low complexity OFF
Expect 100
Word Size 7
Other advanced -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?
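For reference, the web-form settings in this last exercise correspond to command line options from the parameter list in Part 4. The sketch below builds the equivalent legacy `blastall` argument list; it does not run anything, the helper name and the file name `morpholino.fasta` are hypothetical, and the mapping of "Low complexity OFF" to `-F F` is my reading of the filter option rather than something stated on the slide.

```python
def blastall_args(program, query_file, database, **options):
    """Build a blastall argument list; keyword names are the option letters."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args

# Short-query settings from Exercise 6: filter off, Expect 100,
# word size 7, mismatch penalty -1.
short_query_search = blastall_args(
    "blastn", "morpholino.fasta", "nr", F="F", e=100, W=7, q=-1
)
print(" ".join(short_query_search))
```

Short oligos such as morpholinos need this kind of loosening because a default word size of 11 and a strict E-value cut-off leave very little room for a 25-mer to seed and report a match.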

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Speciation

ATGAAGGCTGCCTACGACTGCCGTG

ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG

Gene A
ATGAAGGCTGCCTACGACTGCCGTG

ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

If the genetic difference means they can no longer interbreed to give fertile offspring - then we have a new species…
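The match line between the two diverged copies can be produced mechanically, which is a useful sanity check when reading alignments by eye. A minimal sketch, assuming two ungapped sequences of equal length (the function names are mine):

```python
def match_line(a, b):
    """Return '|' under identical positions and ' ' under mismatches."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return "".join("|" if x == y else " " for x, y in zip(a, b))

def percent_identity(a, b):
    """Percentage of positions at which the two sequences agree."""
    return 100.0 * match_line(a, b).count("|") / len(a)

# The two present-day versions of Gene A from the slide.
species_1 = "ATGCATGCTGCCAACGGATGCCCTG"
species_2 = "ATGAAAGCCGCCTACGACAGTCCTG"

print(species_1)
print(match_line(species_1, species_2))
print(species_2)
print(f"{percent_identity(species_1, species_2):.0f}% identical")
```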

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences; this is what we call homology - detectable similarity because of common evolutionary origin

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution homology may no longer be detectable in the DNA sequence…

Computers Can Detect Homology

In fact computers are very good at this task - the two primary challenges are

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

Gene duplication through speciation

The two copies of Gene A will now evolve independently but will continue to have the ~same function

They are ORTHOLOGS

Paralogs

A

Gene duplication through internal genome duplication

The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function

They are PARALOGS

A

A A'

A

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS

A

A

A

A' A

The Essential Paradigm

1 any group of modern species can be traced back to some extinct common ancestor

2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor

3 if we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

[Figure: gene A (cyclin b1) in the common ancestor and in each modern species]

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for - so there is no evolutionary pressure against this…

Protein Similarity Persists Longer

CTATCACGAGAACCTGTGCTATCCCGAGAACCTGTGCTATCCCGAGAACCAGTGCTATCCCGTGAACCAGTGCTATCCCGTGAGCCAGTGCTATCCCGTGAGCCAGTTCTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG

CTGTCCCGTGAGCCAGTT|| || || || || ||

LSREPV

LSREPV||||||

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

67 100

44 80

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
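To see the effect numerically, the two DNA sequences from this slide can be translated with the standard genetic code and compared at both levels. This is a self-contained sketch (no Biopython needed); the helper names are illustrative, and it assumes the reading frame starts at the first base.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate frame 1 of a DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def identity(a, b):
    """Fraction of aligned positions that are identical (ungapped)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(f"DNA identity:     {identity(dna_1, dna_2):.0%}")
print(f"protein identity: {identity(protein_1, protein_2):.0%}")
```

Almost every codon differs in its third position, yet only two of the thirteen amino acids change.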

Exercise 1
nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align' - this will do a direct DNA/DNA comparison

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence

Before you hit 'Align' change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment / data mining

gene sequence what is its function

database of proteins in other species

Cyclin-AFoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity / orthologs

same function

But we know that function is largely determined by shape

similar shape

Which in general we cannot determine - but it is probably SHAPE not SEQUENCE that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in

frog proteindatabase of human proteins

best match human protein

database of frog proteins

x
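Once both BLAST runs have been reduced to "best hit" tables, the reciprocal test itself is a one-liner. The sketch below assumes you have already extracted the top hit for each protein in each direction into two dictionaries; the protein identifiers are invented for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit AND a is b's best hit."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }

# Hypothetical best hits: frog proteins searched against human proteins...
frog_to_human = {"frog_P1": "human_Q1", "frog_P2": "human_Q2", "frog_P3": "human_Q2"}
# ...and human proteins searched back against frog proteins.
human_to_frog = {"human_Q1": "frog_P1", "human_Q2": "frog_P3"}

candidate_orthologs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(candidate_orthologs)
```

Here frog_P2 is rejected: its best human match prefers a different frog protein, which is the situation the 'x' in the diagram represents.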

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections
These of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly although the alignment has no mismatches it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
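The trade-off between matches and gap costs is what dynamic-programming aligners optimise. The following is a minimal global alignment scorer in the Needleman-Wunsch style, written as a teaching sketch: the scores (+1 match, -1 mismatch, -2 per gap position) are arbitrary illustrative choices, and it uses a simple linear gap cost rather than the separate gap-open and gap-extension costs that BLAST exposes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best achievable score for aligning all of a against all of b."""
    # prev[j] holds the best score for a[:i-1] versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]

# One deleted base: three matches minus one gap penalty.
print(global_alignment_score("ACGT", "AGT"))

# Making gaps free rewards exactly the meaningless alignments shown above.
print(global_alignment_score("ACGT", "AGT", gap=0))
```

Setting `gap=0` reproduces the 'cost free' situation: the score then simply counts how many letters can be lined up in order, regardless of how many gaps that takes.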

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

flavour   query sequence   other operation                   database sequences   speed
BLASTn    nucleotide       none                              nucleotide           FAST
BLASTp    protein          none                              protein              FAST
BLASTx    nucleotide       6 frame translation of query      protein              SLOW
tBLASTn   protein          6 frame translation of database   nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both       nucleotide           HORRIBLY SLOW

How does it work

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position
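The indexing idea can be sketched with an ordinary dictionary. This is a simplified illustration rather than BLAST's actual data structure: it stores the words themselves as keys instead of word numbers, uses 0-based positions, and the function names are mine. The three database sequences and the query are the ones from the slides, with a word size of 8.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word in the database to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index

def find_seeds(query, index, word_size):
    """Return (query position, sequence id, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((qpos, seq_id, dpos))
    return seeds

database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)   # done once, ahead of time
seeds = find_seeds(query, index, word_size=8)     # one dictionary lookup per query word
print(seeds[:3])
```

The index is built once per database, so each search only costs one lookup per query word; the expensive alignment step is then attempted only around the seeds.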

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC
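You can confirm why this pair would be missed by measuring the longest stretch of consecutive identical bases. A small sketch, assuming the two sequences are already lined up without gaps (the function name is illustrative):

```python
def longest_exact_run(a, b):
    """Length of the longest run of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

seq_1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_exact_run(seq_1, seq_2)
print(f"longest exact run: {run} nt")
print("seeded at word size 11" if run >= 11 else "no 11 nt word in common")
```

The sequences are about 89% identical, but no exact run reaches 11 bases, so a word size of 11 never finds a starting point. Dropping the word size to 7 would.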

Care is sometimes needed…

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA
272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr
3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
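The same arithmetic in Python, as a sketch of the first-principles model only (BLAST's real E-values come from its scoring statistics, not from this formula). The function name is made up for illustration. Note that 4^60 is about 1.3 x 10^36, so computing it exactly gives a slightly smaller number than the rounded 10^36 used on the slide, though of the same order of magnitude.

```python
def naive_evalue(db_letters, query_len, alphabet_size=4):
    """Expected number of exact chance matches: database size / alphabet^length."""
    return db_letters / alphabet_size ** query_len

REFSEQ_MRNA_LETTERS = 5.0e8  # ~503566580 letters, rounded as on the slide

for query_len in (1, 2, 3, 60):
    expected = naive_evalue(REFSEQ_MRNA_LETTERS, query_len)
    print(f"{query_len:>2} nt query: {expected:.1e} chance matches expected")
```

Passing a different `db_letters` value shows directly how the expected count scales with database size.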

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
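The motif calculation generalises to any pattern made of fixed bases and bracketed alternatives. Below is a sketch using exact fractions; it assumes equal base frequencies, counts one strand only, and ignores the few positions lost at the end of the sequence. The helper name and the list-of-strings representation of the motif are my own.

```python
from fractions import Fraction

def motif_probability(allowed_per_position):
    """Chance that a random position starts a match, given the bases allowed at each site."""
    probability = Fraction(1)
    for allowed in allowed_per_position:
        probability *= Fraction(len(allowed), 4)
    return probability

# ACC[TG]TA written as the set of bases allowed at each position.
binding_site = ["A", "C", "C", "TG", "T", "A"]

p = motif_probability(binding_site)
expected_in_10k = 10_000 * p

print(f"one match expected every {1 / p} nt")
print(f"~{float(expected_in_10k):.1f} chance matches per 10 kb")
```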

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each

database  size                 E-value
RefSeq    5.0 x 10^8 letters   5.0e-28
nr        1.4 x 10^10 letters  1.4e-26

~30 x bigger

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values different

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28
But our actual BLAST search at NCBI gave a value of

2.0 x 10^-26 - about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values Effect of Query Length

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identity

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what's the real problem

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length PI size of databasehellip

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse ... smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence
Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.3 x 10^-18

But this is what we get if we run the blast at NCBI

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

(residue: residues that can substitute for it)

A: S C
F: L W Y
G: -
I: L M V
L: I M F V
M: I L V
P: -
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids count as a 'match' at each position: about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
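The two back-of-envelope protein E-values above can be checked with a few lines of Python. This is only a sketch of the slide's first-principles argument (database letters multiplied by the per-position chance of a 'match'), not the statistic BLAST itself computes; the function name is made up for illustration.

```python
def expected_chance_matches(db_letters: float, match_prob: float, length: int) -> float:
    """Expected number of chance matches for a query of `length` residues,
    assuming every position matches independently with probability `match_prob`."""
    return db_letters * match_prob ** length


DB_LETTERS = 3.5e8  # ~347895532 letters in the RefSeq protein database (from the slide)

# Exact identity only: each position matches 1 time in 20.
exact = expected_chance_matches(DB_LETTERS, 1 / 20, 20)

# Allowing substitutions: each position 'matches' about 1 time in 7.
substitutable = expected_chance_matches(DB_LETTERS, 1 / 7, 20)

print(f"exact identity only : {exact:.1e}")
print(f"with substitutions  : {substitutable:.1e}")
```

The exact-identity figure comes out slightly below the slide's ~3.5 x 10^-18 because the slide rounds 20^20 to 10^26.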

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
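If you script your searches, the same command line can be assembled programmatically. A minimal Python sketch follows; it only builds and prints the command (it does not run it), the file name query.fasta is hypothetical, and the only flags used are ones listed on these slides (-p, -i, -d, -e, -W).

```python
import shlex


def blastall_command(program: str, query_file: str, database: str, **options: object) -> list:
    """Build a 'blastall' argument list from the three basic parameters
    plus any extra single-letter options given as keyword arguments."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


# Hypothetical query file; -e (max expected value) and -W (word size) as extras.
cmd = blastall_command("blastx", "query.fasta", "nr", e=1e-5, W=3)
print(shlex.join(cmd))
```

Keeping the arguments as a list (rather than one long string) means the command could later be handed to subprocess.run without any quoting problems.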

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence

ON

OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.

Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Residual Similarity

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG

After longer periods of evolution, homology may no longer be detectable in the DNA sequence...

Computers Can Detect Homology

In fact computers are very good at this task. The two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
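To make 'detectable similarity' concrete, here is a small Python sketch that draws the match line and computes percent identity for two equal-length, ungapped sequences. The function names are invented for this example; the sequences are the two pairs shown on the slide.

```python
def match_line(a: str, b: str) -> str:
    """Return a line with '|' under identical positions and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no gaps handled here)")
    return "".join("|" if x == y else " " for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions that are identical."""
    return 100.0 * match_line(a, b).count("|") / len(a)


related = ("ATGCATGCTGCCAACGGATGCCCTG", "ATGAAAGCCGCCTACGACAGTCCTG")
unrelated = ("GCTGACTCGTAGCGCTTAGCTAGCT", "CCAACATCTAGCCAGATTAGTTAGT")

for top, bottom in (related, unrelated):
    print(top)
    print(match_line(top, bottom))
    print(bottom)
    print(f"{percent_identity(top, bottom):.0f}% identity\n")
```

Note that two unrelated DNA sequences are still expected to agree at about 1 position in 4, which is why challenge (b) is the hard one.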

Orthologs

Gene duplication through speciation.

The two copies of Gene A will now evolve independently, but will continue to have the ~same function.

They are ORTHOLOGS.

Paralogs

Gene duplication through internal genome duplication (A becomes A and A').

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS.

'Other'-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

(Diagram labels: gene A in the ancestor and in each modern species, e.g. cyclin b1.)

Function Conserved Longer than Detectable Similarity

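[Diagram: timeline starting from the first self-replicating sequence and ending at living organisms, with whole genome duplication and local duplication events marked, and bars showing how far back 'same function' and 'detectable similarity' extend.]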

Redundancy in the Genetic Code

alanine (A): GCA, GCC, GCG, GCT

cysteine (C): TGC, TGT

aspartate (D): GAC, GAT

glycine (G): GGA, GGC, GGG, GGT

'Synonymous' or 'silent' mutations in the third position of codon triplets like these have no effect on the amino acid coded for, so there is no evolutionary pressure against them...

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

DNA identity 67%, protein identity 100%

CTATCACGAGAACCTGTG
 | || | || ||
TTGTCCCGGTCGCCAGTT

LSREPV
||| ||
LSRFPV

DNA identity 44%, protein identity 80%

Always Compare Protein Sequences

DNA comparison:
ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:
MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
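The comparison above is easy to reproduce. The sketch below builds the standard genetic code, translates the two DNA sequences from this slide, and compares identity at the DNA and protein levels. It is a toy illustration only (frame 1, no gaps, no ambiguity codes), and the helper names are made up.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA sequence in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> float:
    """Fraction of positions that are identical in two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna_a = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_b = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_a, protein_b = translate(dna_a), translate(dna_b)

print(protein_a, protein_b)
print(f"DNA identity     {identity(dna_a, dna_b):.0%}")
print(f"protein identity {identity(protein_a, protein_b):.0%}")
```

Almost every codon here differs only in its third position, which is why the protein sequences stay so much closer than the DNA.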

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two sequences comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment -> data mining

gene sequence: what is its function?

database of proteins in other species

[Diagram: the gene sequence is compared with database entries labelled Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3; the match is Gravin-like.]

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

[Diagram: frog protein -> BLAST against database of human proteins -> best match human protein -> BLAST against database of frog proteins -> is the best match the original frog protein?]
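The reciprocal-best-hit logic can be sketched in a few lines of Python. Everything here is hypothetical: the protein names and E-values are invented toy data standing in for parsed BLAST results, and real pipelines usually add further filters (alignment coverage, score ties and so on).

```python
def best_hit(hits: dict):
    """Return the target with the smallest (best) E-value, or None if there are no hits."""
    return min(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> list:
    """Pairs (a, b) where b is a's best hit AND a is b's best hit."""
    pairs = []
    for a, hits in a_vs_b.items():
        b = best_hit(hits)
        if b is not None and best_hit(b_vs_a.get(b, {})) == a:
            pairs.append((a, b))
    return pairs


# Invented E-values: query -> {target: E-value}
frog_vs_human = {
    "frog_A": {"human_A": 1e-80, "human_B": 1e-20},
    "frog_B": {"human_A": 1e-15},
}
human_vs_frog = {
    "human_A": {"frog_A": 1e-78, "frog_B": 1e-14},
    "human_B": {"frog_A": 1e-19},
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))
```

In this toy data frog_B also hits human_A, but human_A's own best hit is frog_A, so only the frog_A/human_A pair survives the reciprocal test.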

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | | | | | ||| |||| ||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
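The effect of the gap cost can be seen with a tiny global alignment scorer. This is a textbook dynamic programming sketch (Needleman-Wunsch with a simple per-gap cost), not what BLAST does internally, and the scoring values (+1 match, -1 mismatch, 0 or -2 per gap) are arbitrary choices for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]       # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                              # a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# The two random sequences from the slide.
seq1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(seq1, seq2, gap=0)
costly_gaps = global_alignment_score(seq1, seq2, gap=-2)
print("score with cost-free gaps:", free_gaps)
print("score with gap cost of 2 :", costly_gaps)
```

With free gaps the score is just the number of letters that can be lined up in order, so even random sequences look good; charging for each gap pulls the score back down.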

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

program: query sequence / other operation / database sequences / speed

BLASTn: nucleotide / none / nucleotide / FAST
BLASTp: protein / none / protein / FAST
BLASTx: nucleotide / 6 frame translation of the query / protein / SLOW
tBLASTn: protein / 6 frame translation of the database / nucleotide / SLOWER
tBLASTx: nucleotide / 6 frame translation of both / nucleotide / HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

[Diagram: the query is slid along the 1st database sequence and compared at each offset; the best alignment needs a gap (=).]

The two sequences compared:
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:
1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':
AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:
sequence position word
1 1 33658
1 2 07967
1 3 16210
3 15 33568
3 16 07967

Analyse the Query Sequence

QUERY SEQUENCE:
>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':
AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:
position word
1 14236
2 33658
3 07967

Index of database:
sequence position word
1 1 33658
1 2 07967
1 3 16210
3 15 33568
3 16 07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.
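The word-and-index idea can be sketched as a dictionary lookup. This is a simplified illustration, not BLAST's actual data structure: words are stored as strings rather than numbers, positions are counted from 0, and the word size of 8 simply matches the 8-letter words in the slide's example. The database and query are the ones shown on the slides.

```python
from collections import defaultdict

WORD_SIZE = 8


def words(seq: str, k: int) -> list:
    """All overlapping k-letter words in seq, with their 0-based start positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(database: dict, k: int) -> dict:
    """Map every word to the list of (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in words(seq, k):
            index[word].append((name, pos))
    return index


def seed_hits(query: str, index: dict, k: int) -> list:
    """(query position, sequence name, database position) for every shared word."""
    return [(qpos, name, dpos)
            for qpos, word in words(query, k)
            for name, dpos in index.get(word, [])]


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database, WORD_SIZE)   # done once, in advance
hits = seed_hits(query, index, WORD_SIZE)  # fast: one lookup per query word
print(sorted({name for _, name, _ in hits}))
print(hits[:3])
```

The index is built once per database, so each search only costs one lookup per query word; only the sequences that turn up in the hit list need the expensive alignment step.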

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
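A quick check of why this pair is a problem: the Python sketch below measures the longest run of consecutive identical positions in the alignment shown (same offset, no gaps). The function name is made up for the example.

```python
def longest_exact_run(a: str, b: str) -> int:
    """Longest stretch of consecutive identical positions, comparing a and b
    position by position at the same offset (no gaps)."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

matches = sum(x == y for x, y in zip(top, bottom))
print(f"{matches}/{len(top)} identical positions")
print("longest exact run:", longest_exact_run(top, bottom))
```

The sequences agree at most positions, yet no exact stretch in this alignment reaches 11 letters, so a word size of 11 gives BLASTn nothing to seed from here.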

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query 26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196
           MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct 1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
           SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct 60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant?

Here is a 'typical' weak alignment from BLASTp:

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

Also "expect value" or "expectation".

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
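The same arithmetic in Python, as a sketch of the first-principles estimate only (exact matches, no gaps, all four bases equally likely); the function name is invented for the example.

```python
def expected_chance_matches(db_letters: float, query_length: int,
                            alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches of a query of the given length."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~503566580 letters, from the slide

for length in (1, 2, 3, 60):
    e = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: expected matches = {e:.1e}")
```

For the 60-mer this gives a slightly smaller number than the slide, because the slide rounds 4^60 (about 1.3 x 10^36) down to 10^36.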

E-values In Practice

So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA with one extra base inserted.

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
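The first part of the answer can be worked out in code. The sketch below handles only plain bases and bracketed alternatives like [TG], assumes all four bases are equally likely, and (like the slide) ignores the small edge effect at the end of the sequence; the function name is made up.

```python
import re


def site_probability(motif: str) -> float:
    """Chance that a random position starts a match to the motif, for motifs
    written with plain bases (A, C, G, T) and bracketed alternatives like [TG]."""
    probability = 1.0
    for alternatives, single in re.findall(r"\[([ACGT]+)\]|([ACGT])", motif):
        allowed = set(alternatives or single)
        probability *= len(allowed) / 4
    return probability


PROMOTER_LENGTH = 10_000

p = site_probability("ACC[TG]TA")
expected = PROMOTER_LENGTH * p
print(f"one site expected every {1 / p:.0f} nt")
print(f"expected sites in 10k: {expected:.1f}")
```

Any alternative motif with the same per-position probabilities would contribute the same expected number again, which is the slide's point about allowing either form.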

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq mRNA and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what’s the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship

BLAST: Similarity + Probability (match length, PI, size of database…)

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse                                                smaller/better
E-values:  10^-5     10^-10       10^-40        10^-100       0.0
           fantasy   borderline   encouraging   pretty good   can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Substitutable by
A         S
C
F         L W Y
G
I         L M V
L         I M F V
M         I L V
P
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’, rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
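The two protein estimates above can be reproduced with a short calculation. This is a sketch of the slide's back-of-envelope model only, not of how BLAST actually computes E-values; the function name is invented for illustration, and "about 7 effective choices per position" is the rough figure from the text.

```python
# RefSeq protein database size quoted in the BLAST output above.
DB_LETTERS = 347_895_532


def naive_protein_e_value(match_length, choices_per_position):
    """Expected chance matches if each position matches with
    probability 1 / choices_per_position."""
    return DB_LETTERS / choices_per_position ** match_length


exact_only = naive_protein_e_value(20, 20)         # identical residues only
with_substitutions = naive_protein_e_value(20, 7)  # ~2 acceptable substitutes each
print(f"exact matches only:    {exact_only:.1e}")
print(f"allowing substitutes:  {with_substitutions:.1e}")
```

Allowing substitutions raises the estimate by roughly nine orders of magnitude, which is why the exact-match calculation falls so far below the value BLAST reports.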

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
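If you want to script such searches, the command line can be assembled programmatically. The sketch below only builds and prints the command, using the three options described above; the file name and database name are hypothetical placeholders, and actually running it would require a local installation of the blastall program.

```python
import shlex


def build_blastall_command(program, query_file, database):
    """Assemble the simplest blastall command line as an argument list."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# 'query.fasta' and 'refseq_mrna' are placeholder names for illustration.
command = build_blastall_command("blastn", "query.fasta", "refseq_mrna")
print(shlex.join(command))
```

Keeping the command as a list (rather than one string) means it could later be passed to Python's subprocess module without worrying about shell quoting.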

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries, or pick from an organism list, to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1: BLASTn, BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn – low complexity filtering ON

Results for Exercise 2B: ON, OFF

Results for Exercise 2C: ON, OFF

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)         AGAAAAGAAGAAACATGGCAATGGATCAGAA
                   |||||||||||||||| ||||||||||||||
Genomic sequence   AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the “Limit by entrez query” box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs, or paralogs – or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i): BLASTn, tBLASTx

Results for Exercise 4 (ii): Mismatch penalty = -2 (default), Mismatch penalty = -1

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Computers Can Detect Homology

In fact computers are very good at this task – the two primary challenges are:

(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist’s attention span

(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…

ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT

Orthologs

Gene duplication through speciation

The two copies of Gene A will now evolve independently, but will continue to have the ~same function.

They are ORTHOLOGS

Paralogs

Gene duplication through internal genome duplication

The two copies of Gene A (A and A’) will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS

‘Other’-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

cyclin b1

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

GCA  A  alanine
GCC  A
GCG  A
GCT  A

TGC  C  cysteine
TGT  C

GAC  D  aspartate
GAT  D

GGA  G  glycine
GGC  G
GGG  G
GGT  G

‘Synonymous’ or ‘silent’ mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is no evolutionary pressure against this…

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

DNA identity: 67%

LSREPV
||||||
LSREPV

Protein identity: 100%

CTATCACGAGAACCTGTG
TTGTCCCGGTCGCCAGTT

DNA identity: 44%

LSREPV
||| ||
LSRFPV

Protein identity: 80%

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
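The comparison above can be reproduced with a small script. This is a minimal sketch, assuming the standard genetic code and sequences that are already aligned and in frame; the helper names are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of identical positions between two pre-aligned sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / min(len(a), len(b))


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(protein_1, protein_2)
print(f"DNA identity:     {percent_identity(dna_1, dna_2):.0f}%")
print(f"protein identity: {percent_identity(protein_1, protein_2):.0f}%")
```

Most of the DNA differences fall in the third codon position, so they disappear on translation and the protein identity comes out higher than the DNA identity.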

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, ‘surfeit1’ for frog and fly.

Go to the NCBI BLAST home page, then ‘Align two sequences’ (bottom left, ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – on the next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit ‘Align’, change the ‘Program’ (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment, data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation ‘Gravin’ available

Xenopus gene: function unknown

sequence similarity: orthologs

same function

But we know that function is largely determined by shape.

similar shape

Which, in general, we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x
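The reciprocal test itself is simple once the two BLAST searches have been run. The sketch below assumes the best hit for every protein has already been collected into two dictionaries (one per search direction); the protein identifiers are invented placeholders, not real accessions.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where b is a's best hit AND a is b's best hit."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best hits: frog proteins searched against human, and the reverse.
frog_to_human = {"frog_1": "human_A", "frog_2": "human_B", "frog_3": "human_B"}
human_to_frog = {"human_A": "frog_1", "human_B": "frog_2"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_3 is dropped: its best human match prefers frog_2, so the pair is not reciprocal, which is the kind of case where a paralog may be involved.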

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These, of course, represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no ‘real’ similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow ‘cost free’ gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’.

We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
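The trade-off between matches and gap costs can be illustrated with a small dynamic-programming score. This is a generic global-alignment (Needleman-Wunsch style) sketch with a simple per-gap-position cost, chosen for illustration; it is not the scoring scheme BLAST uses, and the default scores are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a linear gap cost."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


# The two random sequences from the slide above.
random_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
random_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(random_1, random_2, gap=0)
costly_gaps = global_alignment_score(random_1, random_2, gap=-2)
print(f"score with cost-free gaps: {free_gaps}")
print(f"score with gap cost of 2:  {costly_gaps}")
```

With cost-free gaps even unrelated sequences collect a high score, because every shared letter can be lined up; charging for gaps removes most of that spurious similarity.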

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs, which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program   Query sequence   Other operation                        Database sequences   Speed
BLASTn    nucleotide       none                                   nucleotide           FAST
BLASTp    protein          none                                   protein              FAST
BLASTx    nucleotide       6 frame translation of the query       protein              SLOW
tBLASTn   protein          6 frame translation of the database    nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both            nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
1st database sequence

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible ‘words’

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence   position   word
1          1          33568
1          2          07967
1          3          16210
3          15         33568
3          16         07967

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse QUERY SEQUENCE against the numbered list of all possible ‘words’

position   word
1          14236
2          33568
3          07967

Index of database

sequence   position   word
1          1          33568
1          2          07967
1          3          16210
3          15         33568
3          16         07967
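The word-and-index idea can be sketched in a few lines. This toy version stores the words themselves as dictionary keys rather than the word numbers shown on the slides, uses 8-letter words as in the illustration (not the BLASTn default), and is only meant to show the principle; the function names are invented.

```python
from collections import defaultdict

WORD_SIZE = 8


def words(sequence, size=WORD_SIZE):
    """Yield (1-based position, word) for every overlapping word."""
    for start in range(len(sequence) - size + 1):
        yield start + 1, sequence[start:start + size]


def build_index(database, size=WORD_SIZE):
    """Map each word to the (sequence number, position) pairs where it occurs."""
    index = defaultdict(list)
    for sequence_number, sequence in database.items():
        for position, word in words(sequence, size):
            index[word].append((sequence_number, position))
    return index


def find_seeds(query, index, size=WORD_SIZE):
    """Return (query position, sequence number, database position) word hits."""
    return [
        (query_position, sequence_number, database_position)
        for query_position, word in words(query, size)
        for sequence_number, database_position in index.get(word, [])
    ]


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_index(database)
seeds = find_seeds(query, index)
print(seeds[:3])
```

The index is built once for the whole database, so each query only needs dictionary look-ups to find candidate sequences and the relative offset at which to start extending an alignment.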

Expand from Word Based Matches

We ‘instantly’ know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
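A quick check makes the problem concrete. The sketch below measures the longest run of consecutive identical positions in the alignment above and compares it with the word size; the function name is invented, and it only considers the ungapped alignment as shown.

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


sequence_1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
sequence_2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

identical = sum(x == y for x, y in zip(sequence_1, sequence_2))
longest = longest_exact_run(sequence_1, sequence_2)
print(f"{identical}/{len(sequence_1)} identical, longest exact run = {longest}")
```

The two sequences are about 89% identical, yet no exact stretch along this alignment reaches 11 letters, so a word size of 11 would never seed it. Lowering the word size is the usual remedy for short or diverged queries.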

BLAST – Typical Output

INPUT

>partial cDNA sequence, Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also “expect value” or “expectation”.
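An expected count can be turned into a probability if we assume that chance matches occur independently, so that their number follows a Poisson distribution (an assumption added here for illustration; it is the usual one in sequence-search statistics). The function name is invented.

```python
import math


def probability_of_chance_match(e_value):
    """P(at least one chance match) = 1 - exp(-E), assuming a Poisson model."""
    return -math.expm1(-e_value)


for e_value in (1e-10, 0.01, 1.0, 10.0):
    print(f"E = {e_value:g}: P(at least one chance match) = "
          f"{probability_of_chance_match(e_value):.3g}")
```

For small E-values the probability is practically equal to the E-value itself, while an E-value of 1 corresponds to roughly a 63% chance of seeing at least one such match with no biological relationship at all.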

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
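The same arithmetic is easy to script. This sketch implements only the first-principles model above (equal base frequencies, exact ungapped matches); the function name is invented for illustration. Note that 4^60 is about 1.3 x 10^36, so the exact result for the 60-mer comes out slightly below the rounded 5.0 x 10^-28.

```python
DB_LETTERS = 5.0e8  # approximate size of the RefSeq mRNA database


def expected_chance_matches(query_length, db_letters=DB_LETTERS, alphabet_size=4):
    """Expected number of exact chance matches for a query of the given length."""
    return db_letters / alphabet_size ** query_length


for length in (1, 2, 3, 60):
    print(f"{length:>2}-mer: {expected_chance_matches(length):.1e}")
```

Passing alphabet_size=20 gives the corresponding naive estimate for protein queries.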

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA, OR ACC[TG]TA with one extra base (any nucleotide) inserted after the [TG].

E-value Exercise Answer

ACC[TG]TA

Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4 x 4 x 4 = 64 nt
Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64 x 2 nt = 128 nt
Expect ‘TA’ every 4 x 4 = 16 nt
Expect ‘ACC[TG]TA’ every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If the version with the additional base is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
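The answer can be verified numerically. In this sketch each motif is described by the number of acceptable bases at each position (1 for a fixed base, 2 for [TG], 4 for "any base"); that representation and the function name are choices made here for illustration, and edge effects and overlapping matches are ignored.

```python
def expected_occurrences(sequence_length, allowed_bases_per_position):
    """Expected chance occurrences of a motif, assuming equal base frequencies."""
    probability = 1.0
    for allowed in allowed_bases_per_position:
        probability *= allowed / 4
    return sequence_length * probability


core_motif = [1, 1, 1, 2, 1, 1]          # ACC[TG]TA
spaced_motif = [1, 1, 1, 2, 4, 1, 1]     # ACC[TG], any one base, then TA

core = expected_occurrences(10_000, core_motif)
spaced = expected_occurrences(10_000, spaced_motif)
print(f"core motif:   {core:.2f} per 10k")
print(f"spaced motif: {spaced:.2f} per 10k")
print(f"either:       {core + spaced:.2f} per 10k")
```

A position where any base is accepted contributes a factor of 1, which is why the longer motif is expected just as often as the core motif and allowing either doubles the total.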

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = ‘A’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = ‘AC’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = ‘ACG’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = ‘ACGTCGA…CTGATTCG’ (a 60-mer)

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)
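The same back-of-envelope calculation can be written as a small Python function. This is a sketch of the first-principles model only (exact matches, equal base frequencies); it is not how BLAST itself computes E-values, and the function name is invented here. Note that 4^60 is closer to 1.3 x 10^36 than to 10^36, so the exact figures come out a little smaller than the rounded ones on the slide.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of chance exact matches of a query of the given
    length in a database of the given size (naive model)."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8   # ~5.0 x 10^8, from the database statistics
NR_LETTERS = 1.4e10           # ~1.4 x 10^10

for name, size in (("RefSeq mRNA", REFSEQ_MRNA_LETTERS), ("nr", NR_LETTERS)):
    for length in (1, 2, 3, 60):
        value = expected_exact_matches(size, length)
        print(f"{name:12s} {length:2d}-mer: {value:.1e}")

# The ratio between the two databases is the same for every query length
ratio = (expected_exact_matches(NR_LETTERS, 60)
         / expected_exact_matches(REFSEQ_MRNA_LETTERS, 60))
print(f"nr / RefSeq mRNA = {ratio:.0f}x")
```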

E-values: Effect of Database Size

The E-value is simply dependent on database size.

RefSeq mRNA: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:

RefSeq mRNA: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our ‘match’ criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
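To get a feel for how quickly the set of acceptable targets grows, the sketch below enumerates every database string that would align to the query with exactly one single-base gap in the query. It is an illustration only: the function name is invented, and real gapped alignment also allows gaps in the target, longer gaps and mismatches.

```python
def single_insertion_variants(query, alphabet="ACGT"):
    """All distinct strings made by inserting one base anywhere in query.

    Each of these would align to the query with a single one-base gap.
    """
    variants = set()
    for i in range(len(query) + 1):
        for base in alphabet:
            variants.add(query[:i] + base + query[i:])
    return variants


query = "ACGTACGTACGT"
variants = single_insertion_variants(query)

print("ungapped: 1 acceptable target string")
print(f"one gap allowed: {len(variants)} additional target strings, e.g.")
for example in sorted(variants)[:3]:
    print("  ", example)
```

Both examples shown above (ACGTACAGTACGT and ACGTACCGTACGT) are members of this set.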

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database using BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database using BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
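A rough way to see why the same percent identity means such different things at different lengths is to ask how many windows in the database would reach that identity purely by chance. The sketch below uses a simple binomial model (ungapped windows, each position matching with probability 1/4, database size taken as the ~5.0 x 10^8 letters of RefSeq mRNA). These are assumptions made for illustration, and the numbers it prints are not the BLAST E-values quoted above; only the contrast between the two cases matters.

```python
from math import comb


def chance_windows(database_letters, length, min_identical, p_match=0.25):
    """Expected number of database windows of the given length that agree
    with the query at min_identical or more positions by chance
    (binomial tail, ungapped, independent positions)."""
    tail = sum(
        comb(length, k) * p_match ** k * (1 - p_match) ** (length - k)
        for k in range(min_identical, length + 1)
    )
    return database_letters * tail


DATABASE_LETTERS = 5.0e8

long_hit = chance_windows(DATABASE_LETTERS, 60, 45)    # 45/60 = 75% identity
short_hit = chance_windows(DATABASE_LETTERS, 16, 12)   # 12/16 = 75% identity

print(f"60 nt at 75% identity: {long_hit:.1e} chance windows expected")
print(f"16 nt at 75% identity: {short_hit:.1e} chance windows expected")
```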

So what’s the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

[Diagram: ‘The difficulty is because’ BLAST gives Similarity + Probability, whereas the target is ORTHOLOGY. Factors shown: biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database…]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

[Scale: larger = worse, smaller = better. E-values marked along the scale: 10^-5, 10^-10, 10^-40, 10^-100, 0.0. Labels along the scale: fantasy, borderline, encouraging, pretty good, can’t get better.]

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.3 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Substitutable by
A        S
C        (none)
F        L W Y
G        (none)
I        L M V
L        I M F V
M        I L V
P        (none)
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
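The two protein estimates can be reproduced directly. As before this is only the naive first-principles model from the slides, with an invented function name; the ‘1 in 7’ figure is the slide's rough allowance for substitutable residues, not a value taken from a BLOSUM or PAM matrix.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532   # from the BLAST output header
QUERY_LENGTH = 20                      # amino acids


def expected_matches(database_letters, length, one_in_n_per_position):
    """Expected chance matches when each position matches 1 time in N."""
    return database_letters / one_in_n_per_position ** length


exact_only = expected_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 20)
with_substitutions = expected_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 7)

print(f"exact matches only (1 in 20):     {exact_only:.1e}")
print(f"allowing substitutions (1 in 7):  {with_substitutions:.1e}")
print("BLAST at NCBI reported:           8e-04")
```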

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers
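If you script your searches, the command line above can be assembled programmatically. The sketch below only builds and prints the command; it does not run it. The option letters are the ones listed above, while the helper name `blastall_command`, the file name `query.fasta` and the database name `refseq_rna` are placeholders invented for this example.

```python
import shlex


def blastall_command(program, query_file, database, extra_options=None):
    """Build a blastall argument list from the three basic options
    (-p, -i, -d) plus any extra option/value pairs."""
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        argv += [flag, str(value)]
    return argv


# Hypothetical file and database names, for illustration only
argv = blastall_command(
    "blastn",
    "query.fasta",
    "refseq_rna",
    extra_options={"-e": 1e-5, "-W": 11},   # max expected value, word size
)

# shlex.join quotes each argument safely for pasting into a shell
print(shlex.join(argv))
```

Keeping the arguments as a list (rather than one long string) means the same value can later be handed to `subprocess.run` without any quoting problems.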

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Screenshots: BLASTn and BLASTx results.]

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Screenshot: BLASTn – low complexity filtering OFF.]

Results for Exercise 2A (ON)

[Screenshot: BLASTn – low complexity filtering ON.]

Results for Exercise 2B

[Screenshots: filtering ON and OFF.]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Screenshots: filtering ON and OFF.]

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Screenshots: BLASTn and tBLASTx results.]

Results for Exercise 4 (ii)

[Screenshots: mismatch penalty = -2 (default) and mismatch penalty = -1.]

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What’s in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • ‘Other’-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST – Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what’s the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 24: Bioinformatics Workshop 1 Sequences and Similarity Searches

Orthologs

Gene duplication through speciation. The two copies of Gene A will now evolve independently but will continue to have the ~same function.

They are ORTHOLOGS

Paralogs

Gene duplication through internal genome duplication.

The two copies of Gene A (A and A’) will now evolve independently but will probably not continue to have exactly the same function.

They are PARALOGS

‘Other’-logs

What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics…

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

[Figure: gene A in the ancestor and in each modern species; example gene: cyclin b1 in both organisms.]

Function Conserved Longer than Detectable Similarity

[Figure: timeline starting from the first self-replicating sequence and ending at living organisms, with whole genome duplication and local duplication events marked, and bars labelled ‘same function’ and ‘detectable similarity’.]

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

‘Synonymous’ or ‘silent’ mutations, like these changes in the third position of the codon triplet, have no effect on the amino acid coded for – so there is no evolutionary pressure against them…
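The pattern in the table can be checked mechanically. The sketch below uses only the twelve codons listed above (not the full genetic code) and reports which codon positions vary among the codons for each amino acid; the variable and function names are made up for the example.

```python
from collections import defaultdict

# Only the codons shown on the slide, not the complete genetic code
SLIDE_CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}

codons_by_amino_acid = defaultdict(list)
for codon, amino_acid in SLIDE_CODONS.items():
    codons_by_amino_acid[amino_acid].append(codon)


def variable_positions(codons):
    """1-based codon positions at which the given codons differ."""
    return [i + 1 for i in range(3) if len({c[i] for c in codons}) > 1]


for amino_acid, codons in sorted(codons_by_amino_acid.items()):
    print(amino_acid, sorted(codons), "vary at position", variable_positions(codons))
```

For every amino acid in this table the only position that varies is the third one.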

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

DNA (67% identity):

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

Protein (100% identity):

LSREPV
||||||
LSREPV

DNA (50% identity):

CTATCACGAGAACCTGTG
 | || ||    || ||
TTGTCCCGGTCGCCAGTT

Protein (83% identity):

LSREPV
||| ||
LSRFPV

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
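The comparison above is easy to reproduce. The sketch below translates the two DNA sequences with the standard genetic code and computes a simple ungapped percent identity at both levels. The compact codon table is the usual TCAG-ordered layout; the function names are invented for this example, and a real analysis would use an established library rather than this minimal version.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence in frame 1 (incomplete last codon ignored)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Ungapped identity between two equal-length sequences, in percent."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / min(len(a), len(b))


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(protein_1, protein_2)
print(f"DNA identity:     {percent_identity(dna_1, dna_2):.0f}%")
print(f"protein identity: {percent_identity(protein_1, protein_2):.0f}%")
```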

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, ‘surfeit1’ for frog and fly.

Go to the NCBI BLAST home page, then ‘Align two sequences’ (bottom left, ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – on the next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit ‘Align’, change the ‘Program’ (top left) to blastp…

Answers: Exercise 1

The Essential Task

[Figure: experiment vs data mining. A gene sequence (what is its function?) is compared with a database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3. Gravin-like is shown a second time, as the match.]

We can only do this because of implied function based on orthology.

Functional Orthologs

[Figure: Human gene (function known, annotation ‘Gravin’ available) and Xenopus gene (function unknown), linked by the labels sequence similarity, orthologs, similar shape and same function.]

But we know that function is largely determined by shape, which in general we cannot determine – but it is probably SHAPE not SEQUENCE that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.

[Figure: a frog protein is searched against a database of human proteins; the best-match human protein is then searched back against a database of frog proteins.]
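Once both searches have been run, the reciprocal best hit test itself is just a lookup in each direction. The sketch below assumes you have already reduced each BLAST run to a ‘query -> best hit’ dictionary; the gene identifiers are made up, and the parsing of real BLAST output is deliberately left out.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's best hit AND a is b's best hit."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best hits from frog -> human and human -> frog searches
frog_to_human = {
    "frog_gene_1": "human_gene_X",
    "frog_gene_2": "human_gene_Y",
}
human_to_frog = {
    "human_gene_X": "frog_gene_1",   # points back: reciprocal
    "human_gene_Y": "frog_gene_3",   # points elsewhere: not reciprocal
}

candidates = reciprocal_best_hits(frog_to_human, human_to_frog)
print(candidates)
```

Only the first pair survives, which is why the method needs complete protein sets for both organisms: a missing sequence on either side can make the wrong gene look like the best hit.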

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

[Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2.]

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments:

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no ‘real’ similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow ‘cost free’ gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’.

We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
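The effect of the gap cost can be seen with a minimal global alignment scorer. The sketch below is a textbook dynamic-programming (Needleman-Wunsch style) score calculation, not BLAST's own algorithm, and the scoring values (+1 match, -1 mismatch, -2 per gap position) are arbitrary choices made for the illustration. It scores the two random sequences above twice: once with free gaps and mismatches, once with costs.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap cost."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                previous[j] + gap,        # gap in b
                current[j - 1] + gap,     # gap in a
            ))
        previous = current
    return previous[-1]


seq_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_score = global_alignment_score(seq_1, seq_2, mismatch=0, gap=0)
costed_score = global_alignment_score(seq_1, seq_2)

print("score with cost-free gaps:", free_score)
print("score with gap and mismatch costs:", costed_score)
```

With free gaps the score is just the largest number of bases that can be lined up in order, which is high even for unrelated sequences; once gaps cost something, the score for these two sequences drops.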

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best.

query:

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

database:

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

COMPARE the query with the database, then LIST MATCHES.

Flavours of BLAST

Program  Query sequence  Other operation      Database sequences  Speed
BLASTn   nucleotide      (none)               nucleotide          FAST
BLASTp   protein         (none)               protein             FAST
BLASTx   nucleotide      6 frame translation  protein             SLOW
tBLASTn  protein         6 frame translation  nucleotide          SLOWER
tBLASTx  nucleotide      6 frame translation  nucleotide          HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

[Figure: the query is slid along the 1st database sequence one position at a time; most offsets give only scattered matches, and one offset, with a single gap, gives the alignment below.]

query                        CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
                             ||||||||||||||||||||||||| |||||||||||||||||||||||||
1st database sequence   CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible ‘words’:

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible ‘words’:

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
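Both the word index and the word-size caveat can be sketched in a few lines. This is a toy version of the idea only: real BLAST stores its index in a compact pre-built form and goes on to extend each hit, whereas the sketch stops at finding word hits. The function names and the dictionary layout are assumptions made for the example; the two sequences are the pair from the alignment above.

```python
from collections import defaultdict


def build_word_index(sequences, word_size):
    """Map every word of the given size to the (sequence id, position)
    pairs where it occurs in the database."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def word_hits(query, index, word_size):
    """All (sequence id, database position, query position) word matches."""
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for seq_id, db_pos in index.get(word, []):
            hits.append((seq_id, db_pos, q_pos))
    return hits


database = {"subject": "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"}
query = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"

hits_by_word_size = {}
for word_size in (11, 7):
    index = build_word_index(database, word_size)
    hits_by_word_size[word_size] = word_hits(query, index, word_size)
    print(f"word size {word_size:2d}: {len(hits_by_word_size[word_size])} word hits")
```

With a word size of 11 this pair shares no word at all, so the alignment would never be seeded, even though 50 of the 56 positions are identical; dropping the word size to 7 finds it.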

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

Also “expect value” or “expectation”.

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = ‘A’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ (a 60-mer)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. either ACC[TG]TA or the same motif with one extra base inserted after the [TG]?

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value.
To allow either means we expect twice as many.
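The same reasoning can be written as a small Python sketch. It assumes equal base frequencies, one strand only, and ignores edge effects at the ends of the promoter; the function names and the tiny motif grammar (plain letters plus bracketed classes such as [TG]) are illustrative choices, not part of any BLAST tool.

```python
import re


def motif_match_probability(motif, alphabet_size=4):
    """Chance that a random position matches a motif such as 'ACC[TG]TA'.

    A plain letter allows 1 base out of 4; a class like [TG] allows 2 out of 4.
    """
    probability = 1.0
    for token in re.findall(r"\[[A-Z]+\]|[A-Z]", motif):
        if token.startswith("["):
            allowed = len(set(token[1:-1]))  # number of bases inside the brackets
        else:
            allowed = 1
        probability *= allowed / alphabet_size
    return probability


def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences in a random sequence."""
    return sequence_length * motif_match_probability(motif)


print(1 / motif_match_probability("ACC[TG]TA"))  # one site every N nt
print(expected_sites("ACC[TG]TA", 10_000))       # sites per 10k promoter
```

If a second motif with the same per-position probability is also accepted, the two expectations simply add, which is where the "twice as many" in the answer comes from.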

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
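To get a feel for how quickly the number of acceptable targets grows, the sketch below enumerates every distinct database string that would align to the query if a single one-base gap were allowed inside the query. It is an illustration of the counting argument only (one gap, length one, in the query row); the function name is hypothetical.

```python
def single_gap_targets(query, alphabet="ACGT"):
    """Distinct target strings that match `query` with one 1-base gap in the query.

    Each target is the query with one extra base inserted at an internal position.
    """
    targets = set()
    for position in range(1, len(query)):  # internal positions only
        for base in alphabet:
            targets.add(query[:position] + base + query[position:])
    return targets


query = "ACGTACGTACGT"
targets = single_gap_targets(query)
print(len(targets), "distinct gapped targets instead of 1 exact target")
```

Each of these targets is individually about as unlikely as a 13-mer exact match, but there are many of them, so the total expectation rises.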

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

BLAST gives similarity + probability (match length, PI, size of database...)

ORTHOLOGY depends on biological knowledge (nature of query sequence, phylogenetic relationship)

Are there any useful guidelines, though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

(larger = worse, smaller = better)

E-value   how good?
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

residue   can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (3 acceptable residues out of 20) rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
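The two back-of-envelope protein figures can be reproduced as follows. This is a sketch of the slide arithmetic only: the 1/20 and 1/7 per-position chances are the rough values used in the text, and the function name is invented for the example. Real BLAST E-values come from scoring-matrix statistics, not from this model.

```python
# Total letters in the RefSeq protein database, as quoted in the text.
REFSEQ_PROTEIN_LETTERS = 347_895_532


def naive_protein_evalue(db_letters, query_length, per_position_chance):
    """Expected chance matches when each position 'matches' with the given chance."""
    return db_letters * per_position_chance ** query_length


exact_only = naive_protein_evalue(REFSEQ_PROTEIN_LETTERS, 20, 1 / 20)
with_substitutions = naive_protein_evalue(REFSEQ_PROTEIN_LETTERS, 20, 1 / 7)

print(f"exact residues only : {exact_only:.1e}")
print(f"~2 substitutes each : {with_substitutions:.1e}")
```

Moving from 1/20 to 1/7 per position changes the estimate by roughly nine orders of magnitude over 20 residues, which is why substitutability matters so much more than database size here.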

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Paralogs

Gene duplication through internal genome duplication

[Diagram: gene A is duplicated within a genome, giving copies A and A']

The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.

They are PARALOGS.

'Other'-logs

What about gene duplication after speciation?

[Diagram: two frog species with copies of gene A; the orange frog carries both A and A']

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS.

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

[Diagram labels: gene A in each species; example gene: cyclin b1 in both]

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

codon   letter   amino acid
GCA     A        alanine
GCC     A
GCG     A
GCT     A

TGC     C        cysteine
TGT     C

GAC     D        aspartate
GAT     D

GGA     G        glycine
GGC     G
GGG     G
GGT     G

'Synonymous' or 'silent' mutations in the third position of these codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
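A short Python sketch makes the redundancy concrete. It uses only the twelve codons listed above, so it is deliberately a partial table and not the full genetic code; the names are chosen for this example.

```python
# Partial codon table: only the codons shown in the table above.
CODON_TO_AMINO_ACID = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def is_synonymous(codon_a, codon_b):
    """True if both codons encode the same amino acid."""
    return CODON_TO_AMINO_ACID[codon_a] == CODON_TO_AMINO_ACID[codon_b]


def third_position_changes(codon):
    """Map each third-base mutant (within the partial table) to whether it is silent."""
    changes = {}
    for base in "ACGT":
        mutated = codon[:2] + base
        if mutated != codon and mutated in CODON_TO_AMINO_ACID:
            changes[mutated] = is_synonymous(codon, mutated)
    return changes


print(third_position_changes("GCA"))  # every third-base change of GCA is silent
print(is_synonymous("GAC", "GGC"))    # a second-position change alters the residue
```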

Protein Similarity Persists Longer

A sequence accumulating one mutation at a time:

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

First against last:

CTATCACGAGAACCTGTG    LSREPV
|| || || || || ||     ||||||
CTGTCCCGTGAGCCAGTT    LSREPV

DNA identity 67%, amino acid identity 100%

After further mutation:

CTATCACGAGAACCTGTG    LSREPV
TTGTCCCGGTCGCCAGTT    LSRFPV

DNA identity 44%, amino acid identity 80%

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.

Exercise 1
nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two sequences comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment / data mining

gene sequence: what is its function?

database of proteins in other species:

Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3

match: Gravin-like

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.

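Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

1. BLAST the frog protein against a database of human proteins and take the best match human protein.
2. BLAST that best match human protein against a database of frog proteins.
3. If the best match coming back is the original frog protein, the pair are reciprocal best hits; if it is a different frog protein (x), they are not.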
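The reciprocal test can be sketched in Python as below. The input format is an assumption made for this example: each search result is a dictionary from query ID to a list of (subject ID, bit score) pairs, as you might build from tabular BLAST output. The protein IDs in the toy data are invented.

```python
def best_hit(hits):
    """Return the subject ID with the highest score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def reciprocal_best_hits(frog_vs_human, human_vs_frog):
    """Pairs (frog, human) where each protein is the other's best hit."""
    pairs = []
    for frog_id, hits in frog_vs_human.items():
        human_id = best_hit(hits)
        if human_id is None:
            continue
        if best_hit(human_vs_frog.get(human_id, [])) == frog_id:
            pairs.append((frog_id, human_id))
    return pairs


# Hypothetical toy results: IDs and scores are made up for illustration.
frog_vs_human = {
    "frogA": [("humA", 250.0), ("humB", 90.0)],
    "frogB": [("humA", 120.0)],
}
human_vs_frog = {
    "humA": [("frogA", 245.0), ("frogB", 118.0)],
    "humB": [("frogA", 88.0)],
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))
```

Here frogB's best human hit is humA, but humA's best frog hit is frogA, so only the frogA/humA pair survives the reciprocal test.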

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

[Diagram: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

With a single G inserted in the top sequence:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
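The effect of charging for gaps can be shown by scoring the 'cost free' alignment above under two schemes. The sketch below scores an alignment that is already given (it does not search for the best one), uses a simple per-gap-character cost rather than separate open and extend costs, and the default reward and penalty values are illustrative assumptions. Only the overlapping columns of the alignment are used.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-3, gap=-5, gap_char="-"):
    """Score two equal-length alignment rows column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap_char or b == gap_char:
            score += gap        # a gap in either row costs `gap`
        elif a == b:
            score += match      # identical bases
        else:
            score += mismatch   # aligned but different bases
    return score


# Overlapping columns of the 'cost free' alignment of the two random sequences.
ROW_A = "AC--ACT----AGGTCGATGC---GTGG---TG"
ROW_B = "ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG"

print(alignment_score(ROW_A, ROW_B, gap=0))   # gaps are free
print(alignment_score(ROW_A, ROW_B, gap=-5))  # gaps are charged
```

With free gaps the 16 matching columns make the alignment look good; once each of the 17 gap characters is charged, the same alignment scores far below zero and would be rejected.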

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

query:

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

database:

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query + database -> COMPARE -> LIST MATCHES

Flavours of BLAST

flavour   query sequence                    database sequences                speed
BLASTn    nucleotide                        nucleotide                        FAST
BLASTp    protein                           protein                           FAST
BLASTx    nucleotide, 6 frame translation   protein                           SLOW
tBLASTn   protein                           nucleotide, 6 frame translation   SLOWER
tBLASTx   nucleotide, 6 frame translation   nucleotide, 6 frame translation   HORRIBLY SLOW

Example nucleotide sequence: ACGATAGATCCCATCCATAAAT
Example protein sequence: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Diagram: the query is slid along the database sequence and the matches are counted at each offset. The best alignment, allowing one gap, is:]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1  GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2  TAAGCAAATTTAATTTTGTTTACATTTTC
3  GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
...
ACAAATCC  07967
...
GACAAATC  33568
GACAAATG  33569
...
TCCAAACC  64321
...

Build a position index of all words in the database

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
...
3         15        33568
3         16        07967

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse the QUERY SEQUENCE against the same numbered list of all possible 'words':

position  word
1         14236
2         33568
3         07967

The query words are then looked up in the index of the database.
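The word-index idea can be sketched with a Python dictionary. This is a simplification for illustration: it stores the words themselves as keys rather than numeric word codes, uses exact words only, and the function names are made up. The three database sequences and the query are the ones shown above, with a word size of 8.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word in the database to the (sequence id, 1-based position) where it occurs."""
    index = defaultdict(list)
    for seq_id, sequence in database.items():
        for start in range(len(sequence) - word_size + 1):
            index[sequence[start:start + word_size]].append((seq_id, start + 1))
    return index


def word_hits(query, index, word_size):
    """List (query position, sequence id, sequence position) for every shared word."""
    hits = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for seq_id, position in index.get(word, []):
            hits.append((start + 1, seq_id, position))
    return hits


DATABASE = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
QUERY = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(DATABASE, word_size=8)  # built once, reused for every query
for hit in word_hits(QUERY, index, word_size=8)[:3]:
    print(hit)
```

The index is built once per database, so each query costs only one dictionary lookup per word instead of a scan of every database sequence.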

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
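Whether a pair like this can be seeded at all depends on its longest run of consecutive identical positions. The helper below (a hypothetical name, assuming the two sequences are already aligned without gaps) measures that run for the example above.

```python
def longest_exact_run(seq_a, seq_b):
    """Length of the longest run of consecutive identical positions."""
    best = run = 0
    for a, b in zip(seq_a, seq_b):
        run = run + 1 if a == b else 0  # extend the run or reset it
        best = max(best, run)
    return best


SEQ_A = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
SEQ_B = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_exact_run(SEQ_A, SEQ_B)
print(run, "consecutive matches; a word size of 11 needs at least 11")
```

The two sequences are about 89% identical, yet the longest exact run is shorter than the default BLASTn word, so no word hit is found and the alignment is never extended. Lowering the word size (for example to 7) would recover it.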

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)

Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)

Notation

12e-35 = 12 x 10-35

48 x 106 = 4800000

We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do

Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28 (taking 4^60 as roughly 10^36)

E-value = 5.0 x 10^-28
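The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the simplified model used here (every database letter is an independent chance to start a match, and each base matches with probability 1/4). The function name expected_chance_matches is made up for illustration, and this is not how BLAST itself computes E-values.

```python
def expected_chance_matches(db_letters, query_len, alphabet_size=4):
    """Expected number of exact chance matches under the slide's simplified model."""
    return db_letters / alphabet_size ** query_len


REFSEQ_MRNA_LETTERS = 5.0e8  # approximate database size quoted on the slide

for query in ("A", "AC", "ACG"):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, len(query))
    print(f"{query:<4}{expected:.2e}")

# 60-mer, using the exact value of 4**60 (about 1.3e36) rather than the rounded 10**36
print(f"60-mer {expected_chance_matches(REFSEQ_MRNA_LETTERS, 60):.1e}")
```

With the exact value of 4^60 the 60-mer figure comes out a little below the slide's 5.0 x 10^-28, because the slide rounds 4^60 down to 10^36.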

E-values In Practice
So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer
ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites in every 10 kb by chance.

If ACC[TG]TAA is also allowed:

The two motifs independently have the same E-value.
To allow either means we expect twice as many.
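The same counting can be written down as a short Python sketch. The names MOTIF, match_probability and expected_sites are illustrative only, all four bases are taken to be equally likely, and the second motif assumes the optional extra position accepts any base (that is an assumption, but it is what makes the two motifs equally probable).

```python
from fractions import Fraction

# Each entry is the set of bases allowed at that position; ACC[TG]TA has six positions.
MOTIF = ["A", "C", "C", "TG", "T", "A"]
# Assumption: the optional extra position between the 4th and 5th accepts any base.
MOTIF_WITH_EXTRA_BASE = MOTIF[:4] + ["ACGT"] + MOTIF[4:]


def match_probability(motif):
    """Chance that a random position starts a match, all four bases equally likely."""
    probability = Fraction(1)
    for allowed in motif:
        probability *= Fraction(len(allowed), 4)
    return probability


def expected_sites(motif, sequence_length=10_000):
    """Expected number of chance occurrences in a sequence of the given length."""
    return float(sequence_length * match_probability(motif))


print(1 / match_probability(MOTIF))   # spacing between chance sites, in nt
print(expected_sites(MOTIF))          # roughly 5 sites per 10 kb
print(expected_sites(MOTIF) + expected_sites(MOTIF_WITH_EXTRA_BASE))  # either motif allowed
```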

E-values: Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (RefSeq had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

(nr is ~30 x bigger)

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
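As a quick check of that proportionality, here is a minimal Python sketch. The helper rescale_e_value is a made-up name, and it assumes the same query and the same alignment, so that only the database size changes.

```python
REFSEQ_LETTERS = 5.0e8   # RefSeq mRNA size quoted in the workshop
NR_LETTERS = 1.4e10      # nr size quoted in the workshop


def rescale_e_value(e_value, old_db_letters, new_db_letters):
    """Scale an E-value linearly with database size (same query, same alignment)."""
    return e_value * new_db_letters / old_db_letters


size_ratio = NR_LETTERS / REFSEQ_LETTERS
print(f"nr is about {size_ratio:.0f} x larger than RefSeq mRNA")
# Theoretical RefSeq value carried over to nr
print(f"{rescale_e_value(5.0e-28, REFSEQ_LETTERS, NR_LETTERS):.1e}")
# RefSeq value reported by BLAST (2e-26) carried over to nr
print(f"{rescale_e_value(2e-26, REFSEQ_LETTERS, NR_LETTERS):.1e}")
```

The rescaled value for the reported RefSeq hit lands close to the 6e-25 that BLAST reported against nr.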

Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?
At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
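To see that the two alignments really do share the same percent identity despite their very different E-values, the two pairs can be compared directly. This is a small sketch; percent_identity is an illustrative helper that only handles ungapped alignments of equal length.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions with the same letter (ungapped, equal length)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

for name, query, match in (("Query1", query1, match1), ("Query2", query2, match2)):
    print(f"{name}: {len(query)} nt, {percent_identity(query, match):.0f}% identity")
```

Identity alone cannot separate the two cases; it is the length of the match that makes the first one so much less likely to arise by chance.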

So what's the real problem?
Basically, you are usually trying to answer the question:

"Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?"

The difficulty is because:

BLAST gives similarity + probability (match length, PI, size of database...)

ORTHOLOGY depends on biological knowledge, the nature of the query sequence, and phylogenetic relationship.

Are there any useful guidelines, though, at least for biological meaningfulness?

Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?

larger/worse  <------------------------------------------------------->  smaller/better

E-value   10^-5     10^-10       10^-40        10^-100       0.0
          fantasy   borderline   encouraging   pretty good   can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
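The two back-of-envelope protein figures can be reproduced as follows. This is a sketch of the simplified counting argument only; chance_matches is a hypothetical helper, and the "7 choices per position" is the rough 3-in-20 estimate from the slide, not a value taken from a BLOSUM or PAM matrix.

```python
PROTEIN_DB_LETTERS = 347_895_532  # RefSeq protein database size quoted above


def chance_matches(db_letters, query_len, choices_per_position):
    """Expected chance matches when each position matches with probability 1/choices."""
    return db_letters / choices_per_position ** query_len


exact = chance_matches(PROTEIN_DB_LETTERS, 20, 20)    # identical residues only
tolerant = chance_matches(PROTEIN_DB_LETTERS, 20, 7)  # about 3 acceptable residues in 20

print(f"exact matching only:     {exact:.1e}")
print(f"allowing substitutions:  {tolerant:.1e}")
```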

Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
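If you want to script such searches, one option is to assemble the command line in Python and hand it to the operating system. The sketch below only builds and prints the command, using the -p, -i, -d, -e and -W options listed above. The helper name build_blastall_command and the file and database names are placeholders, and it assumes the legacy blastall program is what you have installed.

```python
import shlex


def build_blastall_command(program, query_file, database, max_e_value=None, word_size=None):
    """Assemble a blastall argument list from the options described above."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_e_value is not None:
        command += ["-e", str(max_e_value)]
    if word_size is not None:
        command += ["-W", str(word_size)]
    return command


# Placeholder file and database names, not files shipped with the workshop.
command = build_blastall_command(
    "blastn", "query.fasta", "refseq_rna", max_e_value=1e-5, word_size=11
)
print(shlex.join(command))
```

Passing the list to subprocess.run(command) would actually start the search, provided blastall and a formatted database are available on your machine.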

BLAST Parameters Exercises
1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)         AGAAAAGAAGAAACATGGCAATGGATCAGAA
                   |||||||||||||||| ||||||||||||||
Genomic sequence   AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises
6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


'Other'-logs
What about gene duplication after speciation?

How can we describe the relationship(s) between the various copies of gene A in the two frogs?

Bear in mind that understanding gene function is more important than semantics...

The two copies of A in the orange frog are sometimes called IN-PARALOGS.

If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.

[Diagram labels: A, A, A, A', A]

The Essential Paradigm

1. Any group of modern species can be traced back to some extinct common ancestor.

2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.

3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.

[Diagram labels: gene A in each species; example gene: cyclin b1]

Function Conserved Longer than Detectable Similarity

[Figure labels: start from first self-replicating sequence; same function; detectable similarity; living organisms; whole genome duplication; local duplication]

Redundancy in the Genetic Code

Codon   Letter   Amino acid
GCA     A        alanine
GCC     A
GCG     A
GCT     A

TGC     C        cysteine
TGT     C

GAC     D        aspartate
GAT     D

GGA     G        glycine
GGC     G
GGG     G
GGT     G

'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for - so there is no evolutionary pressure against this...
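The table above can be turned into a small lookup to list the silent third-position changes for each codon. This is a sketch using only the codons shown on the slide (a small part of the standard genetic code); the function name synonymous_third_base_changes is illustrative, and any codon missing from the partial table is simply treated as coding for something else.

```python
# Codon assignments copied from the slide (a small part of the standard genetic code).
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def synonymous_third_base_changes(codon):
    """Third-position substitutions that leave the encoded amino acid unchanged."""
    amino_acid = CODON_TABLE[codon]
    return [
        codon[:2] + base
        for base in "ACGT"
        if base != codon[2] and CODON_TABLE.get(codon[:2] + base) == amino_acid
    ]


for codon in ("GCA", "TGC", "GAC", "GGA"):
    print(codon, CODON_TABLE[codon], synonymous_third_base_changes(codon))
```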

Protein Similarity Persists Longer

A sequence accumulating one substitution at a time:

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

DNA: 67% identity        Protein: 100% identity
CTATCACGAGAACCTGTG       LSREPV
|| || || || || ||        ||||||
CTGTCCCGTGAGCCAGTT       LSREPV

DNA: 44% identity        Protein: 80% identity
CTATCACGAGAACCTGTG       LSREPV
                         ||| ||
TTGTCCCGGTCGCCAGTT       LSRFPV

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
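The comparison above can be reproduced by translating both DNA sequences and counting identical positions at each level. This is a minimal sketch: it uses the standard genetic code written out as a 64-letter string (codons in T, C, A, G order), and the helper names translate and identical_positions are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Number of aligned positions carrying the same letter."""
    return sum(x == y for x, y in zip(a, b))


dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(protein_1, protein_2)
print(f"DNA:     {identical_positions(dna_1, dna_2)}/{len(dna_1)} identical")
print(f"protein: {identical_positions(protein_1, protein_2)}/{len(protein_1)} identical")
```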

Exercise 1
nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align' - this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

[Diagram: experiment / data mining. A gene sequence (what is its function?) is compared with a database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3. The match is Gravin-like.]

We can only do this because of implied function based on orthology.

Functional Orthologs

[Diagram: Human gene (function known, annotation 'Gravin' available) and Xenopus gene (function unknown); sequence similarity -> orthologs -> same function]

But we know that function is largely determined by shape (similar shape), which in general we cannot determine - but it is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs
So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

[Diagram: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
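The reciprocal test itself is easy to express once you have the best hit for every protein in each direction. The sketch below assumes those best-hit tables already exist (for example, parsed from two BLASTp runs); the protein names in them are invented for illustration and the function name is hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best-hit tables; the identifiers are made up for this example.
frog_to_human = {
    "frog_cyclinA": "human_CCNA2",
    "frog_geneX": "human_GRAVIN",
    "frog_geneY": "human_GRAVIN",
}
human_to_frog = {
    "human_CCNA2": "frog_cyclinA",
    "human_GRAVIN": "frog_geneX",
}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_geneY is dropped: its best human hit points back to a different frog protein, so the pair is not reciprocal.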

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments:

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps
Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
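The effect of the gap cost can be seen with a tiny global alignment scorer. The sketch below is a plain Needleman-Wunsch score calculation with a linear gap cost, which is a textbook technique and not the heuristic BLAST actually uses; the scoring values (+1 match, -1 mismatch, -2 per gap) are arbitrary choices for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score (Needleman-Wunsch) with a linear gap cost."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


seq_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

# With free gaps the score is just the number of letters that can be lined up.
print("cost-free gaps:", global_alignment_score(seq_1, seq_2, mismatch=0, gap=0))
# With a gap cost, the same two random sequences score far less impressively.
print("penalised gaps:", global_alignment_score(seq_1, seq_2))
```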

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection, or 'database', of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

Flavour   Query sequence   Other operation                        Database sequences   Speed
BLASTn    nucleotide       -                                      nucleotide           FAST
BLASTp    protein          -                                      protein              FAST
BLASTx    nucleotide       6-frame translation of the query       protein              SLOW
tBLASTn   protein          6-frame translation of the database    nucleotide           SLOWER
tBLASTx   nucleotide       6-frame translation of both            nucleotide           HORRIBLY SLOW

How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:                  CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
1st database sequence:  CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Animation: the query is compared with the 1st database sequence at successive offsets, ending with the best gapped alignment:]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
...
ACAAATCC  07967
ACAAATCG  07968
ACAAATCT  07969
...
GACAAATC  33568
GACAAATG  33569
...
TCCAAACC  64321
TCCAAACG  64322

Build a position index of all words in the database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
...
3         15        33568
3         16        07967
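A toy version of the word index can be built with a dictionary. This sketch keys the index on the 8-letter word itself rather than on a word number, which is a simplification of the scheme drawn on the slide; build_word_index and seed_hits are illustrative names, and the real BLAST data structures are considerably more elaborate.

```python
from collections import defaultdict


def build_word_index(sequences, word_size):
    """Map each word to the (sequence number, position) pairs where it occurs (1-based)."""
    index = defaultdict(list)
    for seq_number, sequence in enumerate(sequences, start=1):
        for start in range(len(sequence) - word_size + 1):
            index[sequence[start:start + word_size]].append((seq_number, start + 1))
    return index


def seed_hits(query, index, word_size):
    """Yield (query position, sequence number, sequence position) for every shared word."""
    for start in range(len(query) - word_size + 1):
        for seq_number, position in index.get(query[start:start + word_size], []):
            yield start + 1, seq_number, position


database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)
hits = list(seed_hits(query, index, word_size=8))
print(hits[:3])
```

Every hit tells us which database sequence to look at and at what offset relative to the query, without scanning the sequences themselves.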

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse the QUERY SEQUENCE using the same numbered list of all possible 'words' as for the database:

position  word
1         14236
2         33568
3         07967

Index of database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
...
3         15        33568
3         16        07967

Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
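Why this pair slips through can be checked by measuring the longest run of identical letters in the alignment shown and comparing it with the word size. This is a small sketch; longest_identical_run is an illustrative helper, and it only looks at the ungapped alignment as drawn above.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of identical letters in an ungapped alignment."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


seq_1 = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_2 = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_identical_run(seq_1, seq_2)
identical = sum(x == y for x, y in zip(seq_1, seq_2))

print(f"{identical}/{len(seq_1)} positions identical, longest exact run = {run}")
for word_size in (11, 7):
    print(f"word size {word_size}: a seed word exists in this alignment = {run >= word_size}")
```

Despite the high overall identity, no stretch is long enough to seed a search at word size 11, whereas a smaller word size would pick it up.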

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation"

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (14 occurrences in this stretch)

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC' (4 occurrences in this stretch)

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG' (1 occurrence in this stretch)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
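The same first-principles arithmetic can be written as a small Python sketch. It assumes, as the slide does, equal base frequencies and exact ungapped matches only; the function name is made up for illustration. Note that the slide rounds 4^60 down to 10^36, so the unrounded 60-mer figure comes out a little smaller than 5.0 x 10^-28.

```python
def expected_exact_matches(db_letters: float, query_len: int, alphabet: int = 4) -> float:
    """Expected number of exact chance matches: database size / alphabet ** query length."""
    return db_letters / alphabet ** query_len


REFSEQ_LETTERS = 5.0e8  # ~ size of the RefSeq mRNA database quoted on the slide

for query in ("A", "AC", "ACG"):
    print(query, f"{expected_exact_matches(REFSEQ_LETTERS, len(query)):.2e}")

# The 60-mer case, without rounding 4**60 to 10**36
print("60-mer", f"{expected_exact_matches(REFSEQ_LETTERS, 60):.1e}")
```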

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for the RefSeq mRNA search was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
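The answer can be checked with a short Python sketch. It assumes equal base frequencies and treats the optional extra position as "any base", which is an assumption about the second motif and not something the slide states; the helper name is made up for illustration.

```python
from math import prod


def expected_motif_hits(motif: list[str], seq_len: int) -> float:
    """Expected chance occurrences of a motif given as one set of allowed bases per position."""
    per_site_probability = prod(len(allowed) / 4 for allowed in motif)
    return seq_len * per_site_probability


PROMOTER_LEN = 10_000

# ACC[TG]TA: one allowed-base string per position
motif = ["A", "C", "C", "TG", "T", "A"]
# Hypothetical variant with one extra position that accepts any base
motif_with_extra_base = ["A", "C", "C", "TG", "ACGT", "T", "A"]

hits = expected_motif_hits(motif, PROMOTER_LEN)
hits_variant = expected_motif_hits(motif_with_extra_base, PROMOTER_LEN)

print(f"ACC[TG]TA: one site every {PROMOTER_LEN / hits:.0f} nt, ~{hits:.1f} per 10k")
print(f"either motif allowed: ~{hits + hits_variant:.1f} per 10k")
```

A position that accepts any base contributes a factor of 1, which is why the two motifs have the same expectation and allowing either doubles the count.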

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq mRNA and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each:

Database   Size                    E-value
RefSeq     5.0 x 10^8 letters      5.0e-28
nr         1.4 x 10^10 letters     1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length

Why not just use identity?

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
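A minimal Python sketch makes the point concrete: percent identity is the same for both example alignments, so on its own it cannot separate the long, convincing match from the short, chance-level one. The helper name is made up for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two equal-length, ungapped aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Query1 alignment from the slide (60 nt)
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

# Query2 alignment from the slide (16 nt)
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

for name, q, m in (("Query1", query1, match1), ("Query2", query2, match2)):
    print(f"{name}: length {len(q)}, identity {percent_identity(q, m):.0f}%")
```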

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

The difficulty is because:

BLAST: Similarity + Probability

ORTHOLOGY

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse  <------------------------------------------------------->  smaller/better

E-values   10^-5     10^-10       10^-40        10^-100       0.0
           fantasy   borderline   encouraging   pretty good   can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.3 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Possible substitutes
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th

So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices
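The two protein estimates can be reproduced with a short Python sketch. It only replays the slide's back-of-envelope model (database letters divided by the per-position odds raised to the query length); it is not how BLAST computes E-values, and the function name is made up for illustration.

```python
def expected_chance_matches(db_letters: int, query_len: int, odds_per_position: int) -> float:
    """Back-of-envelope expectation: database size / odds_per_position ** query_len."""
    return db_letters / odds_per_position ** query_len


REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted in the BLAST output
QUERY_LEN = 20

# Exact matching: each position matches 1 time in 20
exact = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LEN, 20)
# Allowing ~2 substitutes per residue: each position 'matches' about 1 time in 7
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LEN, 7)

print(f"exact matches only:      {exact:.1e}")
print(f"substitutions tolerated: {with_substitutions:.1e}")
```

Moving from 1-in-20 to 1-in-7 per position raises the estimate by about nine orders of magnitude, which accounts for most of the gap to the reported 8e-04.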

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary

Here are some of the ones that I have used

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
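As an illustration of how these options combine on the command line, here is a small Python sketch that assembles a blastall argument list. It uses only flags listed above (-p, -i, -d, -e, -W); the helper name, file name and database names are hypothetical, and the sketch only builds and prints the command without running it.

```python
def blastall_command(program, query_file, database, evalue=None, word_size=None):
    """Build a blastall argument list; unset options are left to blastall's defaults."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        cmd += ["-e", str(evalue)]  # max expected value to report
    if word_size is not None:
        cmd += ["-W", str(word_size)]  # word size
    return cmd


# Hypothetical file and database names, for illustration only
default_search = blastall_command("blastn", "query.fa", "refseq_rna")
sensitive_search = blastall_command("blastn", "query.fa", "nr", evalue=100, word_size=7)

print(" ".join(default_search))
print(" ".join(sensitive_search))
```

A list like this could be handed to `subprocess.run` on a machine where blastall and a formatted database are installed.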

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence

Run the search with the default values

Now re-run the search, setting the maximum E-value in the box

Expect 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search, setting the following parameters

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

The Essential Paradigm

1 Any group of modern species can be traced back to some extinct common ancestor

2 In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor

3 If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function

[Figure: ancestor 'A' and its descendant species, with cyclin b1 labelled in two of them]

Function Conserved Longer than Detectable Similarity

start from first self-replicating sequence

same function detectable similarity

living organisms

whole genome duplication local duplication

Redundancy in the Genetic Code

Codon   Amino acid
GCA     A   alanine
GCC     A
GCG     A
GCT     A

TGC     C   cysteine
TGT     C

GAC     D   aspartate
GAT     D

GGA     G   glycine
GGC     G
GGG     G
GGT     G

'Synonymous' or 'silent' mutations, most of them in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against them…
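The redundancy can be listed directly from the standard genetic code. The Python sketch below builds the standard codon table from the conventional TCAG ordering and groups codons by the amino acid they encode; the variable names are made up for illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group synonymous codons under the amino acid they encode
synonymous = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    synonymous[aa].append(codon)

for aa in "ACDG":  # the four amino acids shown on the slide
    print(aa, sorted(synonymous[aa]))
```

For each of these four amino acids the synonymous codons differ only in their third base.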

Protein Similarity Persists Longer

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

LSREPV
||||||
LSREPV

CTATCACGAGAACCTGTG

TTGTCCCGGTCGCCAGTT | || | || ||

LSREPV

LSRFPV||| ||

First pair: DNA identity 67%, protein identity 100%

Second pair: DNA identity 44%, protein identity 80%

Always Compare Protein Sequences

DNA comparison

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
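The comparison can be repeated in a few lines of Python. This sketch translates the two example DNA sequences with the standard genetic code and compares identity at both levels; the helper names are made up for illustration, and the translation simply reads frame 1 from the first base.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate frame 1 of a DNA sequence, ignoring any incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two equal-length sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


dna1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein1, protein2 = translate(dna1), translate(dna2)

print(protein1, protein2)
print(f"DNA identity:     {percent_identity(dna1, dna2):.0f}%")
print(f"protein identity: {percent_identity(protein1, protein2):.0f}%")
```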

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align' - this will do a direct DNA/DNA comparison

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Taskexperiment data mining

gene sequence what is its function

database of proteins in other species

Cyclin-AFoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine - but it is probably SHAPE, not SEQUENCE, that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in

[Figure: a frog protein is searched against a database of human proteins; the best match human protein is then searched back against a database of frog proteins]
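The reciprocal best hit logic can be sketched in Python as follows. The sketch assumes the two BLAST searches have already been run and summarised as score tables; the protein names and scores are invented purely for illustration.

```python
def best_hit(hits: dict[str, float]) -> str:
    """Name of the highest-scoring hit in a {subject: score} table."""
    return max(hits, key=hits.get)


def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> dict[str, str]:
    """Pairs (a, b) where b is a's best hit AND a is b's best hit."""
    pairs = {}
    for a, hits in a_vs_b.items():
        if not hits:
            continue
        b = best_hit(hits)
        back_hits = b_vs_a.get(b)
        if back_hits and best_hit(back_hits) == a:
            pairs[a] = b
    return pairs


# Hypothetical bit scores: frog proteins searched against human, and human against frog
frog_vs_human = {
    "frog_cyclinA1": {"human_CCNA1": 610.0, "human_CCNA2": 480.0},
    "frog_cyclinA2": {"human_CCNA2": 650.0, "human_CCNA1": 470.0},
    "frog_orphan": {"human_CCNA1": 55.0},
}
human_vs_frog = {
    "human_CCNA1": {"frog_cyclinA1": 605.0, "frog_cyclinA2": 465.0, "frog_orphan": 50.0},
    "human_CCNA2": {"frog_cyclinA2": 640.0, "frog_cyclinA1": 475.0},
}

candidates = reciprocal_best_hits(frog_vs_human, human_vs_frog)
print(candidates)
```

Here frog_orphan hits human_CCNA1, but the return search prefers frog_cyclinA1, so it is not reported as a candidate ortholog.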

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
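The effect of a gap cost can be demonstrated with a textbook global alignment score (Needleman-Wunsch style dynamic programming). This is a generic sketch, not the algorithm BLAST uses, and the scoring values (match +1, mismatch -1, gap 0 or -2) are chosen only for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap cost (dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# The two random sequences from the slide
seq1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(seq1, seq2, gap=0)
costly_gaps = global_alignment_score(seq1, seq2, gap=-2)

print("score with cost-free gaps:", free_gaps)
print("score with gap cost -2:   ", costly_gaps)
```

With free gaps the score is simply the number of matched columns that can be squeezed out, so even unrelated sequences look good; once every gap costs something, the apparent similarity collapses.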

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query -> COMPARE with database -> LIST MATCHES

Flavours of BLAST

Flavour   Query sequence   Other operation                             Database sequences   Speed
BLASTn    nucleotide       none                                        nucleotide           FAST
BLASTp    protein          none                                        protein              FAST
BLASTx    nucleotide       6 frame translation of query                protein              SLOW
tBLASTn   protein          6 frame translation of database             nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database   nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

query:                  CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:  CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence one position at a time, with matching positions marked at each offset; the best alignment needs a single gap]

     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker
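The word and index idea can be sketched in Python as below. This is a simplified illustration and not BLAST's actual data structure: it indexes exact 8-letter words from the three example database sequences in a dictionary and then looks up every word of the query; the function names are made up.

```python
from collections import defaultdict


def build_word_index(database: dict[int, str], word_size: int) -> dict[str, list[tuple[int, int]]]:
    """Map every word in the database to the (sequence id, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


def find_seeds(query: str, index: dict, word_size: int) -> list[tuple[int, int, int]]:
    """Return (query position, sequence id, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((qpos + 1, seq_id, dpos))
    return seeds


WORD_SIZE = 8
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, WORD_SIZE)  # built once, reused for every query
seeds = find_seeds(query, index, WORD_SIZE)

hit_sequences = sorted({seq_id for _, seq_id, _ in seeds})
print("sequences with at least one shared word:", hit_sequences)
print("first seed (query pos, sequence, db pos):", seeds[0])
```

Only the sequences that come back from the lookup need the expensive extension step, which is where the speed comes from.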

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST ndashTypical OutputINPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUTQuery= (311 letters) Database NCBI Protein Reference Sequences 954378 sequences 347895532 total letters

gtgi|41055060|ref|NP_9574201| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)Expect = 6e-31 Identities = 7698 (77) Positives = 8298 (83) Gaps = 498 (4) Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

Also called "expect value" or "expectation".

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance? Each extra letter in the query divides the expected number by 4.

Example stretch of database sequence (on the slide, every occurrence of the query is marked underneath):

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A': expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC': expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG': expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
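The same back-of-envelope sum is easy to redo in Python. This is a sketch of the slide's simplified model only (exact, ungapped matches, all four bases equally likely), not the statistics that BLAST really uses; the function name is made up for illustration. Note that 4^60 is really about 1.3 x 10^36, so the unrounded answer for the 60-mer comes out a little below the slide's rounded 5.0 x 10^-28.

```python
def expected_chance_matches(db_letters: float, query_length: int,
                            alphabet_size: int = 4) -> float:
    """Expected number of exact matches to a query in random sequence."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size from the slide

for query in ("A", "AC", "ACG"):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, len(query))
    print(f"{query:>3}: {expected:.2g}")

print(f"60-mer: {expected_chance_matches(REFSEQ_MRNA_LETTERS, 60):.1e}")
```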

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for the RefSeq mRNA search was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these binding sites in every 10 kb by chance (10000 / 2048 = ~4.9).

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
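The exercise answer can be checked with a short Python sketch. The assumptions are mine: all four bases are equally likely, only one strand is scanned, and edge effects are ignored. The motif syntax handled here (plain bases, a bracketed set such as [TG], and N for "any base") and the spelling 'ACC[TG]NTA' for the variant with an extra base are illustrative choices, not something taken from the slides.

```python
import re


def motif_probability(pattern: str) -> float:
    """Chance that a random position matches a motif such as 'ACC[TG]TA'."""
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGTN]", pattern):
        if token == "N":
            choices = 4                      # any base is acceptable
        else:
            choices = len(token.strip("[]"))  # 1 for a plain base, 2 for [TG]
        probability *= choices / 4
    return probability


def expected_sites(pattern: str, sequence_length: int) -> float:
    """Expected number of chance occurrences in a random sequence."""
    return sequence_length * motif_probability(pattern)


one_motif = expected_sites("ACC[TG]TA", 10_000)
with_extra_base = expected_sites("ACC[TG]NTA", 10_000)  # hypothetical spelling

print(f"ACC[TG]TA alone: {one_motif:.1f} per 10 kb")
print(f"either motif:    {one_motif + with_extra_base:.1f} per 10 kb")
```

Because an "any base" position matches every time, the longer variant is exactly as likely as the original, which is why allowing either form doubles the expected count.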

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had ~5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A': expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC': expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG': expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value depends directly on database size.

BLAST the same sequence against each:

Database  Size                  E-value
RefSeq    5.0 x 10^8 letters    5.0e-28
nr        1.4 x 10^10 letters   1.4e-26

The nr database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value depends directly on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

The difficulty is because BLAST is not the same thing as ORTHOLOGY:

BLAST: similarity + probability

ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database...

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

(larger = worse, smaller = better)

E-value   Verdict
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.3 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L, W, Y
G         (none)
I         L, M, V
L         I, M, F, V
M         I, L, V
P         (none)
V         I, L, M
W         F, Y

N         D, H, S
Q         R, E, H, K
S         A, N, T
T         S
Y         H, F, W

H         N, Q, Y
K         R, Q, E
R         Q, K

D         N, E
E         D, Q, K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7 chance of 'matching' (3 acceptable residues out of 20) rather than 1/20.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
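To see how much difference substitutability makes, the two protein estimates can be recomputed side by side. As before this is only the slide's simplified model (independent positions, no gaps, no real scoring matrix), and the function name is made up for illustration.

```python
def expected_chance_matches(db_letters: float, match_length: int,
                            per_position_chance: float) -> float:
    """Expected number of chance matches of a given length in a database."""
    return db_letters * per_position_chance ** match_length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # rounded from 347895532 letters

exact_only = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 7)

print(f"exact residues only:     {exact_only:.1e}")
print(f"allowing ~2 substitutes: {with_substitutions:.1e}")
print(f"ratio:                   {with_substitutions / exact_only:.1e}")
```

Letting each position accept about three residues instead of one raises the expected number of chance matches by roughly nine orders of magnitude for a 20-residue match, which accounts for most of the gap between the naive estimate and the E-value BLAST reports.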

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
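If you end up scripting searches, the command line can be assembled programmatically rather than typed by hand. The sketch below only builds and prints the command; it does not run anything. It uses just the options described above (-p, -i, -d, -e, -W). The function name, the query file 'query.fasta' and the database name 'refseq_rna' are placeholders of mine, and the legacy 'blastall' program described in these slides has since been superseded by the BLAST+ suite, so treat this as an illustration of the idea rather than a recipe.

```python
import shlex


def build_blastall_command(program, query_file, database,
                           max_evalue=None, word_size=None):
    """Assemble a legacy blastall command line as a list of arguments."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_evalue is not None:
        command += ["-e", str(max_evalue)]   # max expected value
    if word_size is not None:
        command += ["-W", str(word_size)]    # word size
    return command


# Placeholder file and database names, for illustration only.
command = build_blastall_command("blastn", "query.fasta", "refseq_rna",
                                 max_evalue=100, word_size=7)
print(shlex.join(command))
```

Keeping the command as a list of separate arguments avoids quoting problems if a file name contains spaces.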

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B: filtering ON and OFF

Results for Exercise 2C: filtering ON and OFF

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

Function Conserved Longer than Detectable Similarity

(Figure: a timeline starting from the first self-replicating sequence and running to living organisms, with whole genome duplications and local duplications marked, comparing how long 'same function' and 'detectable similarity' persist.)

Redundancy in the Genetic Code

Codons               Letter  Amino acid
GCA, GCC, GCG, GCT   A       alanine
TGC, TGT             C       cysteine
GAC, GAT             D       aspartate
GGA, GGC, GGG, GGT   G       glycine

'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
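The redundancy is easy to see by grouping all 64 codons by the amino acid they encode. The sketch below builds the standard genetic code from the usual compact 64-letter string (bases ordered T, C, A, G); the variable names are my own.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group the codons by the amino acid they code for.
codons_for = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_for[amino_acid].append(codon)

for amino_acid in "ACDG":  # the four examples from the slide
    print(amino_acid, sorted(codons_for[amino_acid]))
```

For the four amino acids on the slide, every codon in a group differs from the others only at the third base.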

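Protein Similarity Persists Longer

A DNA sequence accumulating one mutation at a time:

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

Comparing the first and last sequences:

DNA (67% identity)     Protein (100% identity)
CTATCACGAGAACCTGTG     LSREPV
|| || || || || ||      ||||||
CTGTCCCGTGAGCCAGTT     LSREPV

A more diverged pair:

DNA (50% identity)     Protein (83% identity)
CTATCACGAGAACCTGTG     LSREPV
 | || ||    || ||      ||| ||
TTGTCCCGGTCGCCAGTT     LSRSPV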

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
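The comparison on this slide can be reproduced in a few lines: translate both DNA sequences and compute percent identity at each level. This is a minimal sketch using the standard genetic code and a plain position-by-position identity count (no gaps, no scoring matrix); the function names are mine.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


# The two sequences from the slide.
dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein_1, protein_2 = translate(dna_1), translate(dna_2)

print(protein_1, protein_2)
print(f"DNA identity:     {percent_identity(dna_1, dna_2):.0f}%")
print(f"protein identity: {percent_identity(protein_1, protein_2):.0f}%")
```

Identity here ignores conservative substitutions such as M for I, which BLAST would additionally count as a 'positive'.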

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers: Exercise 1

The Essential Task

(Figure: an experiment gives us a gene sequence. What is its function? By data mining we compare it against a database of proteins in other species, e.g. Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3, and find the match 'Gravin-like'.)

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown

From sequence similarity we infer that the genes are orthologs, and from orthology that they have the same function.

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

(Figure: BLAST frog protein x against the database of human proteins to get the best match human protein; then BLAST that human protein against the database of frog proteins and see whether the best match is x again.)
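Once the two BLAST runs are done, the reciprocal check itself is just a lookup in both directions. The sketch below assumes you have already reduced each run to a table of 'best hit per query'; the protein identifiers are invented for illustration and the function name is my own.

```python
def reciprocal_best_hits(best_forward: dict, best_reverse: dict) -> list:
    """Return (a, b) pairs where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_forward.items() if best_reverse.get(b) == a
    )


# Hypothetical best-hit tables, e.g. parsed from two BLAST runs.
frog_to_human = {"frog_A": "human_1", "frog_B": "human_2", "frog_C": "human_2"}
human_to_frog = {"human_1": "frog_A", "human_2": "frog_C", "human_3": "frog_B"}

for frog, human in reciprocal_best_hits(frog_to_human, human_to_frog):
    print(frog, "<->", human)
```

Here frog_B is dropped: its best human match prefers frog_C, so the pair is not reciprocal. That is exactly the situation where a closely related paralog can be mistaken for an ortholog.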

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The breaks between the sections of course represent chromosomal re-arrangements since these organisms diverged.

(Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2.)

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

With a single extra G inserted into the first sequence, everything downstream of the insertion falls out of register:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| |||||||||||   |    |     | |    | | |     |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
-ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG------

Although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
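The effect of a gap cost can be shown by scoring the 'cost free' alignment above twice, once with free gaps and once with a penalty per gapped column. This is a sketch with arbitrary scores chosen by me (+1 match, -1 mismatch, -2 per gap column), not BLAST's actual scoring scheme, which charges separately for opening and extending a gap. The leading and trailing gap characters that pad the second sequence are my reconstruction of the slide's layout.

```python
def score_alignment(top: str, bottom: str, match: int = 1,
                    mismatch: int = -1, gap: int = 0) -> int:
    """Score a gapped pairwise alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must be the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # one sequence has a gap in this column
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


# The two random sequences, aligned with as many gaps as it takes.
top = "GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA"
bottom = "-ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG------"

print("gaps cost nothing:", score_alignment(top, bottom, gap=0))
print("gaps cost 2 each: ", score_alignment(top, bottom, gap=-2))
```

With free gaps the meaningless alignment scores as well as a perfect 16-base match; once each gapped column costs something, its score collapses well below zero.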

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

query:

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

database:

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

COMPARE the query against the database, then LIST MATCHES.

Flavours of BLAST

Flavour   Query sequence  Other operation                             Database sequences  Speed
BLASTn    nucleotide      none                                        nucleotide          FAST
BLASTp    protein         none                                        protein             FAST
BLASTx    nucleotide      6 frame translation of query                protein             SLOW
tBLASTn   protein         6 frame translation of database             nucleotide          SLOWER
tBLASTx   nucleotide      6 frame translation of query and database   nucleotide          HORRIBLY SLOW

Example nucleotide sequence: ACGATAGATCCCATCCATAAAT

Example protein sequence: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:                  CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC
1st database sequence:  CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

Sliding one sequence along the other, and allowing a gap, eventually gives the best alignment:

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1  GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2  TAAGCAAATTTAATTTTGTTTACATTTTC
3  GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
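The indexing idea can be sketched in a few lines of Python. This is an illustration of the principle only, not BLAST's real data structures: the index here is keyed directly on the 8-letter word rather than on a word number, positions are 1-based as on the slide, and the function names are my own. The query used is the one from the next slide.

```python
from collections import defaultdict


def build_word_index(sequences: dict, word_size: int) -> dict:
    """Map every word in the database to its (sequence id, position) list."""
    index = defaultdict(list)
    for seq_id, sequence in sequences.items():
        for start in range(len(sequence) - word_size + 1):
            word = sequence[start:start + word_size]
            index[word].append((seq_id, start + 1))  # 1-based positions
    return index


def seed_hits(query: str, index: dict, word_size: int) -> list:
    """List (query position, sequence id, database position) word matches."""
    hits = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for seq_id, db_position in index.get(word, []):
            hits.append((start + 1, seq_id, db_position))
    return hits


# The three database sequences from the slide.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
index = build_word_index(database, word_size=8)

query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"
hits = seed_hits(query, index, word_size=8)
print("first hit (query pos, sequence, db pos):", hits[0])
print("number of word hits:", len(hits))
```

The index is built once per database, so each query only costs one lookup per word; the expensive alignment step is then limited to the sequences and positions that the word hits point to.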

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position

Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]
Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The E-value (also called the "expect value" or "expectation") is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (14 occurrences in the example stretch)

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC' (4 occurrences in the example stretch)

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG' (1 occurrence in the example stretch)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
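The same first-principles arithmetic can be written as a short Python sketch. It assumes the naive model used on the slide (every database letter is a possible start position and all four nucleotides are equally likely); the function name is illustrative only and this is not how BLAST computes its reported E-values.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Naive expectation: one start position per database letter, each
    matching the whole query with probability (1/alphabet_size)**length."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # 503,566,580 letters, rounded as on the slide

for length in (1, 2, 3, 60):
    estimate = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(length, f"{estimate:.1e}")
```

Evaluated exactly, 4^60 is about 1.3 x 10^36, so the 60-mer estimate comes out near 3.8 x 10^-28; the slide rounds 4^60 to 10^36, which gives 5.0 x 10^-28. Both are the same order of magnitude, which is all the argument needs.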

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (N = any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If the variant with the optional extra base is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
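The exercise answer can be reproduced with a small Python sketch. It assumes equal base frequencies and ignores edge effects at the ends of the 10k sequence; the helper names are illustrative, and the motif syntax handled is just single bases plus [..] classes as in ACC[TG]TA.

```python
import re


def motif_probability(motif, alphabet_size=4):
    """Chance that a random position matches a motif written with
    single bases and [..] character classes, e.g. 'ACC[TG]TA'."""
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        # A class like [TG] allows 2 bases; a plain base allows 1.
        allowed = len(token) - 2 if token.startswith("[") else 1
        probability *= allowed / alphabet_size
    return probability


def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences in a random sequence."""
    return sequence_length * motif_probability(motif)


print(1 / motif_probability("ACC[TG]TA"))   # average spacing between chance sites
print(expected_sites("ACC[TG]TA", 10_000))  # chance sites per 10k
```

An unconstrained extra position multiplies the probability by 4/4 = 1, which is why the variant with the optional base has the same expectation, and allowing either form doubles the total.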

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database  Size                 E-value
RefSeq    5.0 x 10^8 letters   5.0e-28
nr        1.4 x 10^10 letters  1.4e-26

The nr database is ~30 times bigger, and so the E-value is ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
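To get a feel for how many extra 'matches' a single gap allows, the Python sketch below enumerates every database string that would align to the query with exactly one one-base gap in the query. It is an illustration of the counting argument only (gap penalties are ignored), and the function name is made up.

```python
def one_insertion_variants(query, alphabet="ACGT"):
    """All distinct strings that align to `query` with exactly one
    single-base gap in the query (one extra base in the target)."""
    variants = set()
    for position in range(len(query) + 1):
        for base in alphabet:
            variants.add(query[:position] + base + query[position:])
    return variants


QUERY = "ACGTACGTACGT"
variants = one_insertion_variants(QUERY)
print(len(variants))  # distinct 13-mers that now also count as a 'match'
```

Under the naive model each of these 13-mers is four times rarer than the exact 12-mer, so allowing just this one kind of gap already adds roughly 40/4 = 10 times the original expectation, before deletions or longer gaps are considered.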

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
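A quick Python check confirms that the two alignments really do have the same percent identity, which is exactly why identity on its own cannot separate them. The function is a plain illustrative helper, not part of any BLAST tool.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned columns holding identical letters."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be the same length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
MATCH1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
MATCH2 = "ACGCACCTTCGTAGGT"

print(len(QUERY1), percent_identity(QUERY1, MATCH1))
print(len(QUERY2), percent_identity(QUERY2, MATCH2))
```

Both come out at 75% (45 of 60 and 12 of 16 positions), yet the slide's E-values for the two differ by many orders of magnitude because of the difference in match length.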

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because

ORTHOLOGY

BLAST: Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database...

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse ... smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Each residue is listed with the residues that can commonly substitute for it:

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids are acceptable at each position: about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
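The two protein estimates can be compared side by side in Python. This sketch uses the same naive model as the slides, with the '1 in 7' figure standing in for substitutability; it is not the statistics BLAST actually uses, and the function name is illustrative.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted on the slide
PEPTIDE_LENGTH = 20


def naive_e_value(db_letters, length, choices_per_position):
    """Expected chance matches if each position matches 1 time in
    `choices_per_position`."""
    return db_letters / choices_per_position ** length


exact = naive_e_value(REFSEQ_PROTEIN_LETTERS, PEPTIDE_LENGTH, 20)
tolerant = naive_e_value(REFSEQ_PROTEIN_LETTERS, PEPTIDE_LENGTH, 7)
print(f"exact residues only: {exact:.1e}")
print(f"allowing substitutes: {tolerant:.1e}")
```

Evaluated exactly, these come to about 3.3 x 10^-18 and 4.4 x 10^-9. Loosening the per-position odds from 1/20 to 1/7 moves the estimate by about nine orders of magnitude, most of the way towards the 8e-04 that BLAST reported.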

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
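If you want to script such runs, the command line above can be assembled in Python. This is a minimal sketch: it only builds the argument list from the flags named on these slides and does not run anything; the file name `query.fasta` and database name `refseq_rna` are hypothetical placeholders, and `build_blastall_command` is not part of any BLAST distribution.

```python
import shlex


def build_blastall_command(program, query_file, database, **options):
    """Return the argv list for a `blastall` run (not executed here).

    Extra single-letter options are passed as keywords, e.g. W=7 -> '-W 7'.
    """
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical file and database names, for illustration only.
argv = build_blastall_command("blastn", "query.fasta", "refseq_rna", W=7, e=100)
print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` on a machine where `blastall` and a formatted database are actually installed.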

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

ON / OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON / OFF

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn / tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) / Mismatch penalty = -1

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • END

Redundancy in the Genetic Code

GCA A alanine
GCC A
GCG A
GCT A

TGC C cysteine
TGT C

GAC D aspartate
GAT D

GGA G glycine
GGC G
GGG G
GGT G

'Synonymous' or 'silent' mutations in the third position of these codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
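The table above is small enough to test directly. The Python sketch below encodes only the twelve codons listed on the slide (not the full genetic code) and checks whether a point mutation is silent; `is_silent` is an illustrative helper name.

```python
# Only the codons shown on the slide; any other codon is treated as
# "codes for something else".
CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def is_silent(codon, position, new_base):
    """True if changing `codon` at 0-based `position` to `new_base`
    leaves the encoded amino acid unchanged."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODONS.get(mutated) == CODONS[codon]


print(is_silent("GCA", 2, "T"))  # third-position change within alanine
print(is_silent("GAC", 2, "A"))  # GAA is not an aspartate codon
print(is_silent("GCA", 0, "T"))  # first-position change leaves alanine
```

For alanine and glycine every third-position change is silent; for cysteine and aspartate only the C/T change is, so the redundancy is real but not uniform.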

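Protein Similarity Persists Longer

A DNA sequence accumulating one point mutation at a time:

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

Comparing the first and last sequences:

CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT

DNA identity: 67%

LSREPV
||||||
LSREPV

Protein identity: 100%

Comparing with a more diverged sequence:

CTATCACGAGAACCTGTG
TTGTCCCGGTCGCCAGTT

DNA identity: 44%

LSREPV
||| ||
LSRFPV

Protein identity: 80%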

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
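The comparison above can be reproduced in Python by translating both DNA sequences and measuring identity at each level. The sketch builds the standard genetic code from its usual compact 64-letter listing (base order T, C, A, G); the helper names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


DNA_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
DNA_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"

protein_1, protein_2 = translate(DNA_1), translate(DNA_2)
print(protein_1, protein_2)
print(f"DNA identity:     {percent_identity(DNA_1, DNA_2):.0f}%")
print(f"protein identity: {percent_identity(protein_1, protein_2):.0f}%")
```

The DNA sequences agree at 28 of 39 positions, while the translated proteins agree at 11 of 13, and one of the two differences (M/I) is a conservative change that BLAST scores as a positive.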

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment / data mining

gene sequence: what is its function?

database of proteins in other species

Cyclin-A

FoxA1

cdc25

alpha-tubulin

Predicted protein

Gravin-like

Sprouty-2

calmodulin

KIAA10786568

frizzled

Wint8

Troponin T3

Gravin-like

we can only do this because of implied function based on orthology

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity / orthologs

same function

But we know that function is largely determined by shape

similar shape

which in general we cannot determine, but it is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein -> BLAST against database of human proteins -> best match human protein -> BLAST against database of frog proteins -> is the best match the original frog protein?
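The reciprocal test itself is simple once the two sets of BLAST searches have been run. The Python sketch below assumes the best hit for every protein has already been extracted into two dictionaries; the protein names are made-up placeholders, not real search results.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best-hit tables, as if parsed from two all-against-all runs.
frog_to_human = {"frog_A": "human_1", "frog_B": "human_2", "frog_C": "human_2"}
human_to_frog = {"human_1": "frog_A", "human_2": "frog_B"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here frog_C also finds human_2 as its best match, but human_2 points back to frog_B, so frog_C is left without a reciprocal partner; this is the typical picture for a paralog.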

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The breaks between sections of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments:

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

With a single base inserted into the first sequence:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
-ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG------

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
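The trade-off can be made concrete with a small global alignment scorer in Python. This is a sketch of the general idea (Needleman-Wunsch with a single linear gap penalty), chosen for brevity; it is not the algorithm BLAST uses, which works from word hits and has separate gap opening and extension costs. The scoring values are arbitrary assumptions.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score (Needleman-Wunsch, linear gap cost).

    Only the score is returned; two rows of the table are kept at a time.
    """
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


SEQ_A = "GACACTAGGTCGATGCGTGGTGGCGAGA"  # the two random sequences from the slide
SEQ_B = "ACGCATCCGGATGTGCACCGTGGAACTG"

free = global_alignment_score(SEQ_A, SEQ_B, mismatch=0, gap=0)
costly = global_alignment_score(SEQ_A, SEQ_B, mismatch=-1, gap=-2)
print("cost-free gaps:", free)
print("penalised gaps:", costly)
```

With free gaps and mismatches the score is simply the largest number of letters that can be lined up in order, which is substantial even for unrelated sequences; once gaps and mismatches cost something, the same pair scores far lower.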

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

query

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

database

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

COMPARE

LIST MATCHES

Flavours of BLAST

program   query sequence   other operation                            database sequences   speed
BLASTn    nucleotide       (none)                                     nucleotide           FAST
BLASTp    protein          (none)                                     protein              FAST
BLASTx    nucleotide       6 frame translation of query               protein              SLOW
tBLASTn   protein          6 frame translation of database            nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database  nucleotide           HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT

Example protein query: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is placed against the 1st database sequence at successive offsets. At most offsets only scattered bases match; at the correct offset, with one gap inserted in the database sequence (CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC), almost every base matches.]

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
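The word-and-index idea can be sketched in a few lines of Python using the three database sequences above. As a simplifying assumption, the sketch keys the index on the word itself rather than on a number from a list of all possible words, and it uses a word size of 8 to match the 8-letter words shown; the function names are illustrative and this is not BLAST's actual data structure.

```python
from collections import defaultdict

WORD_SIZE = 8


def build_word_index(database, word_size=WORD_SIZE):
    """Map every word in the database to its (sequence id, 1-based position)."""
    index = defaultdict(list)
    for seq_id, sequence in database.items():
        for start in range(len(sequence) - word_size + 1):
            index[sequence[start:start + word_size]].append((seq_id, start + 1))
    return index


def seed_hits(query, index, word_size=WORD_SIZE):
    """(query position, sequence id, database position) for each shared word."""
    hits = []
    for start in range(len(query) - word_size + 1):
        for seq_id, db_pos in index.get(query[start:start + word_size], []):
            hits.append((start + 1, seq_id, db_pos))
    return hits


DATABASE = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
QUERY = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(DATABASE)  # done once, before any query arrives
hits = seed_hits(QUERY, index)      # one dictionary lookup per query word
print(hits[:3])
```

The index is built once per database; each query then costs only one lookup per word, and every hit already carries the relative offset between query and database sequence from which an alignment can be expanded.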

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible lsquowordsrsquo

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position

Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query 26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196
           MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct 1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
           SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct 60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also called "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
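The arithmetic above can be reproduced with a short Python sketch. It assumes, as the slide does, that all four nucleotides are equally likely and independent and that only exact matches count; the function name expected_chance_matches is made up for this illustration.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches for a query of the given length."""
    return database_letters / alphabet_size ** query_length

REFSEQ_MRNA_LETTERS = 5.0e8  # approximate size of the RefSeq mRNA database quoted above

for k in (1, 2, 3, 60):
    print(k, f"{expected_chance_matches(REFSEQ_MRNA_LETTERS, k):.2e}")
```

Note that 4^60 is about 1.3 x 10^36, so the unrounded 60-mer result is nearer 3.8 x 10^-28; rounding 4^60 down to 10^36, as on the slide, gives 5.0 x 10^-28. Either way the order of magnitude is the same.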

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. either ACC[TG]TA or ACC[TG]TA with one extra base inserted after the [TG]?

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If the form with the additional base is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
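The answer can be checked with a small Python sketch. The motif syntax handled here (plain bases, bracketed alternatives, and N for "any base") and the function names are assumptions chosen for the example; in particular, writing the optional extra base as N is my notation, not the slide's. Edge effects and the reverse strand are ignored, as in the slide.

```python
import re

def motif_probability(motif, alphabet_size=4):
    """Chance that a random position starts a match to a motif such as 'ACC[TG]TA'."""
    probability = 1.0
    for bracketed, single in re.findall(r"\[([ACGT]+)\]|([ACGTN])", motif):
        if bracketed:
            allowed = len(bracketed)          # e.g. [TG] allows 2 of the 4 bases
        else:
            allowed = alphabet_size if single == "N" else 1
        probability *= allowed / alphabet_size
    return probability

def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences in a random sequence of the given length."""
    return sequence_length * motif_probability(motif)

basic = expected_sites("ACC[TG]TA", 10_000)
with_extra_base = expected_sites("ACC[TG]NTA", 10_000)
print(round(basic, 1), round(basic + with_extra_base, 1))  # prints: 4.9 9.8
```

Because an unspecified extra base matches with probability 1, the longer form is exactly as likely as the short one, which is why allowing either doubles the expected count.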

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had ~5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance, using the same example stretch of database sequence as before?

Query = 'A'

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size. BLAST the same sequence against each database:

database   size                  E-value
RefSeq     5.0 x 10^8 letters    5.0e-28
nr         1.4 x 10^10 letters   1.4e-26

The nr database is ~30 times bigger, and so the E-value was ~30 times bigger.
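A minimal sketch of this scaling, assuming the simple model used so far in which the E-value is directly proportional to the number of letters searched (the function name rescale_evalue is hypothetical):

```python
def rescale_evalue(evalue, old_database_letters, new_database_letters):
    """Rescale an E-value for a different database size (E is proportional to size)."""
    return evalue * new_database_letters / old_database_letters

REFSEQ_LETTERS = 5.0e8
NR_LETTERS = 1.4e10

size_ratio = NR_LETTERS / REFSEQ_LETTERS
theoretical_nr = rescale_evalue(5.0e-28, REFSEQ_LETTERS, NR_LETTERS)
reported_ratio = 6e-25 / 2e-26  # Expect values NCBI reported for nr and for RefSeq mRNA

print(round(size_ratio), f"{theoretical_nr:.1e}", round(reported_ratio))
```

The theoretical ratio (28) and the ratio between the two Expect values actually reported by NCBI (30) agree well, even though the absolute values differ from the hand calculation.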

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
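To see that the two matches really do have the same identity despite their very different significance, here is a small Python check (the helper percent_identity is a name made up for this sketch; it assumes the two sequences are already aligned without gaps):

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions that carry the same letter."""
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)

query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))  # prints: 60 75.0
print(len(query2), percent_identity(query2, match2))  # prints: 16 75.0
```

Identity alone cannot distinguish the two cases; the length of the match is what separates a vanishingly unlikely chance hit from a routine one.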

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

BLAST gives similarity + probability (match length, PI, size of database...)

ORTHOLOGY depends on biological knowledge, nature of query sequence, phylogenetic relationship

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value    (larger = worse, smaller = better)
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.3 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query 3    SSSSFRAYRAALSEVEPPCI 62
           SSSSFRAYRAALSEVEPPCI
Sbjct 972  SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

residue  substitutable by      residue  substitutable by
A        S                     N        D H S
C        -                     Q        R E H K
F        L W Y                 S        A N T
G        -                     T        S
I        L M V                 Y        H F W
L        I M F V               H        N Q Y
M        I L V                 K        R Q E
P        -                     R        Q K
V        I L M                 D        N E
W        F Y                   E        D Q K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
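The two protein estimates can be reproduced with the same simple model as for nucleotides. In this sketch the "effective alphabet size" of 7 is the slide's rough figure for substitutability, not a value taken from BLOSUM or PAM, and the function name is made up for the example.

```python
def expected_chance_matches(database_letters, query_length, effective_alphabet):
    """Expected chance matches when each position matches with probability 1/effective_alphabet."""
    return database_letters / effective_alphabet ** query_length

REFSEQ_PROTEIN_LETTERS = 347_895_532  # total letters in the RefSeq protein database quoted above

exact_only = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matches only: {exact_only:.1e}")
print(f"allowing substitutions: {with_substitutions:.1e}")
```

Allowing roughly two substitutes per residue moves the estimate by about nine orders of magnitude, from around 10^-18 to around 10^-9, which accounts for most of the gap to the 8e-04 that BLAST reported.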

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
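If you script your searches, the command line above can be assembled programmatically. The following Python sketch only builds and prints the command using options from the list above; the helper name build_blastall_command and the file name morpholino.fasta are hypothetical, and actually running the command would require a local blastall installation and a pre-indexed database.

```python
import shlex

def build_blastall_command(program, query_file, database, evalue=None, word_size=None):
    """Assemble a blastall command line from a few of the options described above."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        command += ["-e", str(evalue)]      # -e: max expected value
    if word_size is not None:
        command += ["-W", str(word_size)]   # -W: word size
    return command

command = build_blastall_command("blastn", "morpholino.fasta", "nr", evalue=100, word_size=7)
print(shlex.join(command))  # prints: blastall -p blastn -i morpholino.fasta -d nr -e 100 -W 7
```

Options that are left out simply fall back to the defaults for the chosen blast flavour, as described above.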

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1: BLASTn, BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B: ON, OFF

Results for Exercise 2C: ON, OFF

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i): BLASTn, tBLASTx

Results for Exercise 4 (ii): Mismatch penalty = -2 (default), Mismatch penalty = -1

BLAST Parameters Exercises

5 E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Protein Similarity Persists Longer

A DNA sequence changing one base at a time:

CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT

First against last:

CTATCACGAGAACCTGTG      LSREPV
|| || || || || ||       ||||||
CTGTCCCGTGAGCCAGTT      LSREPV

DNA identity 67%, protein identity 100%

A second comparison:

CTATCACGAGAACCTGTG      LSREPV
                        ||| ||
TTGTCCCGGTCGCCAGTT      LSRFPV

DNA identity 44%, protein identity 80%

Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
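The comparison above can be reproduced with a short Python sketch. It builds the standard genetic code from the usual TCAG ordering and counts identical positions; the function names translate and identity are simply names chosen for this example.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: codons enumerated with T, C, A, G at each position.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA sequence in frame 1, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identity(seq_a, seq_b):
    """Return (identical positions, alignment length) for two equal-length sequences."""
    return sum(x == y for x, y in zip(seq_a, seq_b)), len(seq_a)

dna_1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna_2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"

print(identity(dna_1, dna_2))                         # prints: (28, 39)
print(translate(dna_1), translate(dna_2))             # prints: MNAAYDCRARMLR MKAAYDCRARILR
print(identity(translate(dna_1), translate(dna_2)))   # prints: (11, 13)
```

So the DNA sequences are 72% identical (28/39) while the proteins are 85% identical (11/13), and one of the two protein differences (M/I) is a conservative substitution.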

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two sequences comparison.

Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

[Figure: experiment and data mining give a gene sequence (what is its function?), which is compared with a database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wnt8, Troponin T3. The match picked out is Gravin-like.]

We can only do this because of implied function based on orthology.

Functional Orthologs

[Figure: a Xenopus gene (function unknown) and a human gene (function known, annotation 'Gravin' available) are linked by sequence similarity (orthologs), from which we infer the same function.]

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

[Figure: a frog protein is BLASTed against a database of human proteins; the best match human protein is then BLASTed back against a database of frog proteins.]
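Once the two BLAST runs have been reduced to "best hit" tables, the reciprocal check itself is tiny. The sketch below assumes those tables already exist as Python dictionaries; the protein identifiers are invented placeholders, not real accessions.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep only pairs where A's best hit is B and B's best hit is A."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}

# Hypothetical best-hit tables from the two BLAST searches.
frog_to_human = {"frog_1": "human_7", "frog_2": "human_3", "frog_3": "human_3"}
human_to_frog = {"human_7": "frog_1", "human_3": "frog_3"}

orthologs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(orthologs)  # prints: {'frog_1': 'human_7', 'frog_3': 'human_3'}
```

Here frog_2 is dropped because its best human match prefers a different frog protein, which is the kind of case (for example a paralog) that the reciprocal step is meant to filter out.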

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

[Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments:

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences. Lined up without gaps, only the middle stretch matches:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
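To make the trade-off concrete, the sketch below scores the 'cost free' alignment above twice: once with free gaps and once with gap penalties. The scoring values (match +1, mismatch -1, gap open -5, gap extend -2) and the function name are assumptions chosen for illustration, not BLAST's defaults. The overhanging ends are written as gaps and, as a simplification, are penalised like any other gap.

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score a finished alignment; '-' marks a gap in either sequence."""
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            # The first column of a run of gaps pays gap_open, later columns pay gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

top = "GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA"
bottom = "-ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG------"

free_gaps = alignment_score(top, bottom, gap_open=0, gap_extend=0)
penalised = alignment_score(top, bottom)
print(free_gaps, penalised)  # prints: 16 -62
```

With free gaps the random pair scores as well as 16 perfect matches; once gaps cost something the same alignment scores far below zero and would never be reported.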

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query + database -> COMPARE -> LIST MATCHES

Flavours of BLAST

program   query sequence   other operation                              database sequences   speed
BLASTn    nucleotide       none                                         nucleotide           FAST
BLASTp    protein          none                                         protein              FAST
BLASTx    nucleotide       6 frame translation of query                 protein              SLOW
tBLASTn   protein          6 frame translation of database              nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database    nucleotide           HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT

Example protein query: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:                  CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:  CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence and the matches are counted at each offset; the best alignment needs one gap in the database sequence.]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...
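For illustration, here is what the slow, exhaustive approach looks like in Python for the ungapped case: slide the query along one target and count matches at every offset. The function name and the toy sequences are assumptions for this sketch; a real search would also have to consider gaps and repeat this for every database sequence, which is why it does not scale.

```python
def best_ungapped_offset(query, target):
    """Try every offset of query against target; return (offset, matches) for the best one."""
    best = (0, -1)
    for offset in range(-len(query) + 1, len(target)):
        matches = sum(
            1
            for i, base in enumerate(query)
            if 0 <= offset + i < len(target) and target[offset + i] == base
        )
        if matches > best[1]:
            best = (offset, matches)
    return best

# Toy example: the query occurs exactly in the target, starting 2 bases in.
print(best_ungapped_offset("GATTACA", "CCGATTACAGG"))  # prints: (2, 7)
```

The work grows with (query length x target length) for every pair compared, so doing this against hundreds of millions of database letters is impractical.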

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

word      number
AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
GACAAATC  33568
GACAAATG  33569
TCCAAACC  64321
TCCAAACC  64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse QUERY SEQUENCE against the numbered list of all possible 'words':

position  word
1         14236
2         33658
3         07967

and look the words up in the index of the database shown above.

Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position

Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST ndashTypical OutputINPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUTQuery= (311 letters) Database NCBI Protein Reference Sequences 954378 sequences 347895532 total letters

gtgi|41055060|ref|NP_9574201| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)Expect = 6e-31 Identities = 7698 (77) Positives = 8298 (83) Gaps = 498 (4) Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also called "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

[Figure: each occurrence of the query is marked under this example sequence: 14 for 'A', 4 for 'AC', 1 for 'ACG']

Query = 'A': expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC': expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG': expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
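The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch of the first-principles estimate (database letters divided by 4 to the power of the query length), not of how BLAST itself computes E-values; the function name is invented for the example. Note that the slide rounds 4^60 to 10^36, whereas the exact value is about 1.3 x 10^36, so the computed figure for the 60-mer comes out slightly below 5.0 x 10^-28.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches for a query of the given length.

    Assumes every letter is equally likely and positions are independent.
    """
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~503,566,580 letters, rounded as on the slide

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: {expected:.1e} expected chance matches")
```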

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query 1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. if either form of the motif counts as a match?

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10 kb by chance.

If ACC[TG]TAA is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
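The exercise answer can be checked numerically. In the sketch below a motif is described by the number of allowed bases at each position; the function names are made up, and the 7-position variant assumes the optional extra position may hold any of the four bases (an assumption, since the slide does not spell out the longer motif unambiguously).

```python
from math import prod


def motif_probability(allowed_per_position, alphabet_size=4):
    """Chance that a random site matches, given the number of allowed bases per position."""
    return prod(n / alphabet_size for n in allowed_per_position)


def expected_sites(sequence_length, allowed_per_position):
    """Expected number of chance matches in a random sequence (edge effects ignored)."""
    return sequence_length * motif_probability(allowed_per_position)


base_motif = [1, 1, 1, 2, 1, 1]        # A C C [TG] T A
with_extra = [1, 1, 1, 2, 4, 1, 1]     # assumption: any base in the optional extra position

in_10kb = expected_sites(10_000, base_motif)
either_form = in_10kb + expected_sites(10_000, with_extra)
print(f"ACC[TG]TA alone: {in_10kb:.2f} per 10 kb")
print(f"either form:     {either_form:.2f} per 10 kb")
```

Because a position that accepts any base contributes a factor of 1, the longer form is exactly as likely as the shorter one, which is why allowing both doubles the expectation.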

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A': expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC': expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG': expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~5.0 x 10^-19.

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

[Diagram: BLAST gives similarity + probability; ORTHOLOGY also involves biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database...]

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <-----> smaller/better

E-value   10^-5     10^-10       10^-40        10^-100       0.0
          fantasy   borderline   encouraging   pretty good   can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query 3    SSSSFRAYRAALSEVEPPCI 62
           SSSSFRAYRAALSEVEPPCI
Sbjct 972  SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue: acceptable substitutes

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
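The two protein estimates can be compared side by side with a short calculation. This is a sketch of the slide's back-of-envelope reasoning only (database letters multiplied by the per-position match chance raised to the query length); the function name is invented, and real BLAST statistics are derived from the scoring matrix rather than from a flat 1/7.

```python
def expected_chance_matches(database_letters, query_length, match_chance):
    """Expected chance matches if each position 'matches' with the given probability."""
    return database_letters * match_chance ** query_length


REFSEQ_PROTEIN_LETTERS = 347_895_532  # from the BLAST output header earlier in the deck

strict = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 20)   # identical residues only
relaxed = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 1 / 7)   # ~3 acceptable residues of 20

print(f"exact matches only:        {strict:.1e}")
print(f"allowing substitutions:    {relaxed:.1e}")
print(f"ratio relaxed / strict:    {relaxed / strict:.1e}")
```

The jump of roughly nine orders of magnitude comes entirely from changing the per-position chance from 1/20 to about 1/7.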

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
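For scripted searches it can help to assemble the command line in code. The sketch below only builds and prints the argument list using the -p, -i and -d options described on this slide; it does not run anything. The file name `query.fasta` and database name `refseq_rna` are hypothetical placeholders, and the helper function is invented for the example.

```python
import shlex


def build_blastall_command(program, query_file, database, extra_options=None):
    """Assemble a blastall argument list: -p <flavour> -i <query file> -d <database>."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Any further options are passed through as flag/value pairs, in the order given.
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command


# Hypothetical file and database names, for illustration only.
cmd = build_blastall_command(
    "blastn", "query.fasta", "refseq_rna", {"-e": 1e-5, "-W": 11}
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later, should blastall be installed locally.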

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises
1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1: BLASTn and BLASTx

BLAST Parameters Exercises
2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B: filtering ON vs OFF

Results for Exercise 2C: filtering ON vs OFF

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises
3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL section, "Align two sequences".

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i): BLASTn vs tBLASTx

Results for Exercise 4 (ii): mismatch penalty = -2 (default) vs mismatch penalty = -1

BLAST Parameters Exercises
5 E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6 Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Always Compare Protein Sequences

DNA comparison:

ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA

Amino acid comparison:

MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
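To make the point concrete, the sketch below translates the two DNA sequences from this slide with the standard genetic code and counts identical positions at each level. The compact codon table construction and the helper names are choices made for this example, not something taken from the workshop files.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any incomplete trailing codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Count columns where two equal-length sequences carry the same letter."""
    return sum(x == y for x, y in zip(a, b))


dna1 = "ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG"
dna2 = "ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA"
protein1, protein2 = translate(dna1), translate(dna2)

print(f"DNA identity:     {identical_positions(dna1, dna2)}/{len(dna1)}")
print(f"protein identity: {identical_positions(protein1, protein2)}/{len(protein1)}")
```

Most of the DNA differences fall in third codon positions, which is why they disappear after translation.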

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.

Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp...

Answers Exercise 1

The Essential Task

experiment -> gene sequence -> what is its function?

data mining -> database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3) -> Gravin-like

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown

[Diagram: sequence similarity -> orthologs -> same function]

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

[Diagram: frog protein -> database of human proteins -> best match human protein -> database of frog proteins]

If the best match for that human protein in the frog database is the frog protein we started with, the two are reciprocal best hits.
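The reciprocal test itself is simple once the two sets of best hits are in hand. The sketch below assumes the searches have already been run in both directions and summarised as dictionaries of best hits; the protein identifiers are invented for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where b is a's best hit and a is also b's best hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


# Hypothetical best-hit tables (frog query -> best human hit, and the reverse).
frog_to_human = {"frog_gravin": "human_gravin", "frog_cycA1": "human_cycA2"}
human_to_frog = {"human_gravin": "frog_gravin", "human_cycA2": "frog_cycA2"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

In this toy data only the gravin pair survives, because the best frog hit for the human cyclin is a different frog protein from the one that was searched with.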

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.

[Diagram: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2] These of course represent chromosomal re-arrangements since these organisms diverged.

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments:

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

The same comparison after a single-base insertion (an extra G) in the first sequence:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
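The effect of a gap cost can be seen by scoring the 'cost free' alignment from this slide under two schemes. The scoring function below is a minimal sketch with arbitrary illustrative values (match +1, mismatch -1, gap -2); BLAST's actual defaults and its separate gap-open and gap-extend costs differ. The rows are padded with end gaps so both have the same length, which is an assumption made only for the example.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows column by column; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


# The gappy alignment of two random sequences, padded with end gaps.
top = "GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA"
bottom = "-" + "ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG" + "-" * 6

print("free gaps:     ", score_alignment(top, bottom, gap=0))
print("gap cost of -2:", score_alignment(top, bottom))
```

With free gaps the alignment scores one point for each of its 16 matches; once every gap column costs something, the same alignment scores well below zero and would not be reported.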

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query -> COMPARE against database -> LIST MATCHES]

Flavours of BLAST

Flavour   Query sequence   Other operation                    Database sequences   Speed
BLASTn    nucleotide       none                               nucleotide           FAST
BLASTp    protein          none                               protein              FAST
BLASTx    nucleotide       6-frame translation of query       protein              SLOW
tBLASTn   protein          6-frame translation of database    nucleotide           SLOWER
tBLASTx   nucleotide       6-frame translation of both        nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the database sequence one position at a time; most offsets give only scattered matches, and the best alignment needs a single gap:]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...
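To see why the brute-force approach is slow, here is a minimal sketch that tries every ungapped offset of the query against one database sequence and counts matching letters. It is a toy model of the 'test all possible alignments' idea, with an invented function name, and it ignores gaps entirely.

```python
def best_ungapped_offset(query, target):
    """Slide query along target; return (matches, offset) for the best ungapped placement.

    offset is the 0-based target position aligned with the first query letter
    (negative when the query overhangs the start of the target).
    """
    best = (0, 0)
    for offset in range(-(len(query) - 1), len(target)):
        matches = 0
        for i, base in enumerate(query):
            j = offset + i
            if 0 <= j < len(target) and target[j] == base:
                matches += 1
        if matches > best[0]:
            best = (matches, offset)
    return best


query = "CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT"
database_sequence = "CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC"

matches, offset = best_ungapped_offset(query, database_sequence)
print(f"best ungapped placement: offset {offset}, {matches} matching letters")
```

Every offset costs a full pass over the query, so the work grows with query length times database length, and that is before gaps are even considered.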

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
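The indexing idea on these two slides can be sketched in a few lines. The example below indexes the three database sequences by 8-letter word and then looks up each word of the query; it stores the words themselves as dictionary keys rather than the word numbers shown on the slide, and the function names are invented for illustration. It shows the lookup step only, not BLAST's extension of seeds into alignments.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map every k-letter word to the (sequence name, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos + 1))
    return index


def find_seeds(query, index, k):
    """Return (query position, sequence name, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos + 1, name, dpos))
    return seeds


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, k=8)   # built once, reused for every query
seeds = find_seeds(query, index, k=8)
print(seeds[:3])
```

The index is the expensive part and is built ahead of time; each query then needs only one dictionary lookup per word, and only the sequences that share a word are examined further.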

Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position

Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST ndashTypical OutputINPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUTQuery= (311 letters) Database NCBI Protein Reference Sequences 954378 sequences 347895532 total letters

gtgi|41055060|ref|NP_9574201| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)Expect = 6e-31 Identities = 7698 (77) Positives = 8298 (83) Gaps = 498 (4) Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also called "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Database fragment: CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (matches many positions in the fragment): expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC' (matches a few positions): expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG' (matches once): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' (60-mer): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
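The arithmetic above can be sketched in a few lines of Python. This is only the slide's simplified model (exact, ungapped matches, all four bases equally likely), not the statistics BLAST itself uses; the function name is made up for illustration. Note that 4^60 is about 1.3 x 10^36, so the exact figure comes out slightly below the rounded 5.0 x 10^-28 quoted above.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of chance exact matches under the simple model:
    every letter is equally likely and positions are independent."""
    return database_letters / alphabet_size ** query_length


# Database size taken from the statistics slide (RefSeq mRNA, July 2005).
REFSEQ_MRNA_LETTERS = 503_566_580

for k in (1, 2, 3, 60):
    e = expected_exact_matches(REFSEQ_MRNA_LETTERS, k)
    print(f"query length {k:2d}: expected chance matches = {e:.1e}")
```

Passing a different `database_letters` value (for example the nr total) shows directly how the expectation scales with database size.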

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.

Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt.

Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
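The answer above is a product of per-position probabilities, which is easy to reproduce. The sketch below assumes equal base frequencies and independent positions, as the slide does; the function name and the list encoding of the motif (number of acceptable bases at each position) are illustrative choices, not part of the workshop material.

```python
def expected_motif_hits(sequence_length, choices_per_position, alphabet_size=4):
    """Expected chance occurrences of a motif in a random sequence.

    choices_per_position lists how many bases are acceptable at each
    motif position, e.g. ACC[TG]TA -> [1, 1, 1, 2, 1, 1].
    """
    probability = 1.0
    for choices in choices_per_position:
        probability *= choices / alphabet_size
    return sequence_length * probability


# ACC[TG]TA in a 10k promoter region: one match expected every 2048 nt.
hits = expected_motif_hits(10_000, [1, 1, 1, 2, 1, 1])
print(f"expected chance hits in 10 kb: {hits:.2f}")
```

A position where any base is acceptable would be encoded as 4, which leaves the expectation unchanged.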

E-values: Effect of Database Size

The nr nucleotide database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Database fragment: CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A': expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC': expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG': expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' (60-mer): expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each database:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

    ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
    ||||||||||||    |||||| ||||||    |||||| ||||||
    ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
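To get a feel for how quickly gaps multiply the number of acceptable targets, the sketch below enumerates every database string that would match a query if a single one-base gap were allowed inside the query. It is a toy illustration with an invented function name, and it ignores gaps on the database side and longer gaps, both of which increase the count further.

```python
def single_insertion_variants(query, alphabet="ACGT"):
    """All distinct target strings that align to `query` with exactly one
    extra base in the target (i.e. a one-base gap inside the query)."""
    return {
        query[:i] + base + query[i:]
        for i in range(1, len(query))  # internal positions only
        for base in alphabet
    }


variants = single_insertion_variants("ACGTACGTACGT")
print(len(variants), "extra target strings match with one gap allowed")
```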

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

    CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
    ||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
    CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

    ACGTACGTACGTACGT
    ||| || | |||| ||
    ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
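A quick check confirms that the two alignments above really do share the same percent identity, even though their significance differs enormously. The helper below is a minimal sketch (its name is arbitrary) that counts identical columns in two equal-length, ungapped aligned strings.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences have the same letter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Identity alone cannot separate the two cases; the length of the match has to enter the calculation, which is what the E-value does.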

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

[Figure: ORTHOLOGY (biological knowledge, nature of query sequence, phylogenetic relationship) versus BLAST similarity + probability (match length, PI, size of database…)]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger = worse, smaller = better

E-values:  10^-5    10^-10      10^-40       10^-100      0.0
           fantasy  borderline  encouraging  pretty good  can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Substitutes
A        S C
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y
N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W
H        N Q Y
K        R Q E
R        Q K
D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
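The two back-of-envelope protein estimates can be reproduced as follows. This is only the slide's simplified model (a fixed per-position chance of 'matching', no gaps, no real scoring matrix); the variable names are illustrative.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # from the BLAST output header above
MATCH_LENGTH = 20                     # the 20-residue query

# Exact matching: each position matches by chance 1 time in 20.
strict = REFSEQ_PROTEIN_LETTERS / 20 ** MATCH_LENGTH

# Allowing ~2 acceptable substitutes per residue: roughly 1 chance in 7.
relaxed = REFSEQ_PROTEIN_LETTERS / 7 ** MATCH_LENGTH

print(f"exact residues only:    {strict:.1e}")
print(f"with substitutions (7): {relaxed:.1e}")
```

The relaxed estimate lands near 4 x 10^-9, still well short of the reported 8e-04, which is consistent with the slide's remark that gapped alignments also have to be accounted for.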

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
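When the same search has to be repeated with different options, it can help to assemble the command line programmatically. The sketch below only builds the argument list for the legacy 'blastall' program described here; it does not run anything. The file and database names are placeholders, and it assumes a local blastall installation (newer BLAST+ releases ship separate programs with different option names).

```python
def blastall_command(program, query_file, database, **options):
    """Build an argument list for the legacy 'blastall' command.

    Extra keyword arguments become single-letter options,
    e.g. e=100 -> "-e 100", W=7 -> "-W 7".
    """
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Placeholder file and database names, for illustration only.
cmd = blastall_command("blastn", "query.fa", "refseq_rna", e=100, W=7)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd)` on a machine where blastall and a formatted database are available.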

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful).

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Figure: BLASTn and BLASTx result screenshots]

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Figure: BLASTn – low complexity filtering OFF]

Results for Exercise 2A (ON)

[Figure: BLASTn – low complexity filtering ON]

Results for Exercise 2B

[Figure: results with filtering ON and OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Figure: results with filtering ON and OFF]

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Figure: BLASTn and tBLASTx results]

Results for Exercise 4 (ii)

[Figure: results with mismatch penalty = -2 (default) and mismatch penalty = -1]

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST – Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Exercise 1: nucleotide vs amino acid search

Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.

Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align' – this will do a direct DNA/DNA comparison.

Now find the open reading frames of the two genes and translate them into amino acid protein sequences, then repeat the two sequences comparison.

Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.

Before you hit 'Align', change the 'Program' (top left) to blastp…

Answers Exercise 1

The Essential Task

experiment -> data mining

gene sequence -> what is its function?

database of proteins in other species:

Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3

best match: Gravin-like

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available.

Xenopus gene: function unknown.

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x
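The reciprocal-best-hit logic can be sketched independently of BLAST itself. The function below assumes the two searches have already been run and summarised as 'best hit' dictionaries; the protein names in the example are invented purely for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where b is a's best hit in species B
    and a is, in turn, b's best hit back in species A."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best-hit tables (frog -> human and human -> frog).
frog_to_human = {"frog_gravin": "human_gravin", "frog_p2": "human_x"}
human_to_frog = {"human_gravin": "frog_gravin", "human_x": "frog_p9"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here only the first pair survives the return search; 'frog_p2' is dropped because its best human hit points back to a different frog protein.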

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

[Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments:

    ATGCATGCTGCCAACGGATGTCCTG
    ||| | || ||| ||| | ||||||
    ATGAAAGCCGCCTACGAAAGTCCTG

    ATGCATGCTGGCCAACGGATGTCCTG
    ||| | || | | |    |   |
    ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

     TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
     | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
    TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

    TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
    |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
    TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

    GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
     ||  | |    |  | | |||   ||||   ||
     ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
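The trade-off between matches and gaps can be made concrete with a small scoring function. This is a minimal sketch using a flat per-gap-column cost; the default match, mismatch and gap values are arbitrary illustrations rather than BLAST's defaults, and real programs usually distinguish gap opening from gap extension.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned strings ('-' marks a gap) column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap          # pay for every gapped column
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# A toy gappy alignment: 4 matching columns, 4 gapped columns.
top, bottom = "AC--GT--", "ACTTGTAA"
print(score_alignment(top, bottom, gap=0))   # 'cost free' gaps
print(score_alignment(top, bottom, gap=-2))  # gaps now cost something
```

With free gaps the toy alignment looks perfect; once each gap column costs 2, its score drops below zero, which is how spurious gappy alignments get rejected.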

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query + database -> COMPARE -> LIST MATCHES

Flavours of BLAST

Program  Query sequence  Other operation                             Database sequences  Speed
BLASTn   nucleotide      none                                        nucleotide          FAST
BLASTp   protein         none                                        protein             FAST
BLASTx   nucleotide      6 frame translation of query                protein             SLOW
tBLASTn  protein         6 frame translation of database             nucleotide          SLOWER
tBLASTx  nucleotide      6 frame translation of query and database   nucleotide          HORRIBLY SLOW
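The '6 frame translation' step used by the translated BLAST flavours can be sketched as below. This assumes the standard genetic code and plain A/C/G/T input; the function names are illustrative, and any codon containing another character is written as 'X'.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse-complement a DNA string (A<->T, C<->G)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


for frame, protein in six_frame_translations("ATGTCCGGGAAAATAGAG").items():
    print(f"{frame:+d}: {protein}")
```

Each nucleotide sequence yields six candidate proteins, so every translated comparison multiplies the work, which is one reason the translated flavours are slower.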

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:                  CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:  CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence, one position at a time, until the best alignment is found]

    CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
         ||||||||||||||||||||||||| |||||||||||||||||||||||||
         CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences
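For contrast with BLAST's word-based shortcut, here is a minimal sketch of the brute-force approach described above: slide the query along one target at every possible offset and count matching positions. It handles ungapped alignments only, and the function name is made up. Repeating this for every sequence in a large database is what makes the naive method so slow.

```python
def best_ungapped_offset(query, target):
    """Try every relative offset of query against target and return
    (number_of_matches, offset) for the best ungapped placement.

    offset is the target index aligned with query[0]; it may be negative
    when the query overhangs the start of the target.
    """
    best = (0, 0)
    for offset in range(-(len(query) - 1), len(target)):
        matches = sum(
            1
            for i, base in enumerate(query)
            if 0 <= i + offset < len(target) and target[i + offset] == base
        )
        if matches > best[0]:
            best = (matches, offset)
    return best


print(best_ungapped_offset("GATT", "CCGATTCC"))
```

The work grows with (query length x target length) for each database sequence, before gaps are even considered.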

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE:

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
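The index-then-lookup idea can be sketched with a dictionary keyed on the words themselves rather than on word numbers. This is a simplified illustration, not BLAST's actual data structure; the function names are made up, positions are 1-based to match the tables above, and the word size of 8 follows the 8-letter words shown.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word in every database sequence to its (sequence, position) list."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


def seed_hits(query, index, word_size):
    """Look up each query word; return (query_pos, sequence, database_pos) tuples."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            hits.append((qpos + 1, seq_id, dpos))
    return hits


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)
hits = seed_hits(query, index, word_size=8)
print(hits[:3])
```

The index is built once per database, so each new query only costs a series of dictionary lookups before any alignment extension is attempted.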

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11:

    TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
    ||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
    TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC
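The example above can be checked mechanically: the longest run of consecutive identical columns is shorter than the default word size, so a word of 11 cannot seed this alignment. The helper names below are illustrative.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical columns."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def shared_words(a, b, word_size):
    """Words of the given size that occur in both sequences."""
    words_a = {a[i:i + word_size] for i in range(len(a) - word_size + 1)}
    words_b = {b[i:i + word_size] for i in range(len(b) - word_size + 1)}
    return words_a & words_b


seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

print("longest identical run:", longest_identical_run(seq_a, seq_b))
print("shared 10-letter words:", sorted(shared_words(seq_a, seq_b, 10)))
```

Lowering the word size makes such alignments findable again, at the cost of many more seed hits to extend.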

Care is sometimes needed…

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

    RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
     F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
    NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also called "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G and T. How many matches do we expect to find by chance?

Query = 'A'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
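The first-principles estimate on this slide is just the database size divided by 4 once per query letter. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and it assumes equal base frequencies and exact, ungapped matches only, which is not how BLAST itself computes E-values. Note that 4^60 is about 1.3 x 10^36, so the unrounded result is about 3.8 x 10^-28, the same order of magnitude as the rounded figure on the slide.

```python
def expected_exact_matches(db_letters: float, query_length: int, alphabet_size: int = 4) -> float:
    """Expected number of chance exact matches of a query in a random database.

    Assumes every letter is equally likely and independent (a simplification).
    """
    return db_letters / alphabet_size ** query_length


refseq_mrna_letters = 5.0e8  # ~size of RefSeq mRNA quoted in the slides

for length in (1, 2, 3, 60):
    print(length, expected_exact_matches(refseq_mrna_letters, length))
```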

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1    ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query 1    ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (N = any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt.
Expect 'ACC' every 4x4x4 = 64 nt.

Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64x2 = 128 nt.

Expect 'TA' every 4x4 = 16 nt.
Expect 'ACC[TG]TA' every 128x16 = 2048 nt (4x4x4x2x4x4).
We would expect ~5 of these promoter sites every 10 kb by chance.

If ACC[TG]NTA is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
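The same counting argument can be written as a few lines of Python. This is only a sketch under the slide's assumptions (equal base frequencies, independent positions, edge effects ignored); the function name and the "allowed bases per position" encoding are illustrative choices, not part of the workshop material.

```python
def expected_motif_hits(allowed_bases_per_position, sequence_length):
    """Expected chance occurrences of a motif in a random sequence.

    allowed_bases_per_position lists how many of the 4 bases are accepted
    at each motif position (1 = fixed base, 2 = e.g. [TG], 4 = any base).
    """
    probability = 1.0
    for allowed in allowed_bases_per_position:
        probability *= allowed / 4
    return sequence_length * probability


basic = expected_motif_hits([1, 1, 1, 2, 1, 1], 10_000)          # ACC[TG]TA
with_extra = expected_motif_hits([1, 1, 1, 2, 4, 1, 1], 10_000)  # one extra base of any identity
print(basic, basic + with_extra)
```

The extra "any base" position multiplies the probability by 4/4 = 1, which is why the longer variant is expected just as often as the original motif and allowing either roughly doubles the count.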

E-values: Effect of Database Size

The nr nucleotide database has ~1.4 x 10^10 letters (RefSeq mRNA had ~5.0 x 10^8). There are 4 possible nucleotides: A, C, G and T. How many matches do we expect to find by chance?

Query = 'A'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database       Size                   E-value
RefSeq mRNA    5.0 x 10^8 letters     5.0e-28
nr             1.4 x 10^10 letters    1.4e-26
               (~30 x bigger)

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~5.0 x 10^-19.

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~3.0.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
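To make the point concrete, percent identity can be computed for both example alignments from this slide. The helper below is a small illustrative sketch (its name is hypothetical); it assumes two ungapped, equal-length aligned strings.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned positions that carry the same letter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(percent_identity(query1, match1), percent_identity(query2, match2))
```

Both alignments score the same identity (45 of 60 and 12 of 16 positions), even though the slide assigns them very different E-values. Identity alone carries no information about match length or database size.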

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

BLAST = similarity + probability (match length, PI, size of database...)

ORTHOLOGY = biological knowledge (nature of query sequence, phylogenetic relationship)

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

Scale, from larger/worse to smaller/better:

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

Labels along the scale: fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query 3   SSSSFRAYRAALSEVEPPCI 62
          SSSSFRAYRAALSEVEPPCI
Sbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y
N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W
H         N Q Y
K         R Q E
R         Q K
D         N E
E         D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
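The two protein estimates can be reproduced with a short Python sketch. The substitution groups are transcribed from the Amino Acid Substitutions slide; the dictionary and function names are hypothetical, and the "effective alphabet of 7" is the slide's rough approximation, not a value used by BLAST itself (BLAST scores substitutions with a matrix such as BLOSUM or PAM).

```python
# Allowed substitutes per residue, as listed on the slide.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

average_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)


def expected_protein_matches(db_letters, query_length, effective_alphabet):
    """Database size divided by the effective alphabet size per residue."""
    return db_letters / effective_alphabet ** query_length


exact_only = expected_protein_matches(3.5e8, 20, 20)        # exact matches only
with_substitutions = expected_protein_matches(3.5e8, 20, 7)  # ~3 of 20 residues 'match'
print(average_substitutes, exact_only, with_substitutions)
```

With about 2 substitutes per residue, roughly 3 of the 20 amino acids count as a match at each position, which is where the 1/7 figure comes from.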

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
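As a sketch of how these options fit together, the helper below assembles a blastall command line as a list of arguments without running it. The helper name, the file name query.fasta and the database name refseq_rna are hypothetical placeholders; only the option letters come from the slides.

```python
def build_blastall_command(program, query_file, database, **options):
    """Assemble a legacy 'blastall' command as an argument list.

    Extra keyword arguments become '-<letter> <value>' pairs, e.g. e=1e-5
    becomes '-e 1e-05'. Nothing is executed here.
    """
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical example: blastn with an E-value cutoff and the default word size.
command = build_blastall_command("blastn", "query.fasta", "refseq_rna", e=1e-5, W=11)
print(" ".join(command))
```

On a machine where blastall and a formatted database are actually installed, the list could be passed to subprocess.run; that step is left out here because it depends on the local setup.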

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B: ON / OFF

Results for Exercise 2C: ON / OFF

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117) AGAAAAGAAGAAACATGGCAATGGATCAGAA
           |||||||||||||||| ||||||||||||||
Genomic    AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison; it should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Answers Exercise 1

The Essential Task

experiment / data mining

gene sequence: what is its function?

database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3

Gravin-like

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity, orthologs, same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine - but it is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.

frog protein -> BLAST against database of human proteins -> best match human protein -> BLAST against database of frog proteins -> is the best match the original frog protein?
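The reciprocal test itself is simple once the two sets of BLAST searches have been run. The sketch below assumes the best hit for every protein has already been extracted into two dictionaries; the function name and the protein identifiers are made up for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    return {
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best hits: frog -> human and human -> frog.
frog_to_human = {"frog_gravin": "human_gravin", "frog_cyclinA1": "human_cyclinA2"}
human_to_frog = {"human_gravin": "frog_gravin", "human_cyclinA2": "frog_cyclinA2"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

In this made-up example only the gravin pair passes. The frog cyclin A1 finds human cyclin A2, but the return search lands on a different frog protein, so that pair is not reported.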

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
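The effect of a gap cost can be shown by scoring the overlapping part of the 'cost free' alignment from this slide. The scoring function and the particular reward and penalty values below are illustrative assumptions, not the scheme BLAST uses (BLAST distinguishes gap opening from gap extension costs).

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=0):
    """Sum a simple per-column score over two equal-length aligned strings."""
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # column with a gap in either sequence
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# Overlapping columns of the 'cost free' alignment (unaligned ends left out).
top = "AC--ACT----AGGTCGATGC---GTGG---TG"
bottom = "ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG"

print(score_alignment(top, bottom, gap=0))   # free gaps
print(score_alignment(top, bottom, gap=-2))  # each gap column costs 2
```

With free gaps the 16 matching columns give a positive score, but charging even a modest amount for each of the 17 gap columns pushes the total well below zero, so this alignment would no longer be reported.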

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs, which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Flavour   Query sequence   Other operation                      Database sequences   Speed
BLASTn    nucleotide       (none)                               nucleotide           FAST
BLASTp    protein          (none)                               protein              FAST
BLASTx    nucleotide       6 frame translation of the query     protein              SLOW
tBLASTn   protein          6 frame translation of the database  nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of both          nucleotide           HORRIBLY SLOW
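The '6 frame translation' step used by BLASTx, tBLASTn and tBLASTx can be sketched in a few lines of Python. This is an illustrative implementation using the standard genetic code and upper-case A, C, G, T only; the function names and frame labels are choices made here, not anything produced by BLAST.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of the string ('*' = stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate all three offsets on both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translation("ACGATAGATCCCATCCATAAAT"))
```

Each nucleotide sequence yields six candidate protein sequences, which is one reason the translated flavours are so much slower than BLASTn or BLASTp.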

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
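The indexing idea can be sketched with a dictionary that maps each word to the places it occurs. This is a simplified illustration: it stores the word itself as the key rather than a word number, uses the three example sequences from this slide with a word size of 8, and the function names are hypothetical. The query is the one shown on the "Analyse the Query Sequence" slide.

```python
from collections import defaultdict


def build_word_index(sequences, word_size):
    """Map every word to a list of (sequence id, 1-based position)."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


def find_seed_hits(query, index, word_size):
    """Look up every query word; return (query pos, sequence id, db pos)."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for seq_id, dpos in index.get(word, []):
            hits.append((qpos + 1, seq_id, dpos))
    return hits


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
index = build_word_index(database, word_size=8)
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"
print(find_seed_hits(query, index, word_size=8)[:3])
```

The index is built once per database, so each search only needs dictionary lookups for the query's words instead of a scan of every database sequence.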

Analyse the Query Sequence

QUERY SEQUENCE:

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
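The example alignment on this slide can be checked directly: the longest stretch of consecutive identical positions is shorter than the default BLASTn word size, so no seed word exists along this alignment. The helper below is a small illustrative sketch (its name is hypothetical) and only looks at the positions as aligned on the slide, not at other offsets.

```python
def longest_exact_run(aligned_a, aligned_b):
    """Length of the longest run of consecutive identical aligned positions."""
    longest = current = 0
    for x, y in zip(aligned_a, aligned_b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

identical = sum(x == y for x, y in zip(seq_a, seq_b))
print(identical, len(seq_a), longest_exact_run(seq_a, seq_b))
```

The two sequences are identical at 50 of 56 positions, yet the longest exact run is 10, one short of the word size of 11. Lowering the word size (the -W option) is the usual way to recover matches like this, at the cost of a slower search.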

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query 26  MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196
          MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct 1   MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
          SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct 60  SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip

An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip

Also ldquoexpect valueldquo or ldquoexpectationrdquo

E-values From First Principles

Some database statistics (23rd July 2005)

Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)

Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)

Notation

12e-35 = 12 x 10-35

48 x 106 = 4800000

We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do

Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (50 x 108) 4 = ~12 x 108

Expected number of matches = (50 x 108) (4x 4) = ~31 x 107

Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28

E-value = 50 x 10-28

E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 2e-26 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 6e-25 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 50e-28 -

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (N = any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10 kb by chance (10,000 / 2048 = ~4.9).

If the form with the extra base is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
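The exercise answer can be reproduced with a short Python sketch. It assumes equal base frequencies, ignores edge effects and the reverse strand, and writes the optional extra base as N (any base), which is a notation chosen here rather than taken from the slides. The function names are illustrative only.

```python
import re


def motif_probability(motif):
    """Chance that a random position starts a match to the motif.

    Supports plain bases, [..] character classes and N (any base),
    assuming all four bases are equally likely."""
    prob = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGTN]", motif):
        if token == "N":
            choices = 4
        elif token.startswith("["):
            choices = len(set(token[1:-1]))
        else:
            choices = 1
        prob *= choices / 4
    return prob


def expected_sites(motif, region_len):
    """Expected number of chance occurrences in a region of region_len bases."""
    return region_len * motif_probability(motif)


core = expected_sites("ACC[TG]TA", 10_000)
with_extra_base = expected_sites("ACC[TG]NTA", 10_000)
print(core, with_extra_base, core + with_extra_base)
```

The N position matches every base, so it does not change the probability: both forms are expected once per 2048 nt, and allowing either roughly doubles the count.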

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches for 'A' = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Expected number of matches for 'AC' = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Expected number of matches for 'ACG' = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each database:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28

nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
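To make the point concrete, here is a small Python sketch that counts identical positions in the two ungapped alignments from this slide. The helper name `percent_identity` is illustrative, not part of any BLAST tool.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
MATCH1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
MATCH2 = "ACGCACCTTCGTAGGT"

print(len(QUERY1), percent_identity(QUERY1, MATCH1))
print(len(QUERY2), percent_identity(QUERY2, MATCH2))
```

Both alignments score the same percentage (45 of 60 and 12 of 16 positions), which is exactly why identity on its own cannot separate a convincing match from one expected by chance.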

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

The difficulty is because:

[Diagram labels: ORTHOLOGY; BLAST: similarity + probability; biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database...]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-values, from larger/worse to smaller/better: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

Verdicts along the same scale, from worse to better: fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~(3.5 x 10^8) / 10^26 = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue: residues that can substitute for it ('-' = none)

A: S    C: -    F: LWY    G: -    I: LMV    L: IMFV    M: ILV    P: -    V: ILM    W: FY

N: DHS    Q: REHK    S: ANT    T: S    Y: HFW

H: NQY    K: RQE    R: QK

D: NE    E: DQK

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (about 3 acceptable residues out of 20) rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
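The two protein estimates can be recomputed directly. This sketch uses the database size quoted in the slides and the slide's rough "1 in 7" matching chance; the variable names are illustrative, and real BLAST statistics are derived from substitution-matrix scores rather than from this counting argument.

```python
DB_LETTERS = 347_895_532   # RefSeq protein letters quoted in the slides
QUERY_LEN = 20             # length of the peptide query

# Exact matches only: each position matches with chance 1/20
exact_only = DB_LETTERS / 20 ** QUERY_LEN

# Allowing ~2 acceptable substitutes per residue: chance of about 1/7 per position
with_substitutions = DB_LETTERS / 7 ** QUERY_LEN

print(f"exact matches only:        {exact_only:.1e}")
print(f"with ~1/7 match chance:    {with_substitutions:.1e}")
print(f"ratio between the two:     {with_substitutions / exact_only:.1e}")
```

The ratio between the two figures shows how strongly the per-position matching chance drives the estimate.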

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
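As an illustration of how these options combine, the sketch below assembles a `blastall` command line in Python without running it. The file name `query.fa`, the database name `nr` and the helper `blastall_command` are placeholders chosen for the example; only flags listed above are used, and `-F F` is assumed to be the way to switch the low complexity filter off.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list from the three basic parameters
    plus any extra single-letter options given as keyword arguments."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


# Hypothetical short-query search: relaxed E-value limit, smaller word size,
# gentler mismatch penalty, low complexity filter off
cmd = blastall_command("blastn", "query.fa", "nr", e=100, W=7, q=-1, F="F")
print(shlex.join(cmd))
```

Building the arguments as a list keeps each option and its value separate, so the same list could be handed to `subprocess.run` on a machine where the program and a formatted database are actually installed.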

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Result panels: BLASTn, BLASTx]

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

[Result panels: ON, OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Result panels: ON, OFF]

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

[Result panels: BLASTn, tBLASTx]

Results for Exercise 4 (ii)

[Result panels: Mismatch penalty = -2 (default), Mismatch penalty = -1]

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 34: Bioinformatics Workshop 1 Sequences and Similarity Searches

The Essential Task

experiment / data mining -> gene sequence -> what is its function?

[Diagram: the gene sequence is compared with a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wnt8, Troponin T3); the match highlighted is Gravin-like.]

We can only do this because of implied function based on orthology.

Functional Orthologs

Human gene: function known, annotation 'Gravin' available

Xenopus gene: function unknown

sequence similarity -> orthologs -> same function

But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.

We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.

Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

[Diagram: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
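The reciprocal test itself is simple once the two sets of best hits are available. The sketch below assumes the BLAST searches have already been run and summarised as two dictionaries mapping each protein to its best hit in the other species; the protein names are invented for illustration.

```python
def reciprocal_best_hits(forward_best, reverse_best):
    """Keep only pairs where each protein is the other's best hit.

    forward_best: best hit in species B for each protein of species A
    reverse_best: best hit in species A for each protein of species B"""
    return {
        query: hit
        for query, hit in forward_best.items()
        if reverse_best.get(hit) == query
    }


# Hypothetical best-hit tables (frog vs human, human vs frog)
frog_to_human = {"frog_1": "human_A", "frog_2": "human_B", "frog_3": "human_B"}
human_to_frog = {"human_A": "frog_1", "human_B": "frog_3"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_2 is dropped: its best human match prefers a different frog protein, so the pair fails the reciprocal test and would not be called an ortholog by this method.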

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These, of course, represent chromosomal re-arrangements since these organisms diverged.

[Diagram: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

With a single G inserted into the first sequence, everything downstream is thrown out of register:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
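The trade-off between matches and gaps can be sketched as a scoring function. The version below is a simplification made for illustration: it charges one flat cost per gap character, whereas BLAST distinguishes gap opening from gap extension (the -G and -E options). The function name and default scores are assumptions, and the two strings are the overlapping part of the 'cost free' alignment above.

```python
def alignment_score(top, bottom, match=1, mismatch=-3, gap=-2):
    """Score a gapped alignment column by column.

    '-' marks a gap; every gap character costs the same flat amount."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


TOP = "AC--ACT----AGGTCGATGC---GTGG---TG"
BOTTOM = "ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG"

print("cost-free gaps:", alignment_score(TOP, BOTTOM, gap=0))
print("gap cost of -2:", alignment_score(TOP, BOTTOM, gap=-2))
```

With free gaps the 16 matched columns give a positive score; once each of the 17 gap characters carries a cost, the same alignment scores well below zero and would not be reported.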

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

Flavour   Query sequence                    Database sequences                Speed
BLASTn    nucleotide                        nucleotide                        FAST
BLASTp    protein                           protein                           FAST
BLASTx    nucleotide, 6 frame translation   protein                           SLOW
tBLASTn   protein                           nucleotide, 6 frame translation   SLOWER
tBLASTx   nucleotide, 6 frame translation   nucleotide, 6 frame translation   HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT

Example protein query: MQWCGYRWTYQGYRW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence:

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Animation: the query is slid along the database sequence one position at a time; most offsets give only scattered matches.]

The best alignment needs one gap in the database sequence:

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
...
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
...
GACAAATC 33568
GACAAATG 33569
...
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
...
3         15        33568
3         16        07967
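The indexing step can be sketched in a few lines of Python. For simplicity this version keys the index on the word itself rather than on a word number, uses the three database sequences from the slide, and reports 1-based positions; the function name and the word size of 8 (matching the slide's example words) are choices made for this illustration and say nothing about how BLAST stores its index internally.

```python
from collections import defaultdict


def build_word_index(sequences, word_size=8):
    """Map every word of length word_size to the (sequence id, position)
    pairs where it occurs; positions are 1-based."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


DATABASE = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}

index = build_word_index(DATABASE)
print(index["GACAAATC"])   # where does this word occur?
print(index["ACAAATCC"])
```

The index is built once per database, so looking up each word of a query afterwards is a single dictionary access rather than a scan of every sequence.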

Analyse the Query Sequence

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse QUERY SEQUENCE against the same numbered list of all possible 'words':

position  word
1         14236
2         33568
3         07967

Index of database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
...
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
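The seed-and-extend idea, and the way a large word size can miss a real alignment, can be illustrated with the two sequences above. This is a toy sketch: it finds shared exact words and extends them only while letters keep matching, whereas BLAST's real extension tolerates mismatches and gaps using a score. All function names are invented for the example.

```python
def find_seeds(query, target, word_size):
    """Return 0-based (query_pos, target_pos) pairs where a whole word is shared."""
    words = {}
    for j in range(len(target) - word_size + 1):
        words.setdefault(target[j:j + word_size], []).append(j)
    return [
        (i, j)
        for i in range(len(query) - word_size + 1)
        for j in words.get(query[i:i + word_size], [])
    ]


def extend_seed(query, target, i, j, word_size):
    """Grow an exact seed left and right while the letters stay identical.
    Returns (query_start, target_start, length)."""
    start_q, start_t = i, j
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = i + word_size, j + word_size
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return start_q, start_t, end_q - start_q


QUERY = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
SUBJECT = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

for word_size in (11, 7):
    seeds = find_seeds(QUERY, SUBJECT, word_size)
    print(word_size, "seeds found:", len(seeds))
```

In the alignment shown on the slide the longest run of identical letters is 10, so no word of 11 can seed it, while a word size of 7 finds several starting points along the same alignment.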

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97
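The summary figures in this output are simple counts over the alignment columns, and they can be pulled out and re-derived with a short Python sketch. It assumes the statistics line is written with its usual punctuation, as in `STATS_LINE` below; the parser name is made up for the example. Truncating each percentage to a whole number reproduces the three values reported here.

```python
import re

STATS_LINE = "Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)"


def parse_alignment_stats(line):
    """Extract 'name = hits/length' pairs and recompute each whole-number percentage."""
    stats = {}
    for name, hits, length in re.findall(r"(\w+) = (\d+)/(\d+)", line):
        hits, length = int(hits), int(length)
        stats[name] = (hits, length, 100 * hits // length)
    return stats


for name, (hits, length, percent) in parse_alignment_stats(STATS_LINE).items():
    print(f"{name}: {hits} of {length} columns ({percent}%)")
```

Identities count columns with the same residue, Positives additionally count the '+' columns where the substitution scores favourably, and Gaps count the columns containing a '-'.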

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches for 'A' = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Expected number of matches for 'AC' = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Expected number of matches for 'ACG' = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28

E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 2e-26 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 6e-25 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 50e-28 -

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence

How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA

E-value Exercise AnswerACC[TG]TA

Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt

Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt

Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-valueTo allow either means we expect twice as many

E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (14 x 1010) 4 = ~12 x 108

Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107

Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26

E-value = 14 x 10-26

(was E-value = 50 x 10-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

14 x 1010 letters

50 x108 letters

30 x bigger

BLAST the same sequenceagainst each E-value = 14e-26

E-value = 50e-28

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of

20 x 10-26 - about 40x larger - why is this

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our ‘match’ criteria

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger

E-values: Effect of Query Length

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Why not just use identity?
At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
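Both identity figures can be verified with a few lines of Python. This is a small sketch for ungapped alignments of equal length only; the function name `percent_identity` is an assumption made for illustration.

```python
def percent_identity(top: str, bottom: str) -> float:
    """Percentage of positions that are identical in an ungapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(top, bottom))
    return 100.0 * matches / len(top)


# Query1: the 60 nt alignment from the slide.
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

# Query2: the 16 nt alignment from the slide.
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)
identity2 = percent_identity(query2, match2)
print(identity1, identity2)
```

Identity alone says nothing about length, which is why the two matches have such different E-values despite the same score here.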

So what’s the real problem?
Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

[Diagram labels: BLAST; ORTHOLOGY; BLAST: Similarity + Probability; biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database…]

Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?

larger/worse ... smaller/better

E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

fantasy, borderline, encouraging, pretty good, can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST
It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence
Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Substitutes
A        S
C
F        L W Y
G
I        L M V
L        I M F V
M        I L V
P
V        I L M
W        F Y
N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W
H        N Q Y
K        R Q E
R        Q K
D        N E
E        D Q K

In fact we need to take into account both amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ rather than 1/20th

So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value
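The two protein estimates can be reproduced with a short Python sketch. It assumes the same first-principles model as the slides (database letters divided by the number of equally likely choices per position, raised to the query length); the function name `chance_matches` is invented here and is not how BLAST itself computes E-values.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # size quoted in the slide


def chance_matches(db_letters: int, query_length: int, choices_per_position: int) -> float:
    """Expected number of chance matches under the simple first-principles model."""
    return db_letters / choices_per_position ** query_length


# Exact matching: each position must be 1 of 20 amino acids.
exact_e_value = chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)

# Allowing substitutions: roughly a 1-in-7 chance of 'matching' per position.
tolerant_e_value = chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact match model:        {exact_e_value:.1e}")
print(f"substitution-aware model: {tolerant_e_value:.1e}")
```

The exact-match figure comes out slightly below the slide's rounded ~3.5 x 10^-18, because 20^20 is a little more than 10^26.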

These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database

Did you find any ‘significant’ hits?

Repeat with a second sequence

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database

Is there any general difference?

Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program ‘blastall’ takes a number of different options, or parameters, each indicated by a -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary

Here are some of the ones that I have used

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of ‘gi’ numbers
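When many searches have to be run, the command line can be assembled from a script. The sketch below only builds the argument list for the legacy `blastall` program using flags from the list above (-p, -i, -d, -e, -W); it does not run anything. The helper name `blastall_command`, the file name `query.fa` and the database name `nt` are placeholders chosen for illustration.

```python
import shlex
from typing import List, Optional


def blastall_command(program: str, query_file: str, database: str,
                     expect: Optional[float] = None,
                     word_size: Optional[int] = None) -> List[str]:
    """Build a blastall argument list; unset options fall back to BLAST's defaults."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if expect is not None:
        args += ["-e", str(expect)]      # max expected value to report
    if word_size is not None:
        args += ["-W", str(word_size)]   # word size
    return args


# Hypothetical example: a short-sequence search with a relaxed E-value and small words.
command = blastall_command("blastn", "query.fa", "nt", expect=100, word_size=7)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command)` would execute the search on a machine where `blastall` and a formatted database are installed.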

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default options
Then hit [format] to wait for the results in a new page
(hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box

Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section. And then run BLASTx against the nr database

What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time
Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C
C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware

Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn – low complexity filtering ON

Results for Exercise 2B
[Panels: ON, OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Panels: ON, OFF]

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT

Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window

(i) Run the default comparison; it should be BLASTn. Note the alignment
Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties…
What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page
Go to the PROTEIN BLAST section, BLASTp, and paste the sequence

Run the search with the default values

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page
Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What’s in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • ‘Other’-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST – Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what’s the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 35: Bioinformatics Workshop 1 Sequences and Similarity Searches

Functional Orthologs

Human gene: function known, annotation ‘Gravin’ available
Xenopus gene: function unknown

[Diagram labels: sequence similarity, orthologs, same function, similar shape]

But we know that function is largely determined by shape, which in general we cannot determine – but it is probably SHAPE not SEQUENCE that is conserved

We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence

Finding Orthologs
So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in

frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x
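The reciprocal best BLAST test can be sketched in Python once the best hit in each direction is known. This is an illustration only: the dictionaries stand in for parsed BLAST results, and all protein names (`frog_A`, `human_1` and so on) are hypothetical.

```python
from typing import Dict


def reciprocal_best_hits(forward_best: Dict[str, str],
                         reverse_best: Dict[str, str]) -> Dict[str, str]:
    """Keep query/hit pairs where each protein is the other's best match."""
    return {
        query: hit
        for query, hit in forward_best.items()
        if reverse_best.get(hit) == query
    }


# Best human hit for each frog protein (hypothetical results).
frog_vs_human = {"frog_A": "human_1", "frog_B": "human_2"}

# Best frog hit for each human protein (hypothetical results).
human_vs_frog = {"human_1": "frog_A", "human_2": "frog_C"}

candidate_orthologs = reciprocal_best_hits(frog_vs_human, human_vs_frog)
print(candidate_orthologs)
```

Here `frog_B` is dropped because its best human match points back to a different frog protein, which is the situation the method is designed to catch.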

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections
These of course represent chromosomal re-arrangements since these organisms diverged

Human chromosome 5

Mouse chromosome 10

Mouse chromosome 2

Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || ||| | | | |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop?

The Downside of Gaps
Take two random sequences with no ‘real’ similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow ‘cost free’ gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | | | | | ||| |||| ||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’

We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps …
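The trade-off between matches and gap costs can be made concrete with a small scoring function. This Python sketch assumes a simple linear gap cost, whereas BLAST uses separate gap opening and gap extension costs; the values (+1 match, -3 mismatch, -5 per gap) are illustrative, with the match and mismatch values chosen to resemble the BLASTn settings mentioned in this workshop.

```python
def alignment_score(top: str, bottom: str,
                    match: int = 1, mismatch: int = -3, gap: int = -5) -> int:
    """Score an alignment column by column; '-' marks a gap in either row."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must be the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# A perfect 12 nt match, and the same match interrupted by a single gap.
perfect = alignment_score("ACGTACGTACGT", "ACGTACGTACGT")
one_gap = alignment_score("ACGTACAGTACGT", "ACGTAC-GTACGT")

# With 'cost free' gaps the gapped alignment scores as well as the perfect one.
free_gap = alignment_score("ACGTACAGTACGT", "ACGTAC-GTACGT", gap=0)

print(perfect, one_gap, free_gap)
```

With a gap cost in place the gapped alignment is penalised relative to the perfect match, so an alignment padded with many gaps can no longer look as good as a real one.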

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily most widely used, is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Program   Query sequence   Other operation                              Database sequences   Speed
BLASTn    nucleotide       none                                         nucleotide           FAST
BLASTp    protein          none                                         protein              FAST
BLASTx    nucleotide       6 frame translation of query                 protein              SLOW
tBLASTn   protein          6 frame translation of database              nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database    nucleotide           HORRIBLY SLOW

How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

[Figure: the query is slid along the 1st database sequence and the matches are counted at each offset; the best alignment needs a single gap (=). The sequences shown are:]

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences
1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible ‘words’
AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database
sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
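The indexing idea can be sketched in a few lines of Python. This is a toy illustration, not BLAST's actual data structure: it stores the words themselves rather than word numbers, uses the 8-letter words shown on the slide, and the function name `build_word_index` is invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_word_index(sequences: List[str], word_size: int = 8) -> Dict[str, List[Tuple[int, int]]]:
    """Map every word to the (sequence number, position) pairs where it occurs (1-based)."""
    index: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for seq_number, seq in enumerate(sequences, start=1):
        for start in range(len(seq) - word_size + 1):
            index[seq[start:start + word_size]].append((seq_number, start + 1))
    return index


# The three database sequences from the slide.
database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
index = build_word_index(database)

# Look up each word of the query: (query position, (sequence number, position)).
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"
hits = [
    (start + 1, hit)
    for start in range(len(query) - 8 + 1)
    for hit in index.get(query[start:start + 8], [])
]
print(hits[0])
```

The index is built once per database, so each query only costs a series of lookups instead of a scan through every sequence.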

Analyse the Query Sequence

QUERY SEQUENCE
>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible ‘words’
AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE
position  word
1         14236
2         33658
3         07967

Index of database
sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches
We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position

Next the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
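Why this alignment would be missed can be checked directly: no stretch of identical letters is as long as the word size. The Python sketch below measures the longest exact run in an ungapped alignment; the function name `longest_exact_run` is an assumption for illustration.

```python
def longest_exact_run(top: str, bottom: str) -> int:
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for a, b in zip(top, bottom):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


# The two aligned sequences from the slide.
top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

WORD_SIZE = 11  # BLASTn default quoted above
longest = longest_exact_run(top, bottom)
print(longest, "seed found" if longest >= WORD_SIZE else "no seed: alignment missed")
```

Lowering the word size would let the search seed on one of the shorter runs, at the cost of more seeds to extend.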

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also “expect value” or “expectation”

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is of course what we mostly do

Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides – A, C, G, T. How many matches do we expect to find by chance?

Query = ‘A’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ – 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
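The first-principles calculation is easy to reproduce in Python. This sketch assumes the model used in this section (all four nucleotides equally likely, exact ungapped matches only), which is a simplification of what BLAST really does; the function name `expected_chance_matches` is made up for the example.

```python
REFSEQ_MRNA_LETTERS = 503_566_580  # from the database statistics (~5.0 x 10^8)


def expected_chance_matches(db_letters: int, query_length: int, alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches: database size / alphabet_size ** length."""
    return db_letters / alphabet_size ** query_length


# Expected matches for 1-mer, 2-mer, 3-mer and 60-mer queries.
expectations = {
    length: expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    for length in (1, 2, 3, 60)
}
for length, value in expectations.items():
    print(f"{length:>2}-mer: {value:.1e}")
```

The exact value of 4^60 is about 1.3 x 10^36, so the computed 60-mer figure is slightly smaller than the slide's result, which rounds the divisor to 10^36.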

E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer
ACC[TG]TA

Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4 x 4 x 4 = 64 nt

Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64 x 2 nt = 128 nt

Expect ‘TA’ every 4 x 4 = 16 nt
Expect ‘ACC[TG]TA’ every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value
To allow either means we expect twice as many
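The answer can be reproduced with a short Python sketch. It assumes equal base frequencies and, like the slide, approximates the number of possible start positions by the sequence length. Representing the motif as a list of allowed bases per position is a choice made here for illustration.

```python
from fractions import Fraction
from typing import List


def motif_probability(allowed_bases: List[str]) -> Fraction:
    """Chance that a random site matches, given the allowed bases at each position."""
    probability = Fraction(1)
    for bases in allowed_bases:
        probability *= Fraction(len(bases), 4)
    return probability


# ACC[TG]TA: one allowed base everywhere except the 4th position (T or G).
motif = ["A", "C", "C", "TG", "T", "A"]
site_probability = motif_probability(motif)

PROMOTER_LENGTH = 10_000
expected_sites = float(PROMOTER_LENGTH * site_probability)

print(f"one site every {1 / site_probability} nt")
print(f"expected in 10k: {expected_sites:.1f}")
print(f"if a second, equally likely motif is also allowed: {2 * expected_sites:.1f}")
```

Using `Fraction` keeps the 1-in-2048 figure exact, so it can be compared directly with the hand calculation.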

E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (14 x 1010) 4 = ~12 x 108

Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107

Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26

E-value = 14 x 10-26

(was E-value = 50 x 10-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

14 x 1010 letters

50 x108 letters

30 x bigger

BLAST the same sequenceagainst each E-value = 14e-26

E-value = 50e-28

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of

20 x 10-26 - about 40x larger - why is this

Gapped alignments

If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc|||||||||||| |||||| |||||| |||||| ||||||ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identityAt some levels this a good question

But consider two very different searches both of which give a 75 identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip

So what's the real problem?

Are there any useful guidelines, though, at least for biological meaningfulness?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

[Diagram labels: ORTHOLOGY; BLAST (similarity + probability); biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database…]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value scale, from larger/worse to smaller/better: 10^-5, 10^-10, 10^-40, 10^-100, 0.0

Labels along the scale, in the same direction: fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its previous value (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue: residues that can substitute for it

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids count as a 'match' at each position: roughly a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
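The two back-of-envelope estimates above can be reproduced in a few lines. This is only a sketch of the slide's naive model (database letters multiplied by a per-position chance of 'matching'), not how BLAST itself computes an E-value; the function name is made up for illustration.

```python
DB_LETTERS = 3.5e8   # ~347895532 letters in the RefSeq protein database quoted above
QUERY_LENGTH = 20    # the 20-residue query

def naive_expected_matches(db_letters, per_position_chance, length):
    """Naive model: every start position matches with probability chance**length."""
    return db_letters * per_position_chance ** length

exact_only = naive_expected_matches(DB_LETTERS, 1 / 20, QUERY_LENGTH)
with_substitutions = naive_expected_matches(DB_LETTERS, 1 / 7, QUERY_LENGTH)

# 20**20 is about 1.05e26, which the slide rounds to 1e26 to get ~3.5 x 10^-18.
print(f"exact matches only: {exact_only:.1e}")
print(f"allowing ~2 substitutes per residue: {with_substitutions:.1e}")
```

Moving from a 1/20 to a 1/7 chance per position raises the estimate by roughly nine orders of magnitude, which is the point the slide is making.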

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
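As a sketch of how such a command line is put together, the helper below only assembles the argument list and prints it; nothing is executed. The flags are the ones listed above, while the function name, the query file 'query.fa' and the database name 'refseq_mrna' are hypothetical placeholders.

```python
import shlex

def build_blastall_command(program, query_file, database, **options):
    """Return a blastall argument list; single-letter options are passed as keywords."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command

# Hypothetical file and database names; -e, -W and -F come from the list above.
cmd = build_blastall_command("blastn", "query.fa", "refseq_mrna", e=1e-5, W=7, F="F")
print(shlex.join(cmd))
# blastall -p blastn -i query.fa -d refseq_mrna -e 1e-05 -W 7 -F F
```

Any option left out simply falls back to the default for the chosen flavour, as described above.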

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful).

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 [figures: BLASTn results; BLASTx results]

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF) [figure: BLASTn, low complexity filtering OFF]

Results for Exercise 2A (ON) [figure: BLASTn, low complexity filtering ON]

Results for Exercise 2B [figures: filtering ON; filtering OFF]

Results for Exercise 2C

There is a sequence error, an extra G at position 117 in the sequence:

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[figures: filtering ON; filtering OFF]

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i) [figures: BLASTn; tBLASTx]

Results for Exercise 4 (ii) [figures: mismatch penalty = -2 (default); mismatch penalty = -1]

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Finding Orthologs

So how do we find orthologs, and can we know when we have?

The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.

[Diagram: frog protein -> BLAST against database of human proteins -> best match human protein -> BLAST against database of frog proteins -> best match should be the original frog protein]
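The reciprocal best BLAST idea can be sketched as a small function over two tables of hits. This assumes the BLAST searches have already been run in both directions and summarised as {query: {subject: E-value}}; the protein names and E-values below are invented purely for illustration.

```python
def best_hit(hits):
    """Return the subject with the smallest E-value, or None if there are no hits."""
    if not hits:
        return None
    return min(hits, key=hits.get)

def reciprocal_best_hits(forward, reverse):
    """Keep only pairs where each protein is the other's best hit."""
    pairs = {}
    for frog, hits in forward.items():
        human = best_hit(hits)
        if human is not None and best_hit(reverse.get(human, {})) == frog:
            pairs[frog] = human
    return pairs

# Invented example data: frog proteins searched against human, and the reverse.
frog_vs_human = {
    "frogA": {"humanA": 1e-80, "humanB": 1e-20},
    "frogB": {"humanA": 1e-30},
}
human_vs_frog = {
    "humanA": {"frogA": 1e-75, "frogB": 1e-28},
    "humanB": {"frogA": 1e-18},
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))  # {'frogA': 'humanA'}
```

Here frogB's best human hit is humanA, but humanA's best frog hit is frogA, so frogB gets no reciprocal partner. This is also why the method needs complete protein sets on both sides: a missing protein can silently promote the wrong 'best' hit.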

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.

[Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments:

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
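The effect of a gap cost can be sketched with a small global alignment scorer. This is a plain Needleman-Wunsch score with a linear gap cost, chosen here only because it is short; it is not the scheme BLAST uses (BLAST has separate gap opening and extension costs), and the scoring values are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score with a linear gap cost (score only, no traceback)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

# The two random sequences from the slide.
seq1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free = global_alignment_score(seq1, seq2, gap=0)     # 'cost free' gaps
costly = global_alignment_score(seq1, seq2, gap=-2)  # every gap position costs 2
print(free, costly)
```

With free gaps the score is simply the largest number of letters that can be lined up in order, so even random sequences look good; once gaps cost something the score for the same pair drops sharply.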

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query + database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

Flavour: query sequence / other operation / database sequences (speed)

BLASTn: nucleotide query / none / nucleotide database (FAST)
BLASTp: protein query / none / protein database (FAST)
BLASTx: nucleotide query / 6 frame translation of the query / protein database (SLOW)
tBLASTn: protein query / 6 frame translation of the database / nucleotide database (SLOWER)
tBLASTx: nucleotide query / 6 frame translation of both query and database / nucleotide database (HORRIBLY SLOW)
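The '6 frame translation' step used by the translated flavours can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This assumes the standard genetic code and unambiguous A/C/G/T input; the function names are illustrative and this is not BLAST's own code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translation(dna):
    """Return the six translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

# A short nucleotide query of the kind shown on the slide.
for frame, protein in six_frame_translation("ACGATAGATCCCATCCATAAAT").items():
    print(frame, protein)
```

Every nucleotide sequence becomes six protein sequences, which is a large part of why the translated searches are slower: tBLASTx has to do this for the query and for every database sequence.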

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

1st database sequence:
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

[Figure: the database sequence is slid along the query and the matches are marked at each offset; the best alignment places a single gap in the query]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
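The indexing idea can be sketched with a dictionary that maps each word to the places it occurs. This stores the words themselves rather than word numbers and uses a word size of 8 to match the 8-letter words on the slide; the function names are illustrative and real BLAST indexes are far more compact.

```python
from collections import defaultdict

def build_word_index(sequences, word_size):
    """Map every word to the (sequence number, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for number, sequence in sequences.items():
        for start in range(len(sequence) - word_size + 1):
            index[sequence[start:start + word_size]].append((number, start + 1))
    return index

def find_seeds(query, index, word_size):
    """List (query position, sequence number, database position) for every shared word."""
    seeds = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for number, position in index.get(word, []):
            seeds.append((start + 1, number, position))
    return seeds

# The three database sequences and the query from the slides.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
index = build_word_index(database, word_size=8)
seeds = find_seeds("AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA", index, word_size=8)
print(index["GACAAATC"])  # [(1, 1)]
print(seeds[:3])
```

The index is built once per database; each query then costs only one lookup per word, which is where the speed comes from.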

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
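A quick check of the example above: the longest stretch of identical letters in that alignment is shorter than the default word size, so no word match exists to start from. The sketch only looks along the alignment as drawn (same positions in both sequences), which is a simplification; the function names are illustrative.

```python
def longest_exact_run(a, b):
    """Length of the longest run of identical letters at the same positions."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest

def would_be_seeded(a, b, word_size):
    """True if the ungapped alignment contains at least one exact word-size match."""
    return longest_exact_run(a, b) >= word_size

# The two sequences from the slide, 50 of 56 positions identical.
top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

print(longest_exact_run(top, bottom))    # 10
print(would_be_seeded(top, bottom, 11))  # False
print(would_be_seeded(top, bottom, 7))   # True
```

Almost 90% identical, yet invisible at word size 11; dropping the word size to 7 would pick it up, at the cost of a slower search.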

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28 (4 multiplied by itself 60 times is roughly 10^36)

E-value = 5.0 x 10^-28
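The same first-principles arithmetic as a sketch in code. It follows the slide's naive model (database letters divided by 4 for every query base) and is not the statistic BLAST actually reports; the function name is illustrative.

```python
def expected_exact_matches(db_letters, query_length, alphabet_size=4):
    """Naive expectation: database letters divided by alphabet_size**query_length."""
    return db_letters / alphabet_size ** query_length

REFSEQ_MRNA_LETTERS = 5.0e8  # ~503566580 letters
NR_LETTERS = 1.4e10          # ~14601814750 letters

for length in (1, 2, 3, 60):
    print(length, f"{expected_exact_matches(REFSEQ_MRNA_LETTERS, length):.2e}")

# The exact value of 4**60 is about 1.3e36, so the 60-mer comes out near 3.8e-28;
# the slide rounds 4**60 down to 10^36, which gives the quoted 5.0 x 10^-28.
ratio = expected_exact_matches(NR_LETTERS, 60) / expected_exact_matches(REFSEQ_MRNA_LETTERS, 60)
print(ratio)  # the expectation scales with database size
```

In this model the expectation is directly proportional to the number of letters searched, so the same match looks about 28 times 'worse' against nr than against RefSeq mRNA.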

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value for the RefSeq mRNA search was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
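The answer above can be checked with a short calculation. The motif is described by how many of the four bases are allowed at each position; like the slide, this sketch ignores edge effects at the ends of the sequence and assumes all bases are equally likely. The function name is illustrative.

```python
def expected_motif_count(allowed_per_position, sequence_length):
    """Expected chance occurrences of a motif, given the bases allowed at each position."""
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / 4
    return sequence_length * probability

# ACC[TG]TA: one allowed base at every position except [TG], which allows two.
acc_tg_ta = [1, 1, 1, 2, 1, 1]
print(expected_motif_count(acc_tg_ta, 10_000))  # 4.8828125, i.e. ~5 per 10k
```

If a second motif with the same chance expectation is also accepted, the two expectations simply add, giving about twice as many chance hits.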

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:

RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identityAt some levels this a good question

But consider two very different searches both of which give a 75 identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip

So whatrsquos the real problemBasically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

Are there any useful guidelines though at least for biological meaningfulness

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length PI size of databasehellip

Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog

largerworse smallerbetter

E-values 10-5 10-10 10-40 10-100 00

fantasy borderline encouraging

pretty good canrsquot get

better

But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip

Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18

But this is what we get if we run the blast at NCBI

Score = 431 bits (100) Expect = 8e-04 Identities = 2020 (100) Positives = 2020 (100) Gaps = 020 (0) Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCISbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand wavinghellip

Amino Acid Substitutions

A SC F LWYG I LMVL IMFVM ILVP V ILMW FY

N DHSQ REHKS ANTT SY HFW

H NQYK RQER QK

D NEE DQK

In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th

So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices

ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database

Did you find any lsquosignificantrsquo hits

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if they are not listed explicitly they take a default value appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful at anything but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
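As a rough illustration of how these options combine, the sketch below assembles a blastall command line as a list of arguments. It uses only the flags listed above (-p, -i, -d, -e, -W, -q); the helper name, the query file name "morpholino.fa" and the database name "nt" are placeholders invented for the example, and nothing is actually executed.

```python
def build_blastall_command(program, query_file, database, **options):
    """Return the argument list for a blastall run (nothing is executed)."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Each extra option is a single-letter flag followed by its value.
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical file and database names; the option values are the ones
# used later in the word-size exercise (Expect 100, word size 7, -q -1).
cmd = build_blastall_command("blastn", "morpholino.fa", "nt", e=100, W=7, q=-1)
print(" ".join(cmd))
```

A list built this way could be passed to subprocess.run on a machine where the legacy blastall program and a pre-indexed database are installed.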

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Screenshots: BLASTn results; BLASTx results]

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Screenshot: BLASTn, low complexity filtering OFF]

Results for Exercise 2A (ON)

[Screenshot: BLASTn, low complexity filtering ON]

Results for Exercise 2B

[Screenshots: filtering ON; filtering OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Screenshots: filtering ON; filtering OFF]

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect that we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

[Screenshots: BLASTn; tBLASTx]

Results for Exercise 4 (ii)

[Screenshots: mismatch penalty = -2 (default); mismatch penalty = -1]

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1: Sequences and Similarity Searches
  • The Basic Questions
  • Part 1: Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols, Names, Etc.
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now?
  • Part 2: Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1: nucleotide vs amino acid search
  • Answers: Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3: Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work?
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant?
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values: Effect of Database Size
  • Why were the values different?
  • E-values: Effect of Query Length
  • Why not just use identity?
  • So what's the real problem?
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4: Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • END

Using Synteny is Better

We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.

And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The breaks between these sections of course represent chromosomal rearrangements since these organisms diverged.

[Diagram: human chromosome 5 aligned against syntenic sections of mouse chromosome 10 and mouse chromosome 2]

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar: http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again.

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as the best top hit, and have a look at the Metazome alignment(s)...

Part 3: Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.

Without the insertion:

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

With a single G inserted in the first sequence, everything downstream of the insertion shifts out of register:

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences:

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
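The trade-off between matches and gap costs can be sketched as a simple column-by-column score. The scoring values below (+1 match, -1 mismatch, -2 per gap column) are arbitrary choices for illustration, not BLAST's defaults, and the function only scores an alignment that has already been made; it does not search for the best one. The slide's '=' gap characters are written as '-' here, and the unpaired overhangs of the random alignment are left out.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned strings; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# The two 'obviously similar' sequences, with one gap in each.
real_top = "TTCCCAACTCTCCTCTTT-CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA"
real_bottom = "TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC-CCCCAAAATCAAGCGCACCCCGTCCCAGAA"

# The random sequences with 'cost free' gaps (overlapping columns only).
random_top = "GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA"[1:34]
random_bottom = "ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG"

print("real alignment, gaps cost 2:  ", alignment_score(real_top, real_bottom))
print("random alignment, gaps free:  ", alignment_score(random_top, random_bottom, gap=0))
print("random alignment, gaps cost 2:", alignment_score(random_top, random_bottom))
```

With free gaps the random alignment looks respectable; once each gap column costs something, its score drops below zero while the biologically plausible alignment stays strongly positive.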

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query and database -> COMPARE -> LIST MATCHES]

Flavours of BLAST

Flavour   Query sequence                   Other operation                    Database sequences              Speed
BLASTn    nucleotide (ACGATAGATCC...)      none                               nucleotide (ATGACGATAGA...)     FAST
BLASTp    protein (MQWCGYRWTYQGYRW)        none                               protein (MKJLSPWERSY...)        FAST
BLASTx    nucleotide (ACGATAGATCC...)      6-frame translation of the query   protein (MKJLSPWERSY...)        SLOW
tBLASTn   protein (MQWCGYRWTYQGYRW)        6-frame translation of database    nucleotide (ATGACGATAGA...)     SLOWER
tBLASTx   nucleotide (ACGATAGATCC...)      6-frame translation of both        nucleotide (ATGACGATAGA...)     HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

Query: CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

Against the 1st database sequence, one offset matches the first half of the query:

db     CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC
            |||||||||||||||||||||||||      |   | |   |  |   |
query       CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

Another offset matches the second half:

db     CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC
             |     |     | |  |      |||||||||||||||||||||||||
query      CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

And with a single gap, both halves match:

db     CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
            ||||||||||||||||||||||||| |||||||||||||||||||||||||
query       CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1  GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2  TAAGCAAATTTAATTTTGTTTACATTTTC
3  GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
...
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
...
GACAAATC  33568
GACAAATG  33569
...
TCCAAACC  64321
TCCAAACC  64322

Build a position index of all words in the database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
...
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
...
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
...
GACAAATC  33568
GACAAATG  33569
...
TCCAAACC  64321
TCCAAACC  64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33568
3         07967

Index of database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
...
3         15        33568
3         16        07967

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
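The word-and-index idea can be sketched in a few lines of Python using the three database sequences and the query from the slides. This is a toy version under simplifying assumptions: it stores the 8-letter words themselves as dictionary keys rather than the word numbers shown above, it only finds exact word matches, and it stops at the seeding step without extending the alignments. The function names are invented for the example.

```python
from collections import defaultdict


def build_word_index(sequences, word_size):
    """Map every word to the (sequence number, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_number, seq in enumerate(sequences, start=1):
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_number, pos + 1))
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence number, database position) word hits."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for seq_number, dpos in index.get(word, []):
            seeds.append((qpos + 1, seq_number, dpos))
    return seeds


database = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)   # done once, in advance
seeds = find_seeds(query, index, word_size=8)     # fast lookup per query
print(seeds[:3])
```

The index is built once for the whole database, so each new query only costs a series of dictionary lookups; only the sequences that produce seeds need to be examined further.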

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
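The example above can be checked directly: the sketch below measures the longest run of consecutive identical positions between the two sequences and compares it with the word size. The helper name is invented for illustration, and the check assumes the two sequences are aligned position by position without gaps, as drawn above.

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

matches = sum(x == y for x, y in zip(top, bottom))
longest = longest_exact_run(top, bottom)

print(f"identity: {matches}/{len(top)}")
print(f"longest exact run: {longest} (word size 11 needs a run of at least 11)")
```

The two sequences are nearly 90% identical, yet no stretch is long enough to produce an 11-letter word hit, so a word size of 11 would never seed this alignment.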

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

Also "expect value" or "expectation".

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
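The same arithmetic can be written as a one-line function. This is a sketch of the first-principles model only (exact matches, no gaps, all four nucleotides equally likely); the function name is invented for the example. The slide rounds 4^60 to 10^36, so the exact 60-mer figure printed here is a little below 5.0 x 10^-28.

```python
def expected_exact_matches(db_letters, query_length, alphabet_size=4):
    """Expected number of chance exact matches of a query in a database."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size used in the slides

for length in (1, 2, 3, 60):
    expected = expected_exact_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: {expected:.1e} expected matches")
```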

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (where N is any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.

Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 = 128 nt.

Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).

We would expect ~5 of these promoter sites every 10k by chance.

If the variant with the additional base is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
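The exercise answer can be reproduced with a short calculation. The sketch assumes each base is equally likely and independent, and represents the motif as a list of the bases allowed at each position; reading the optional extra base as "any base" (written "ACGT" below) is an interpretation, chosen because it gives the two motifs the same expected frequency, as the answer states.

```python
def expected_spacing(motif):
    """Average number of nt between chance occurrences of a motif.

    `motif` lists the bases allowed at each position, e.g. "TG" for [TG].
    """
    spacing = 1.0
    for allowed in motif:
        spacing *= 4 / len(allowed)
    return spacing


PROMOTER_LENGTH = 10_000

basic = ["A", "C", "C", "TG", "T", "A"]             # ACC[TG]TA
variant = ["A", "C", "C", "TG", "ACGT", "T", "A"]   # extra base, any of A/C/G/T

expected_basic = PROMOTER_LENGTH / expected_spacing(basic)
expected_either = expected_basic + PROMOTER_LENGTH / expected_spacing(variant)

print(f"ACC[TG]TA: one every {expected_spacing(basic):.0f} nt, "
      f"~{expected_basic:.1f} per 10k")
print(f"either motif: ~{expected_either:.1f} per 10k")
```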

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

The nr database is ~30 times bigger, and so the E-value is ~30 times bigger.
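The proportionality can be checked against the actual BLAST results quoted earlier (Expect = 2e-26 against RefSeq mRNA, 6e-25 against nr). The sketch below simply rescales an E-value by the ratio of database sizes; the function name is invented for illustration, and real BLAST statistics involve further corrections that this ignores.

```python
REFSEQ_LETTERS = 503_566_580      # NCBI RefSeq mRNA, 23rd July 2005
NR_LETTERS = 14_601_814_750       # NCBI nr, 23rd July 2005


def rescale_evalue(evalue, old_db_letters, new_db_letters):
    """Scale an E-value in proportion to the change in database size."""
    return evalue * new_db_letters / old_db_letters


size_ratio = NR_LETTERS / REFSEQ_LETTERS
predicted_nr = rescale_evalue(2e-26, REFSEQ_LETTERS, NR_LETTERS)

print(f"nr is {size_ratio:.0f}x larger than RefSeq mRNA")
print(f"predicted E-value against nr: {predicted_nr:.1e} (BLAST reported 6e-25)")
```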

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~5.0 x 10^-19.

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~30.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
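To confirm that the two matches really do share the same identity, the percentage can be computed directly from the aligned sequences. This is a minimal sketch for ungapped, equal-length alignments; the function name is invented for the example.

```python
def percent_identity(a, b):
    """Percentage of identical positions in two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
hit1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

query2 = "ACGTACGTACGTACGT"
hit2 = "ACGCACCTTCGTAGGT"

print(f"Query1: {len(query1)} nt, {percent_identity(query1, hit1):.0f}% identity")
print(f"Query2: {len(query2)} nt, {percent_identity(query2, hit2):.0f}% identity")
```

Both come out at 75%, so identity alone cannot distinguish the long, convincing match from the short one that is expected by chance.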

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

BLAST gives similarity + probability (match length, PI, size of database...)

ORTHOLOGY depends on biological knowledge, the nature of the query sequence and the phylogenetic relationship.

Are there any useful guidelines, though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <----------------------------------------> smaller/better

E-values (left to right): 10^-5, 10^-10, 10^-40, 10^-100, 0.0

Verdict (left to right): fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...


(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So Whatrsquos in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • lsquoOtherrsquo-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 38: Bioinformatics Workshop 1 Sequences and Similarity Searches

Metazome

Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as the best top hit, and have a look at the Metazome alignment(s)...

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gaps...

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence:

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment.

But where should this process stop?

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
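The trade-off between matches and gap costs can be sketched in a few lines of Python. This is only a minimal illustration of a standard global alignment score (Needleman-Wunsch style dynamic programming with a fixed cost per gap position); the function name `alignment_score` and the scoring values (match +1, mismatch -1) are assumptions chosen for the example, not BLAST's own defaults.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a fixed cost per gap position."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# The two random sequences from the slide
s1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
s2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

print("cost-free gaps:", alignment_score(s1, s2, gap=0))
print("gap cost of 2: ", alignment_score(s1, s2, gap=-2))
```

With `gap=0` the random sequences score well simply because gaps are free; once each gap position costs something, the best achievable score can only go down.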

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

Flavour   Query sequence                     Database sequences                 Speed
BLASTn    nucleotide                         nucleotide                         FAST
BLASTp    protein                            protein                            FAST
BLASTx    nucleotide (6 frame translation)   protein                            SLOW
tBLASTn   protein                            nucleotide (6 frame translation)   SLOWER
tBLASTx   nucleotide (6 frame translation)   nucleotide (6 frame translation)   HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT

Example protein query: MQWCGYRWTYQGYRW
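The '6 frame translation' step used by the translated flavours can be sketched as follows. This is a minimal illustration assuming the standard genetic code; the helper names (`translate`, `six_frames`) are hypothetical and are not part of BLAST itself.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate a nucleotide sequence in all three forward and three reverse frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, protein in six_frames("ACGATAGATCCCATCCATAAAT").items():
    print(frame, protein)
```

Each nucleotide sequence becomes six candidate protein sequences, which is one reason the translated searches are slower: tBLASTx has to do this for both the query and every database sequence.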

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

1st database sequence

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

Best alignment found after trying the query at each position along the database sequence:

     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
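The word and index idea can be sketched in Python as below. This is a simplified illustration, not BLAST's actual implementation: it keys the index on the word itself rather than on a word number, uses 8-letter words as on the slides, and the function names `build_word_index` and `seed_hits` are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size=8):
    """Map every word in the database to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))  # 1-based positions
    return index


def seed_hits(query, index, word_size=8):
    """List (query position, sequence id, database position) for every shared word."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            hits.append((qpos + 1, seq_id, dpos))
    return hits


# The three database sequences and the query shown on the slides
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database)
for hit in seed_hits(query, index)[:5]:
    print(hit)
```

The index is built once per database, so each query only costs one lookup per query word instead of a comparison against every database sequence.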

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11: the two sequences below are clearly similar, yet their longest run of consecutive identical bases is only 10, so no 11-base word is available to seed the alignment.

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
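The missed-seed problem can be checked directly. The sketch below lists the positions at which the two sequences above are identical over a whole word when aligned column by column; the function name `exact_words_in_register` is an assumption made for this example.

```python
def exact_words_in_register(a, b, word_size):
    """1-based start positions where a and b are identical over word_size consecutive columns."""
    shortest = min(len(a), len(b))
    return [
        i + 1
        for i in range(shortest - word_size + 1)
        if a[i:i + word_size] == b[i:i + word_size]
    ]


seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

for word_size in (11, 10, 7):
    print(word_size, exact_words_in_register(seq_a, seq_b, word_size))
```

At word size 11 the list is empty, so this alignment would never be seeded; dropping the word size to 10 or 7 recovers seed positions, at the cost of a slower search.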

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also 'expect value' or 'expectation'.

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28 (taking 4^60 as roughly 10^36)

E-value = 5.0 x 10^-28
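The same arithmetic can be reproduced in Python. This is a first-principles sketch only, assuming all four nucleotides are equally likely and ignoring gaps and mismatches; it is not how BLAST itself computes its reported Expect values, and the function name is an assumption for the example.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of chance exact matches, assuming equally likely letters."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # approximate database size quoted in the text

for query in ("A", "AC", "ACG"):
    print(query, expected_exact_matches(REFSEQ_MRNA_LETTERS, len(query)))

print("60-mer", expected_exact_matches(REFSEQ_MRNA_LETTERS, 60))
```

Using the exact value of 4^60 (about 1.3 x 10^36) rather than the rounded 10^36 gives roughly 3.8 x 10^-28, the same order of magnitude as the hand calculation.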

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
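The exercise arithmetic can be checked with a short Python sketch. It assumes equally likely bases, reads a bracketed group such as [TG] as 'any one of these bases', and ignores the small edge effect at the ends of the sequence; the function names are assumptions made for the example.

```python
import re


def position_choices(motif):
    """Number of acceptable bases at each motif position, e.g. [TG] counts as 2."""
    return [len(token.strip("[]")) for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif)]


def expected_hits(motif, sequence_length, alphabet_size=4):
    """Expected chance occurrences of the motif in a random sequence of the given length."""
    probability = 1.0
    for choices in position_choices(motif):
        probability *= choices / alphabet_size
    return sequence_length * probability


print(position_choices("ACC[TG]TA"))
print(expected_hits("ACC[TG]TA", 10_000))
```

For ACC[TG]TA in 10,000 nt this gives 10000 / 2048, i.e. a little under 5 sites expected by chance.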

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database  Letters       E-value
RefSeq    5.0 x 10^8    5.0e-28
nr        1.4 x 10^10   1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length, PI, size of database...

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse ... smaller/better

E-values:  10^-5     10^-10       10^-40        10^-100       0.0
           fantasy   borderline   encouraging   pretty good   can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue  Substitutes
A        S C
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (roughly 3 acceptable residues out of 20) rather than 1/20th.

So now we get E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9, which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
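The two protein estimates can be reproduced with the same first-principles arithmetic as for nucleotides. This sketch only illustrates the hand calculation (a strict 1-in-20 match versus an assumed effective 1-in-7 match per position); it is not the statistics BLAST actually uses, and the names are assumptions made for the example.

```python
def expected_matches(database_letters, query_length, choices_per_position):
    """Expected chance matches when each position matches with probability 1/choices_per_position."""
    return database_letters / choices_per_position ** query_length


REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted in the BLAST output above

strict = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)  # exact residue required
relaxed = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)  # residue or a close substitute

print(f"strict  (1/20 per position): {strict:.1e}")
print(f"relaxed (1/7 per position):  {relaxed:.1e}")
```

Allowing substitutions moves the estimate by about nine orders of magnitude, which accounts for most of the gap between the naive value and the Expect value reported by BLAST.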

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in and tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
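When many searches need the same options, the command line can be assembled from a script. The sketch below only builds the argument list for the 'blastall' form shown above, using flags from the list (-e, -W, -q); it does not run anything, and the file name `query.fasta`, the database name `refseq_rna` and the helper `blastall_command` are hypothetical placeholders.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list; each keyword becomes a -flag followed by its value."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical file and database names, for illustration only
command = blastall_command("blastn", "query.fasta", "refseq_rna", e=100, W=7, q=-1)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where blastall and a pre-indexed database are installed; any option left out keeps its default for the chosen flavour.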

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful).

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the 'Choose filter [ ] Low complexity' BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

Metazome Exercise

Go back to Entrez Gene and look for your favourite gene again

Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.

Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.

See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.

But first we have to consider the implication of gaps…

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG

ATGCATGCTGGCCAACGGATGTCCTG
||| | || | | |    |   |
ATGAAAGCCGCCTACGAAAGTCCTG

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| |||||||||||   |    |     | |    | | |     |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So, in general, we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop

The Downside of Gaps

Take two random sequences with no ‘real’ similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow ‘cost free’ gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | | | | | ||| |||| ||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this, we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of ‘finding gapped alignments’.

We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
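The trade-off between matches and gap costs can be sketched in a few lines of Python. This is only an illustration: the match, mismatch and gap values below are assumed for the example and are not BLAST's defaults, and the function name is hypothetical. It scores the gapped alignment of the two 'obviously similar' sequences from the Gaps in Alignments slide.

```python
# Assumed scoring values for illustration only (not BLAST's defaults).
MATCH, MISMATCH = 1, -1
GAP_CHARS = "-="  # the slides draw gaps as '=' or '-'


def score_alignment(top: str, bottom: str, gap_cost: int):
    """Return (matches, mismatches, gaps, score) for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a in GAP_CHARS or b in GAP_CHARS:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * MATCH + mismatches * MISMATCH - gaps * gap_cost
    return matches, mismatches, gaps, score


top = ("TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTC"
       "GCCCCAAAATCAAGCTCACCCCGTCCAAGAA")
bottom = ("TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC"
          "=CCCCAAAATCAAGCGCACCCCGTCCCAGAA")

print(score_alignment(top, bottom, gap_cost=2))  # gaps are charged
print(score_alignment(top, bottom, gap_cost=0))  # 'cost free' gaps
```

With a non-zero gap cost, two well-placed gaps still pay for themselves here because they rescue a long run of matches; an alignment that needs a gap every few bases would not.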

BLAST

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best.

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query -> COMPARE against database -> LIST MATCHES

Flavours of BLAST

Program   query sequence   other operation                  database sequences   speed
BLASTn    nucleotide       none                             nucleotide           FAST
BLASTp    protein          none                             protein              FAST
BLASTx    nucleotide       6 frame translation (query)      protein              SLOW
tBLASTn   protein          6 frame translation (database)   nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation (both)       nucleotide           HORRIBLY SLOW

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is

query
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
1st database sequence

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible ‘words’

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse QUERY SEQUENCE against the numbered list of all possible ‘words’

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker.

The highest scoring alignments are reported
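The word-and-index idea can be sketched as follows. This is a minimal illustration in Python, not BLAST's actual implementation: it uses the three database sequences and the query from the slides, a word size of 8 as drawn there, and hypothetical function names. Words are keyed by their text rather than by a word number, and positions are counted from 0.

```python
from collections import Counter, defaultdict

WORD_SIZE = 8  # word length used on the slides (BLASTn's default is 11)

database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"


def words(seq, size=WORD_SIZE):
    """Yield (position, word) for every overlapping word in seq."""
    for pos in range(len(seq) - size + 1):
        yield pos, seq[pos:pos + size]


def build_index(db):
    """Map each word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for pos, word in words(seq):
            index[word].append((seq_id, pos))
    return index


def seed_hits(query_seq, index):
    """Count word hits per (sequence id, query offset - database offset)."""
    diagonals = Counter()
    for q_pos, word in words(query_seq):
        for seq_id, db_pos in index.get(word, ()):
            diagonals[(seq_id, q_pos - db_pos)] += 1
    return diagonals


index = build_index(database)   # done once, before any query arrives
hits = seed_hits(query, index)  # one dictionary lookup per query word
best, count = hits.most_common(1)[0]
print(best, count)
```

Only the sequences and relative positions that collect word hits need to be expanded into full alignments, which is where the speed comes from.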

But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
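A quick check of why this pair is missed, sketched in Python with hypothetical helper names: a word hit at this alignment needs a run of identical bases at least as long as the word size, and the longest run here is shorter than 11.

```python
top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"


def longest_identical_run(a, b):
    """Length of the longest stretch of identical aligned positions."""
    best = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


def can_seed(a, b, word_size):
    """True if this ungapped alignment contains at least one exact word match."""
    return longest_identical_run(a, b) >= word_size


identical = sum(x == y for x, y in zip(top, bottom))
print(identical, len(top), longest_identical_run(top, bottom))
print(can_seed(top, bottom, 11), can_seed(top, bottom, 7))
```

The two sequences are identical at 50 of 56 positions, yet with a word size of 11 there is nothing to seed from at this alignment; dropping the word size to 7 would find it.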

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97
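The Identities and Gaps figures in this output can be recounted directly from the two aligned rows. The Python below is a small sketch with a hypothetical function name; it does not compute Positives, which would need a substitution matrix. Integer division is used for the percentages, an assumption that happens to reproduce the whole-number values printed in this output.

```python
query_row = (
    "MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL"
    "SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA"
)
subject_row = (
    "MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV"
    "SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA"
)


def alignment_stats(q_row, s_row):
    """Return (identities, gaps, alignment length) for two aligned rows."""
    identities = gaps = 0
    for q, s in zip(q_row, s_row):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identities += 1
    return identities, gaps, len(q_row)


identities, gaps, length = alignment_stats(query_row, subject_row)
print(f"Identities = {identities}/{length} ({identities * 100 // length}%)")
print(f"Gaps = {gaps}/{length} ({gaps * 100 // length}%)")
```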

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also “expect value” or “expectation”

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
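The same first-principles arithmetic can be written as a short Python function. This is a sketch of the simple model used on these slides (exact, ungapped matches in random sequence), not of the statistics BLAST itself uses; the function name is hypothetical.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Expected exact matches of a query in random sequence of db_letters."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA = 5.0e8   # ~letters in RefSeq mRNA (from the slide)
NR = 1.4e10           # ~letters in nr (from the slide)

for length in (1, 2, 3, 60):
    print(length, expected_chance_matches(REFSEQ_MRNA, length))

# The expectation scales linearly with database size.
ratio = expected_chance_matches(NR, 60) / expected_chance_matches(REFSEQ_MRNA, 60)
print(ratio)
```

Note that 4^60 is about 1.3 x 10^36, so the exact 60-mer figure is about 3.8 x 10^-28; the slide rounds 4^60 to 10^36, which gives 5.0 x 10^-28.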

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for RefSeq mRNA was 5.0e-28)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4x4x4 = 64 nt

Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64x2 nt = 128 nt

Expect ‘TA’ every 4x4 = 16 nt
Expect ‘ACC[TG]TA’ every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value
To allow either means we expect twice as many
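The answer can be checked with a few lines of Python. This is a sketch under the same simplifying assumptions as the slide (all four bases equally likely, positions independent, edge effects ignored). The second motif models the optional extra base as 'any nucleotide', which is an assumption made here so that both motifs have the same chance per position, as the answer states; the function name is hypothetical.

```python
def chance_per_position(motif):
    """Probability that a motif matches at one given position.

    motif is a list with the set of allowed bases at each motif position.
    """
    p = 1.0
    for allowed in motif:
        p *= len(allowed) / 4
    return p


PROMOTER_LENGTH = 10_000

motif = ["A", "C", "C", "TG", "T", "A"]                # ACC[TG]TA
with_extra = ["A", "C", "C", "TG", "ACGT", "T", "A"]   # assumed: any base inserted

p = chance_per_position(motif)
expected = PROMOTER_LENGTH * p
expected_either = PROMOTER_LENGTH * (p + chance_per_position(with_extra))

print(1 / p, round(expected, 1), round(expected_either, 1))
```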

E-values Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = ‘AC’
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = ‘ACG’
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = ‘ACGTCGA…CTGATTCG’ - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

BLAST the same sequence against each

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our ‘match’ criteria

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST 500 nt sequence against a database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn)

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question

But consider two very different searches, both of which give a 75% identity match

Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
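A short Python check (hypothetical function name) confirms that percent identity alone cannot tell these two matches apart: both come out at exactly 75%, even though one spans 60 positions and the other only 16.

```python
def percent_identity(a, b):
    """Percentage of aligned positions that are identical (ungapped rows)."""
    if len(a) != len(b):
        raise ValueError("rows must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```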

So what’s the real problem?

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because

BLAST: Similarity + Probability

ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <-----------------------------------------------------> smaller/better

E-values  10^-5     10^-10      10^-40       10^-100      0.0
          fantasy   borderline  encouraging  pretty good  can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Substitutes
A        S C
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids count at each position and each position has about a 1/7th chance of ‘matching’ rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices
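The two protein estimates can be reproduced with the same simple model, sketched here in Python. It assumes exact independence between positions and uses the rounded database size and the rough '1 in 7' figure from the slide; it is not how BLAST derives its E-values, which rest on substitution-matrix scores. The function name is hypothetical.

```python
def expected_chance_matches(db_letters, length, odds_per_position):
    """Expected chance matches when each position matches 1 time in `odds`."""
    return db_letters / odds_per_position ** length


DB_LETTERS = 3.5e8   # ~letters in the RefSeq protein database (from the slide)
LENGTH = 20          # length of the matched peptide

exact_only = expected_chance_matches(DB_LETTERS, LENGTH, 20)
with_substitutions = expected_chance_matches(DB_LETTERS, LENGTH, 7)

print(f"{exact_only:.1e}")
print(f"{with_substitutions:.1e}")
```

Using the exact value of 20^20 (about 1.05 x 10^26) gives about 3.3 x 10^-18 for the first figure, close to the slide's rounded 3.5 x 10^-18.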

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database

Did you find any ‘significant’ hits?

Repeat with a second sequence

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in and tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by a -x and followed by its value

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers
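As an illustration of how these options combine, the Python sketch below assembles a blastall command line from the flags listed above and prints it; it does not run BLAST. The helper function, the query file name morpholino.fa and the database name nr are hypothetical, and writing the low complexity filter switch as 'F' for off is an assumption about how blastall expects TRUE/FALSE values to be given.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list; options are single-letter flags."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical short-oligo search: relaxed E-value limit, small word size,
# low complexity filter off, mismatch penalty -1.
command = blastall_command(
    "blastn", "morpholino.fa", "nr",
    e=100, W=7, F="F", q=-1,
)
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each option and its value together, so the same list could later be handed to a process launcher without further quoting.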

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page
(hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box

Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section, and then run BLASTx against the nr database

What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time
Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C
C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn - low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON

OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT

Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window

(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page
Go to the PROTEIN BLAST section, BLASTp, and paste the sequence

Run the search with the default values

Now re-run the search, setting the maximum E-value in the box

Expect 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page
Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search, setting the following parameters

Low complexity OFF
Expect 100
Word Size 7
Other advanced -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

Part 3 Finding Sequence Similarities

We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance

But first we have to consider the implication of gapshellip

Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments

ATGCATGCTGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| ||| | ||||||

ATGCATGCTGGCCAACGGATGTCCTG

ATGAAAGCCGCCTACGAAAGTCCTG||| | || ||| | | | |

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop

The Downside of Gaps

Take two random sequences with no 'real' similarity:

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow 'cost free' gaps:

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | | | | | ||| |||| ||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.

To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.

We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
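The trade-off between matches and gap costs can be sketched with a small dynamic-programming scorer. This is a minimal illustration of a standard global alignment score (Needleman-Wunsch with a linear gap cost), not the method BLAST itself uses; the function name and the scoring values (+1 match, -1 mismatch, gap cost 0 or -2) are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap cost."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# The two random sequences from the slide
seq1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
seq2 = "ACGCATCCGGATGTGCACCGTGGAACTG"

free_gaps = global_alignment_score(seq1, seq2, gap=0)     # 'cost free' gaps
costly_gaps = global_alignment_score(seq1, seq2, gap=-2)  # gaps are penalised
print(free_gaps, costly_gaps)
```

With free gaps the score is simply the largest number of bases that can be lined up in order, so even unrelated sequences look good; once each gap costs something, the same pair scores much lower.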

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

query:

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

database:

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

[Diagram: query → COMPARE against database → LIST MATCHES]

Flavours of BLAST

Program   Query sequence   Other operation                          Database sequences   Speed
BLASTn    nucleotide       none                                     nucleotide           FAST
BLASTp    protein          none                                     protein              FAST
BLASTx    nucleotide       6-frame translation of the query         protein              SLOW
tBLASTn   protein          6-frame translation of the database      nucleotide           SLOWER
tBLASTx   nucleotide       6-frame translation of both              nucleotide           HORRIBLY SLOW

Example nucleotide query: ACGATAGATCCCATCCATAAAT

Example protein query: MQWCGYRWTYQGYRW

Example nucleotide database sequences: ATGACGATAGATCCCATCAT, CGATAGGACCACCACA, GATAGACCAGGATACATAGGATAATTA, AGCTCGCTTGGCTCGATGGCT

Example protein database sequences: MKJLSPWERSYTRGHYTWER, MGHTVNBZY, MKLPWRHGDBKJGMNDFD, MBKLRPIUHDFRTASGSLKWWRTVBN
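To make '6 frame translation' concrete, here is a minimal sketch that lists the six reading frames of a nucleotide sequence: three offsets on the given strand and three on its reverse complement. The function names and the frame labels (+1 to -3, in the style of the 'Frame = +2' field in BLAST output) are choices made for this illustration; turning each frame into amino acids would additionally need a codon table, which is left out here.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(seq):
    """Return the six reading frames, each trimmed to a whole number of codons."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            usable = (len(s) - offset) // 3 * 3  # drop any trailing partial codon
            frames[f"{strand}{offset + 1}"] = s[offset:offset + usable]
    return frames


for name, frame in six_frames("ACGATAGATCCCATCCATAAAT").items():
    print(name, frame)
```

Each translated search has to consider all six of these frames for every translated sequence, which is one reason the translated flavours are slower.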

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:                 CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
1st database sequence: CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence one position at a time, with the matching positions marked at each offset.]

Allowing one gap in the database sequence gives the best alignment:

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…
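The 'test every alignment' idea can be sketched as a brute-force loop that slides the query along one database sequence and counts identical positions at each offset. This is only an illustration of why the naive approach is slow (work grows with query length times database length, before gaps are even considered); the function name is an assumption, and gaps are ignored.

```python
def best_ungapped_offset(query, target):
    """Slide query along target; return (offset, matches) for the best ungapped overlap.

    offset is the 0-based position in target that lines up with the first
    base of the query (it may be negative if the query overhangs the start).
    """
    best = (0, -1)
    for offset in range(-len(query) + 1, len(target)):
        matches = 0
        for i, base in enumerate(query):
            j = offset + i
            if 0 <= j < len(target) and target[j] == base:
                matches += 1
        if matches > best[1]:
            best = (offset, matches)
    return best


# Query and 1st database sequence from the slide
query = "CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT"
target = "CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC"
print(best_ungapped_offset(query, target))
```

Because the true alignment needs one gap, no single offset lines up both halves of the match at once, which is the limitation the gapped alignment above addresses.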

BLAST achieves its speed through two strategies:

- it takes a WORD-based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

word      number
AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
…
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
…
GACAAATC  33568
GACAAATG  33569
…
TCCAAACC  64321
TCCAAACC  64322

Build a position index of all words in the database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
…
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

word      number
AAAAAAAA  00001
AAAAAAAC  00002
AAAAAAAG  00003
…
ACAAATCC  07967
ACAAATCC  07968
ACAAATCC  07979
…
GACAAATC  33568
GACAAATG  33569
…
TCCAAACC  64321
TCCAAACC  64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33568
3         07967

Index of database:

sequence  position  word
1         1         33568
1         2         07967
1         3         16210
…
3         15        33568
3         16        07967
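The word-and-index idea can be sketched in a few lines. This is a toy illustration of the scheme described on these slides, not how the BLAST software is actually implemented: it stores the words themselves as dictionary keys rather than word numbers, and the function names and the word size of 8 (taken from the 8-letter words shown above) are assumptions for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word in the database to the (sequence id, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence id, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((qpos + 1, seq_id, dpos))
    return seeds


# Database and query sequences from the slides
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, 8)   # done once, ahead of time
seeds = find_seeds(query, index, 8)     # fast dictionary lookups per query
print(seeds[:3])
```

The index is built once per database, so each search only pays for looking up the query's own words; every seed records which sequence matched and at what relative position.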

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

The longest run of identical bases here is 10, one short of the word size, so no word match is found even though the sequences are clearly similar.

Care is sometimes needed…
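Whether a pair like this can be picked up at a given word size is easy to check: a word match requires a run of identical bases at least as long as the word. A minimal sketch (the function name is an assumption for this example):

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical positions in two aligned sequences."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


# The two sequences from the slide, aligned without gaps
seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

run = longest_identical_run(seq_a, seq_b)
print(run)        # 10
print(run >= 11)  # False: no 11-letter word in common
```

Dropping the word size below the longest run (for example to 7) would let the search seed on this pair, at the cost of more seeds to extend.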

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31, Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%), Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The E-value (also "expect value" or "expectation") is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4800000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

[Figure: every occurrence of the query is marked under this example sequence; 'A' occurs many times, 'AC' only a few times, 'ACG' once.]

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28

E-value = 5.0 x 10^-28
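The same first-principles arithmetic can be written as a one-line function. This is a sketch of the simplified model used on this slide (exact ungapped matches, all four bases equally likely), not of the statistics BLAST actually reports; the function name is an assumption.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact matches to a query of this length in random sequence."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # ~RefSeq mRNA, from the statistics above
nr_letters = 1.4e10     # ~nr, from the statistics above

for length in (1, 2, 3, 60):
    print(length, f"{expected_chance_matches(refseq_letters, length):.1e}")

print(f"{expected_chance_matches(nr_letters, 60):.1e}")
```

Note that 4 multiplied by itself 60 times is about 1.3 x 10^36, which the slide rounds to 10^36, so the unrounded result for the 60-mer comes out a little smaller (about 3.8 x 10^-28). The result scales in direct proportion to the number of letters in the database.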

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 2e-26, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060, Score = 119 bits (60), Expect = 6e-25, Identities = 60/60 (100%), Gaps = 0/60 (0%), Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

For comparison, the theoretical value for the RefSeq mRNA search was 5.0e-28.

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.

Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 = 128 nt.

Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
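The worked answer above can be checked with a short calculation. This is a sketch under the same simplifying assumptions (every base equally likely, positions independent); the function name and the tiny motif syntax it accepts (plain bases, [XY] classes, and N standing for 'any base') are assumptions made for this example.

```python
from fractions import Fraction


def motif_probability(motif):
    """Chance that a motif matches starting at a given position in random DNA."""
    prob = Fraction(1)
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            close = motif.index("]", i)
            prob *= Fraction(close - i - 1, 4)  # number of allowed bases out of 4
            i = close + 1
        else:
            prob *= Fraction(4 if motif[i] == "N" else 1, 4)  # N matches any base
            i += 1
    return prob


p = motif_probability("ACC[TG]TA")
print(p)                  # 1/2048
print(float(10_000 * p))  # 4.8828125, i.e. ~5 per 10k

# A variant with one unconstrained extra base has the same probability,
# so allowing either form roughly doubles the expected count.
p_variant = motif_probability("ACC[TG]NTA")
print(float(10_000 * (p + p_variant)))
```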

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had ~5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26

E-value = 1.4 x 10^-26 (was E-value = 5.0 x 10^-28)

The E-value is simply dependent on database size. BLAST the same sequence against each:

Database  Size                 E-value
RefSeq    5.0 x 10^8 letters   5.0e-28
nr        1.4 x 10^10 letters  1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160.

BLAST half of the same sequence against the same database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80.

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

[Diagram: BLAST gives similarity + probability; getting from there to ORTHOLOGY is difficult because it also involves biological knowledge, the nature of the query sequence, phylogenetic relationship, match length, PI, size of database…]

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value   Verdict            (larger = worse, smaller = better)
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue  Can be substituted by
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
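The two back-of-envelope protein estimates can be reproduced directly. As before, this is only the simplified counting argument from the slides (independent positions, no gaps), not the statistics BLAST really computes; the function name is an assumption.

```python
def expected_chance_matches(database_letters, length, choices_per_position):
    """Expected chance matches when each position matches with probability 1/choices_per_position."""
    return database_letters / choices_per_position ** length


refseq_protein_letters = 347_895_532  # from the BLAST output shown earlier

strict = expected_chance_matches(refseq_protein_letters, 20, 20)  # exact residue required
relaxed = expected_chance_matches(refseq_protein_letters, 20, 7)  # ~3 of 20 residues acceptable

print(f"{strict:.1e}")
print(f"{relaxed:.1e}")
```

Allowing substitutions raises the estimate by roughly nine orders of magnitude, which is why it lands so much closer to the reported Expect of 8e-04; the remaining gap is in the direction expected from also allowing gapped alignments.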

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
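When many searches are scripted, it can help to assemble the command line programmatically. The sketch below only builds the argument list, using flags listed above (-p, -i, -d, -e, -W); it does not run anything. The helper name, the file name 'query.fasta' and the database name 'refseq_mrna' are hypothetical placeholders for this example.

```python
import shlex


def blastall_command(program, query_file, database, max_evalue=None, word_size=None):
    """Build a blastall argument list; optional flags are added only when a value is given."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_evalue is not None:
        args += ["-e", str(max_evalue)]
    if word_size is not None:
        args += ["-W", str(word_size)]
    return args


# Hypothetical file and database names, for illustration only
cmd = blastall_command("blastn", "query.fasta", "refseq_mrna", max_evalue=100, word_size=7)
print(shlex.join(cmd))
```

Any parameter left out falls back to the default for the chosen flavour, as described above; the list could be passed to `subprocess.run` on a machine where the program and database are installed.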

BLAST Parameters Exercises

1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)

BLASTn - low complexity filtering ON

Results for Exercise 2B

ON / OFF

Results for Exercise 2C

ON / OFF

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises

3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

BLASTn / tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) / Mismatch penalty = -1

BLAST Parameters Exercises

5 E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises

6 Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

Page 41: Bioinformatics Workshop 1 Sequences and Similarity Searches

Gaps in Alignments

Consider these two obviously similar sequences

TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA | || | || |||||||||||||||||||| ||||||||| ||| ||| | ||| | | |TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

In fact we realise that the most probable alignment (regarding biological origin) is with a small gap in each sequence

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

So in general we allow ourselves to insert gaps until we find the optimal alignment

But where should this process stop

The Downside of GapsTake two random sequences with no lsquorealrsquo similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow lsquocost freersquo gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly although the alignment has no mismatches it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo

We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip

BLAST

gtqueryAGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best

gttarget1AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGgttarget2CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG gttarget3GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT GGAAAAAAAAATACTGGCTTCC gttarget4CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

ACGATAGATCCCATCCATAAAT ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

query sequence other operation database sequences

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

BLASTn

BLASTp

BLASTx

tBLASTn

tBLASTx

ACGATAGATCCCATCCATAAAT

ACGATAGATCCCATCCATAAAT

MQWCGYRWTYQGYRW

MQWCGYRWTYQGYRW

FAST

FAST

SLOW

SLOWER

HORRIBLY

SLOW

6 fra

me

trans

latio

n

How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like thishellip

BLAST achieves its speed through two strategies

- it takes a WORD based approach- it pre-INDEXES database sequences

BLAST WORDS and INDEXING1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible lsquowordsrsquo

Build a position index of all words in the database

Analyse the Query Sequence

QUERY SEQUENCE:

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible ‘words’:

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Expand from Word Based Matches

We ‘instantly’ know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments that have no word-sized stretch in common. Consider BLASTn with a default word size of 11: in the alignment below the longest run of identical bases is 10, so no word matches.

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
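The word-and-index strategy can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions (exact word matches, a plain in-memory dictionary, no extension or scoring step), not how BLAST itself is implemented. The function names `build_word_index` and `find_seeds` are made up for this example; the sequences are the ones shown on the slides.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word in the database to the (sequence id, position) pairs where it starts."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            # Positions are reported 1-based, as on the slides.
            index[seq[pos:pos + word_size]].append((seq_id, pos + 1))
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence id, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for seq_id, dpos in index.get(word, []):
            seeds.append((qpos + 1, seq_id, dpos))
    return seeds


# The three database sequences and the query from the indexing slides.
database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"
seeds = find_seeds(query, build_word_index(database, 8), 8)
print(seeds[:3])

# The pair of similar sequences that a word size of 11 misses.
seq_a = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
seq_b = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"
seeds_w11 = find_seeds(seq_a, build_word_index({"b": seq_b}, 11), 11)
seeds_w8 = find_seeds(seq_a, build_word_index({"b": seq_b}, 8), 8)
print(len(seeds_w11), len(seeds_w8))
```

Only the seeds would then be passed on to the (much slower) extension step, which is why a smaller word size finds more but costs more time.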

BLAST – Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97
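The ‘Frame = +2’ line means the match was found in the second forward reading frame of the 6-frame translation of the query. A rough Python sketch of that translation step follows, assuming the standard genetic code and treating any unrecognised codon as X; the helper names are illustrative only, and the input is just the first 70 nt of the cDNA above.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# First 70 nt of the Xenopus tropicalis cDNA used as the BLASTx query.
cdna_start = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTC"
frames = six_frames(cdna_start)
for name, protein in sorted(frames.items()):
    print(name, protein)
```

Frame +2 of this fragment contains MSGKIEKADSQRSHL, the start of the Query line in the alignment, beginning at nucleotide 26.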

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a ‘typical’ weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also “expect value” or “expectation”.

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’ (60-mer): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28

E-value = ~5.0 x 10^-28
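The same back-of-envelope arithmetic can be written as a small Python function. This is only the naive first-principles estimate from the slides (exact matches, equal base frequencies, no gaps), not the statistics BLAST really uses; the function name is made up for this sketch. Note that 4^60 is closer to 1.3 x 10^36 than to 10^36, so the unrounded estimate comes out slightly below the slide’s figure.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Naive estimate: database size divided by the number of possible words of this length."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~503,566,580 letters, rounded as on the slide

for length in (1, 2, 3, 60):
    estimate = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:2d}-mer: {estimate:.2g}")

sixty_mer = expected_chance_matches(REFSEQ_MRNA_LETTERS, 60)
```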

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4x4x4 = 64 nt
Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64x2 nt = 128 nt
Expect ‘TA’ every 4x4 = 16 nt
Expect ‘ACC[TG]TA’ every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
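The answer can be checked with a short Python sketch. It assumes equal base frequencies and, like the slide, ignores edge effects at the ends of the promoter. The function name is hypothetical, and the second motif (any base N inserted between the 4th and 5th positions) is my reading of the exercise, since the slide text for it did not survive extraction.

```python
def expected_motif_hits(sequence_length, motif_positions):
    """Expected chance occurrences of a motif, given the allowed bases at each position."""
    probability = 1.0
    for allowed_bases in motif_positions:
        probability *= len(allowed_bases) / 4
    return sequence_length * probability


# ACC[TG]TA: one entry per motif position, listing the bases allowed there.
motif = ["A", "C", "C", "TG", "T", "A"]
# Assumed variant with any extra base between the 4th and 5th positions.
motif_with_extra_base = ["A", "C", "C", "TG", "ACGT", "T", "A"]

base_hits = expected_motif_hits(10_000, motif)
variant_hits = expected_motif_hits(10_000, motif_with_extra_base)
print(base_hits, variant_hits, base_hits + variant_hits)
```

Under that assumption the extra position multiplies the probability by 4/4 = 1, so both motifs have the same expectation and allowing either roughly doubles the count.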

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = ‘AC’: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = ‘ACG’: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = ‘ACGTCGA…CTGATTCG’ (60-mer): expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26

E-value = ~1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size. BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our ‘match’ criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
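To get a feel for how much allowing gaps inflates N, here is a deliberately crude Python sketch. It only counts database strings that equal the query with one extra base inserted somewhere inside it (so a single one-base gap in the query, free of any gap penalty), and it reuses the naive 1-in-4-per-base estimate. Real gapped statistics are more involved; the function names are invented for this illustration.

```python
def single_insertion_variants(query):
    """All distinct strings formed by inserting one base at an internal position of the query."""
    variants = set()
    for i in range(1, len(query)):  # internal positions only
        for base in "ACGT":
            variants.add(query[:i] + base + query[i:])
    return variants


def expected_hits(db_letters, targets):
    """Naive expectation: each target of length n turns up db_letters / 4**n times by chance."""
    return sum(db_letters / 4 ** len(target) for target in targets)


query = "ACGTACGTACGT"
db_letters = 5.0e8  # RefSeq mRNA, rounded as on the slides

variants = single_insertion_variants(query)
exact = expected_hits(db_letters, [query])
gapped = exact + expected_hits(db_letters, variants)
print(f"exact only: {exact:.1f}, allowing one gap: {gapped:.1f}")
```

Each gapped variant is four times rarer than the exact match, but there are many of them, so the total expectation still grows several-fold.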

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

BLASTn: get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ 5.0 x 10^-19.

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
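A quick Python check that the two alignments really do share the same percent identity, even though their E-values differ enormously. The helper name is illustrative, and it assumes two ungapped sequences of equal length, as in these examples.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that are identical (ungapped, equal-length sequences)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)
identity2 = percent_identity(query2, match2)
print(len(query1), identity1)
print(len(query2), identity2)
```

Identity alone cannot distinguish the two cases; the length of the match (and the size of the database) has to come into it, which is what the E-value does.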

So what’s the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

[Figure: BLAST gives similarity + probability (match length, PI, size of database…), whereas ORTHOLOGY also depends on biological knowledge (nature of query sequence, phylogenetic relationship)]

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-values (larger = worse, smaller = better):

10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

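Amino Acid Substitutions

Residue: acceptable substitutes

A: S C
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K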

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ (3 acceptable residues out of 20), rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
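The two protein estimates can be reproduced with the same naive formula as for nucleotides, changing only the number of effective choices per position. This is a sketch of the slide’s hand calculation, not of the scoring-matrix statistics BLAST actually applies; the function name is made up.

```python
def expected_chance_matches(db_letters, match_length, choices_per_position):
    """Naive estimate: database size divided by choices_per_position ** match_length."""
    return db_letters / choices_per_position ** match_length


REFSEQ_PROTEIN_LETTERS = 347_895_532  # from the BLAST output shown earlier

# Exact matching only: 20 possible amino acids at each position.
strict = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
# Allowing ~2 acceptable substitutes per residue: roughly a 1-in-7 chance per position.
relaxed = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact residues only: {strict:.1e}")
print(f"with substitutions:  {relaxed:.1e}")
```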

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers
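If you run many searches it can be handy to assemble the command line in a script. The Python sketch below only builds and prints a blastall command using flags from the list above (-p, -i, -d, -e, -W); it does not run anything. The file name morpholino.fa and the helper function are hypothetical, and it assumes the legacy blastall program described here rather than any newer BLAST release.

```python
import shlex


def build_blastall_command(program, query_file, database, evalue=None, word_size=None):
    """Assemble a blastall argument list; optional flags are added only when given."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        command += ["-e", str(evalue)]      # max expected value to report
    if word_size is not None:
        command += ["-W", str(word_size)]   # word size used for the initial matches
    return command


# Hypothetical query file; 'nr' is the database name used in the exercises.
command = build_blastall_command("blastn", "morpholino.fa", "nr", evalue=100, word_size=7)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where blastall and a pre-indexed database are installed.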

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Figures: BLASTn and BLASTx results]

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF)

[Figure: BLASTn – low complexity filtering OFF]

Results for Exercise 2A (ON)

[Figure: BLASTn – low complexity filtering ON]

Results for Exercise 2B

[Figures: filtering ON and OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Figures: filtering ON and OFF]

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Figures: BLASTn and tBLASTx results]

Results for Exercise 4 (ii)

[Figures: mismatch penalty = -2 (default) and mismatch penalty = -1]

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


The Downside of GapsTake two random sequences with no lsquorealrsquo similarity

GACACTAGGTCGATGCGTGGTGGCGAGA

ACGCATCCGGATGTGCACCGTGGAACTG

And allow lsquocost freersquo gaps

GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA || | | | | | ||| |||| || ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Clearly although the alignment has no mismatches it is obviously not biologically meaningful

To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo

We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip

BLAST

gtqueryAGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years

The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best

gttarget1AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGgttarget2CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC TTTTTAATAAAATCTATTTCTATAG gttarget3GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT GGAAAAAAAAATACTGGCTTCC gttarget4CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query

database

COMPARE

LIST MATCHES

Flavours of BLAST

ACGATAGATCCCATCCATAAAT ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

MKJLSPWERSYTRGHYTWER MGHTVNBZY MKLPWRHGDBKJGMNDFD MBKLRPIUHDFRTASGSLKWWRTVBN

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

query sequence other operation database sequences

ATGACGATAGATCCCATCAT CGATAGGACCACCACA GATAGACCAGGATACATAGGATAATTA AGCTCGCTTGGCTCGATGGCT

BLASTn

BLASTp

BLASTx

tBLASTn

tBLASTx

ACGATAGATCCCATCCATAAAT

ACGATAGATCCCATCCATAAAT

MQWCGYRWTYQGYRW

MQWCGYRWTYQGYRW

FAST

FAST

SLOW

SLOWER

HORRIBLY

SLOW

6 fra

me

trans

latio

n

How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is

CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC | | | | | ||||||||||||||||||||||||| CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGTCTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT || | | | | | | | | |||||||||||||||||||||||| | | | | | |

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC||||||||||||||||||||||||| ||||||||||||||||||||||||

query

1st database sequence

This would actually be a very slow search process if implemented like thishellip

BLAST achieves its speed through two strategies

- it takes a WORD based approach- it pre-INDEXES database sequences

BLAST WORDS and INDEXING1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible lsquowordsrsquo

Build a position index of all words in the database

Analyse the Query Sequence gtquery AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible lsquowordsrsquo

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position

Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST ndashTypical OutputINPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUTQuery= (311 letters) Database NCBI Protein Reference Sequences 954378 sequences 347895532 total letters

gtgi|41055060|ref|NP_9574201| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)Expect = 6e-31 Identities = 7698 (77) Positives = 8298 (83) Gaps = 498 (4) Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip

An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip

Also ldquoexpect valueldquo or ldquoexpectationrdquo

E-values From First Principles

Some database statistics (23rd July 2005)

Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)

Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)

Notation

12e-35 = 12 x 10-35

48 x 106 = 4800000

We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do

Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (50 x 108) 4 = ~12 x 108

Expected number of matches = (50 x 108) (4x 4) = ~31 x 107

Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28

E-value = 50 x 10-28

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for the RefSeq mRNA search was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance.

If ACC[TG]TAA is also allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
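The motif calculation above can be checked with a short Python sketch. It assumes equal base frequencies, counts one strand only and ignores edge effects, exactly as the hand calculation does; the function name and the bracket-parsing approach are illustrative choices, not part of the workshop material.

```python
import re


def expected_motif_hits(motif: str, sequence_length: int,
                        alphabet_size: int = 4) -> float:
    """Expected chance occurrences of a simple motif such as 'ACC[TG]TA'.

    A plain letter matches with probability 1/alphabet_size; a bracketed
    class such as [TG] matches with probability (class size)/alphabet_size.
    """
    probability = 1.0
    for token in re.findall(r"\[[A-Z]+\]|[A-Z]", motif):
        allowed = len(token) - 2 if token.startswith("[") else 1
        probability *= allowed / alphabet_size
    return sequence_length * probability


hits = expected_motif_hits("ACC[TG]TA", 10_000)
print(f"one chance hit every {10_000 / hits:.0f} nt, "
      f"about {hits:.1f} hits per 10k")
```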

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (it was RefSeq mRNA with 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example sequence (the chance matches to each query were marked on the slide):

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)

BLAST the same sequence against each:

RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~3.0

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
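To see why identity alone is not enough, the two alignments can be compared with a small Python sketch. The helper name is invented for illustration; the sequences are the ones on the slide. Both pairs score the same percent identity even though one match is nearly four times longer than the other.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions that are identical (ungapped, equal length)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
MATCH1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
MATCH2 = "ACGCACCTTCGTAGGT"

for name, query, match in (("Query1", QUERY1, MATCH1), ("Query2", QUERY2, MATCH2)):
    print(f"{name}: {len(query)} nt, {percent_identity(query, match):.0f}% identity")
```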

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

BLAST: similarity + probability (match length, PI, size of database...)

ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-values (larger = worse, smaller = better):

10^-5: fantasy
10^-10: borderline
10^-40: encouraging
10^-100: pretty good
0.0: can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

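Amino Acid Substitutions

Residue: possible substitutes

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K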

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
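The two protein estimates can be reproduced with the same simple model. This sketch only restates the slide arithmetic (20 choices per position for exact matching, roughly 7 effective choices once substitutions count as matches); it does not use a BLOSUM or PAM matrix, and the names are illustrative.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted on the slide
PEPTIDE_LENGTH = 20


def expected_matches(database_letters: float, length: int,
                     choices_per_position: int) -> float:
    """Expected chance matches when each position has the given number
    of equally likely, distinguishable choices."""
    return database_letters / choices_per_position ** length


exact_only = expected_matches(REFSEQ_PROTEIN_LETTERS, PEPTIDE_LENGTH, 20)
with_substitutions = expected_matches(REFSEQ_PROTEIN_LETTERS, PEPTIDE_LENGTH, 7)

print(f"exact matching only:    {exact_only:.1e}")
print(f"allowing substitutions: {with_substitutions:.1e}")
```

The second figure is many orders of magnitude larger than the first, which is the direction of the discrepancy seen in the real BLAST result.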

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
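As a sketch of how such a command line might be assembled from a script, the Python below builds (but does not run) a blastall command using only the options listed above. The file name "query.fasta" and database name "refseq_rna" are hypothetical placeholders, and the helper function is invented for illustration.

```python
import shlex


def build_blastall_command(program, query_file, database,
                           max_evalue=None, word_size=None):
    """Return the argument list for a blastall run (nothing is executed)."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_evalue is not None:
        command += ["-e", str(max_evalue)]   # -e max expected value
    if word_size is not None:
        command += ["-W", str(word_size)]    # -W word size
    return command


# Hypothetical file and database names, for illustration only.
command = build_blastall_command("blastn", "query.fasta", "refseq_rna",
                                 max_evalue=1e-5, word_size=11)
print(shlex.join(command))
```

Building the command as a list keeps each option and its value together, so further options from the list above can be appended in the same way.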

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B (filtering ON and OFF)

Results for Exercise 2C (filtering ON and OFF)

There is a sequence error, an extra G at position 117 in the sequence:

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

BLAST

There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.

The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.

query:

>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGAGTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACGGTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCACGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGCAACGAA

database:

>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG

query + database -> COMPARE -> LIST MATCHES

Flavours of BLAST

Flavour   Query sequence   Other operation                             Database sequences   Speed
BLASTn    nucleotide       (none)                                      nucleotide           FAST
BLASTp    protein          (none)                                      protein              FAST
BLASTx    nucleotide       6 frame translation of query                protein              SLOW
tBLASTn   protein          6 frame translation of database             nucleotide           SLOWER
tBLASTx   nucleotide       6 frame translation of query and database   nucleotide           HORRIBLY SLOW

Example query sequences: ACGATAGATCCCATCCATAAAT (nucleotide), MQWCGYRWTYQGYRW (protein)

Example database sequences: ATGACGATAGATCCCATCAT, CGATAGGACCACCACA, GATAGACCAGGATACATAGGATAATTA, AGCTCGCTTGGCTCGATGGCT (nucleotide); MKJLSPWERSYTRGHYTWER, MGHTVNBZY, MKLPWRHGDBKJGMNDFD, MBKLRPIUHDFRTASGSLKWWRTVBN (protein)

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query:                  CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
1st database sequence:  CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence and the matching positions are counted at each offset. The best alignment needs a gap (=) in the database sequence:]

CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
     ||||||||||||||||||||||||| |||||||||||||||||||||||||
     CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this...

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE:

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Analyse the QUERY SEQUENCE against the same numbered list of all possible 'words':

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
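The word and index idea can be sketched in Python using the three database sequences and the query from these slides. This is a toy illustration only: it keys a dictionary on the word itself rather than on a word number, uses the 8-letter words shown on the slide, and the function names are invented. Real BLAST is considerably more elaborate.

```python
from collections import defaultdict

WORD_SIZE = 8  # the slide's example words are 8 letters long


def build_word_index(database, word_size=WORD_SIZE):
    """Map every word in the database to its (sequence id, 1-based position) list."""
    index = defaultdict(list)
    for seq_id, sequence in database.items():
        for start in range(len(sequence) - word_size + 1):
            index[sequence[start:start + word_size]].append((seq_id, start + 1))
    return index


def find_word_hits(query, index, word_size=WORD_SIZE):
    """Return (query position, sequence id, database position) for each shared word."""
    hits = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for seq_id, db_position in index.get(word, []):
            hits.append((start + 1, seq_id, db_position))
    return hits


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database)
hits = find_word_hits(query, index)
print(hits[:3])
```

The index is built once for the database; each query then needs only dictionary lookups to find candidate sequences and their relative positions.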

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.
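The scoring rule (matches minus mismatches minus gap penalties) can be illustrated with a small Python sketch. The reward and penalty values here are assumptions chosen for illustration, and the convention that a gap costs an opening cost plus an extension cost per gapped position is likewise an assumption rather than a statement of BLAST's exact defaults.

```python
def score_alignment(query_row, subject_row, match=1, mismatch=-3,
                    gap_open=5, gap_extend=2):
    """Score two aligned rows of equal length, with '-' marking gap positions."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # which row the current run of gaps is in, if any
    for q, s in zip(query_row, subject_row):
        if q == "-" or s == "-":
            current = "query" if q == "-" else "subject"
            # opening a new gap costs more than extending an existing one
            score -= gap_extend if current == gap_row else gap_open + gap_extend
            gap_row = current
        else:
            score += match if q == s else mismatch
            gap_row = None
    return score


print(score_alignment("ACGTACGTACGT", "ACGTACGTACGT"))    # perfect match
print(score_alignment("ACGTACAGTACGT", "ACGTAC-GTACGT"))  # one single-base gap
print(score_alignment("ACGTACGTACGT", "ACGTACCTACGT"))    # one mismatch
```

Lowering the mismatch or gap penalties makes imperfect alignments score relatively better, which is why changing them alters which matches are reported.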

But we can potentially miss alignments with no word-size bits in common. Consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
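The point of this example can be checked directly: in the alignment shown, the longest run of consecutive identical positions is shorter than the word size of 11, so no word in this alignment can seed it. A minimal Python sketch (the function name is invented for illustration):

```python
def longest_exact_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive identical aligned positions."""
    longest = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        longest = max(longest, current)
    return longest


TOP = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
BOTTOM = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

identical = sum(x == y for x, y in zip(TOP, BOTTOM))
print(f"{identical}/{len(TOP)} identical positions, "
      f"longest exact run = {longest_exact_run(TOP, BOTTOM)}")
```

So an alignment with roughly 89% identity can still be invisible to a word-size 11 search; a smaller word size would find it.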

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

Also "expect value" or "expectation".

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000


The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers

BLAST Parameters Exercises1 BLASTn vs BLASTx

Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)

Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises2 Low complexity filtering

Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box

Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database

What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches

Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA

Genomic sequence

ON

OFF

BLAST Parameters Exercises3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT

Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip

BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequenceshtml

Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section

There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

Flavours of BLAST

[Figure: query sequence, other operation, database sequences and relative speed for each flavour]

Example nucleotide query: ACGATAGATCCCATCCATAAAT
Example protein query: MQWCGYRWTYQGYRW
Example nucleotide database: ATGACGATAGATCCCATCAT, CGATAGGACCACCACA, GATAGACCAGGATACATAGGATAATTA, AGCTCGCTTGGCTCGATGGCT
Example protein database: MKJLSPWERSYTRGHYTWER, MGHTVNBZY, MKLPWRHGDBKJGMNDFD, MBKLRPIUHDFRTASGSLKWWRTVBN

BLASTn: nucleotide query against nucleotide database sequences. FAST
BLASTp: protein query against protein database sequences. FAST
BLASTx: nucleotide query (6 frame translation) against protein database sequences. SLOW
tBLASTn: protein query against nucleotide database sequences (6 frame translation). SLOWER
tBLASTx: nucleotide query (6 frame translation) against nucleotide database sequences (6 frame translation). HORRIBLY SLOW
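The "6 frame translation" step used by BLASTx, tBLASTn and tBLASTx can be sketched roughly as below. This is only an illustration, assuming the standard genetic code; the function names are made up for this example and are not part of BLAST.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate codon by codon; '*' marks a stop, 'X' an unrecognised codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the six conceptual translations keyed by frame (+1..+3, -1..-3)."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_complement[offset:])
    return frames


# Start of the Xenopus tropicalis cDNA used in the typical-output example.
frames = six_frame_translation(
    "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGC"
)
print(frames["+2"])
```

Every nucleotide sequence has to be translated six ways before any protein comparison can start, which is one reason the translated flavours are slower.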

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.

query: CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
1st database sequence: CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC

[Figure: the query is slid along the 1st database sequence and the matching bases are marked at each offset. The best alignment needs one gap (=) in the database sequence:]

    CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
         ||||||||||||||||||||||||| |||||||||||||||||||||||||
         CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences
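To see why the exhaustive approach is slow, here is a minimal sketch of it for the ungapped case only. It is an illustration of the brute-force idea, not how BLAST works, and the function name is invented for this example.

```python
def best_ungapped_offset(query, subject):
    """Slide the query along the subject and return (offset, matches)
    for the offset with the most identical aligned bases."""
    best = (0, -1)
    for offset in range(-len(query) + 1, len(subject)):
        matches = 0
        for i, base in enumerate(query):
            j = offset + i
            if 0 <= j < len(subject) and subject[j] == base:
                matches += 1
        if matches > best[1]:
            best = (offset, matches)
    return best


print(best_ungapped_offset("ACGT", "TTACGTTT"))
```

Every offset is tested against every base of the query, so the work grows with (query length x database length), and that is before gaps are even considered.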

BLAST WORDS and INDEXING

Database of sequences:

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
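The index-then-lookup idea on these two slides can be sketched as follows. This is a simplified illustration, assuming an 8-letter word (as in the slide's word list) and using a dictionary keyed by the word itself rather than by word number; the function names are invented here.

```python
from collections import defaultdict


def build_word_index(sequences, word_size):
    """Map each word to the (sequence number, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_number, sequence in enumerate(sequences, start=1):
        for start in range(len(sequence) - word_size + 1):
            word = sequence[start:start + word_size]
            index[word].append((seq_number, start + 1))  # 1-based, as on the slide
    return index


def find_seeds(query, index, word_size):
    """Return (query position, sequence number, database position) per shared word."""
    seeds = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for seq_number, position in index.get(word, []):
            seeds.append((start + 1, seq_number, position))
    return seeds


DATABASE = [
    "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    "TAAGCAAATTTAATTTTGTTTACATTTTC",
    "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
]
QUERY = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(DATABASE, word_size=8)  # done once, in advance
seeds = find_seeds(QUERY, index, word_size=8)    # done per query, very fast
print(seeds[:3])
```

The expensive step (building the index) is paid once per database, so each new query only costs a series of dictionary lookups.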

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.

    TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
    ||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
    TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
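A quick check of the example above, as a small sketch: it measures the longest run of identical bases in the alignment shown, which is what a word match at that offset would need. The function name is invented for this illustration.

```python
def longest_exact_run(seq_a, seq_b):
    """Longest stretch of identical bases in an ungapped, same-offset alignment."""
    longest = current = 0
    for a, b in zip(seq_a, seq_b):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest


query = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
subject = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

identical = sum(a == b for a, b in zip(query, subject))
print(identical, len(query), longest_exact_run(query, subject))
```

The two sequences are identical at 50 of 56 positions, yet the longest exact run is only 10 bases, so a word size of 11 gives no seed at this offset and the alignment would not be extended from it.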

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]
Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

Here is a 'typical' weak alignment from BLASTp:

Sequence 1: RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Sequence 2: NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

    RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
     F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
    NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

[Figure: a stretch of database sequence, CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG, with every chance match to the query marked below it: many for 'A', four for 'AC', one for 'ACG'.]

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
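The same back-of-envelope calculation as a short sketch. It is only the simplified model used on this slide (equal base frequencies, exact matches, no gaps), not BLAST's real statistics, and the function name is invented for the illustration.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Naive expectation: database size divided by alphabet_size ** query_length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded figure from the slide

for length in (1, 2, 3, 60):
    print(length, expected_chance_matches(REFSEQ_MRNA_LETTERS, length))
```

Note that 4^60 is about 1.3 x 10^36, so the exact figure for the 60-mer comes out a little below the slide's rounded 5.0 x 10^-28, which takes 4^60 as 10^36.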

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.

Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt.

Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).
We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
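The arithmetic in this answer can be checked with a few lines. This is a sketch assuming all four bases are equally likely; representing the motif as a list of allowed bases per position is just a convenience for this example.

```python
from fractions import Fraction


def motif_probability(motif_positions, alphabet_size=4):
    """Chance that a random position starts a match to the motif."""
    probability = Fraction(1)
    for allowed_bases in motif_positions:
        probability *= Fraction(len(set(allowed_bases)), alphabet_size)
    return probability


motif = ["A", "C", "C", "TG", "T", "A"]  # ACC[TG]TA
p = motif_probability(motif)
expected_in_10k = float(p * 10_000)
print(p, expected_in_10k)
```

The probability is 1/2048 per position, giving just under 5 chance occurrences in 10,000 nt, as on the slide.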

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq mRNA and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

[Figure: the same stretch of database sequence, CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG, with every chance match to the query marked below it.]

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
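Because the expectation scales linearly with the number of letters searched, an E-value can be rescaled from one database size to another. A small sketch using the database statistics quoted earlier; the function name is invented for this example.

```python
REFSEQ_MRNA_LETTERS = 503_566_580
NR_LETTERS = 14_601_814_750


def rescale_e_value(e_value, old_letters, new_letters):
    """Scale an E-value in proportion to the change in database size."""
    return e_value * new_letters / old_letters


ratio = NR_LETTERS / REFSEQ_MRNA_LETTERS
rescaled = rescale_e_value(5.0e-28, REFSEQ_MRNA_LETTERS, NR_LETTERS)
print(round(ratio), rescaled)
```

With the exact letter counts the ratio is about 29, which the slide rounds to ~30.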

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

    ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
    ||||||||||||    |||||| ||||||    |||||| ||||||
    ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
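One way to get a feel for the size of this effect is to count how many different sequences would 'match' the 12-mer if a single one-base gap is allowed. This sketch only counts single-base insertions and is not how BLAST scores gaps; the function name is invented here.

```python
def single_insertion_variants(sequence, alphabet="ACGT"):
    """All distinct sequences made by inserting one extra base anywhere."""
    variants = set()
    for position in range(len(sequence) + 1):
        for base in alphabet:
            variants.add(sequence[:position] + base + sequence[position:])
    return variants


variants = single_insertion_variants("ACGTACGTACGT")
print(len(variants))
```

So instead of one exact target there are 40 more sequences that align with just one gap, and each of them can also turn up by chance.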

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
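The dependence on match length is easiest to see on a log scale. The sketch below reuses the simplified first-principles model from earlier (exact matches, equal base frequencies), so it will not reproduce the illustrative E-values quoted on this slide; the function name is invented for the example.

```python
import math


def log10_expected_matches(database_letters, match_length, alphabet_size=4):
    """log10 of (database size / alphabet_size ** match_length)."""
    return math.log10(database_letters) - match_length * math.log10(alphabet_size)


full = log10_expected_matches(5.0e8, 500)
half = log10_expected_matches(5.0e8, 250)
print(round(full, 1), round(half, 1))
```

Halving the match length roughly halves the exponent, even though the biology behind the match has not changed.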

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

    CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
    ||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
    CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

    ACGTACGTACGTACGT
    ||| || | |||| ||
    ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
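Percent identity for the two alignments above can be checked directly. A minimal sketch for ungapped alignments of equal length; the function name is invented for this example.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that are identical (ungapped alignment)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


query1 = percent_identity(
    "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG",
    "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG",
)
query2 = percent_identity("ACGTACGTACGTACGT", "ACGCACCTTCGTAGGT")
print(query1, query2)
```

Both come out at 75%, which is exactly why identity on its own cannot separate a meaningful 60 nt match from a 16 nt match expected by chance.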

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

[Diagram: BLAST -> ORTHOLOGY]

The difficulty is because:

BLAST: Similarity + Probability

- biological knowledge
- nature of query sequence
- phylogenetic relationship
- match length, PI, size of database…

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

(larger = worse, smaller = better)

E-value 10^-5: fantasy
E-value 10^-10: borderline
E-value 10^-40: encouraging
E-value 10^-100: pretty good
E-value 0.0: can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

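Residue: residues it can be substituted by

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K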

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
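The two protein estimates can be reproduced with the same simple model as before. This is a sketch of the slide's hand calculation only (BLAST's real statistics come from the scoring matrix and gap costs); the function name is invented for this example.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # from the typical-output slide


def expected_protein_matches(database_letters, match_length, choices_per_position):
    """Database size divided by choices_per_position ** match_length."""
    return database_letters / choices_per_position ** match_length


exact = expected_protein_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_protein_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)
print(exact, with_substitutions)
```

Allowing roughly 3 acceptable residues per position (about 1 in 7) instead of exactly 1 in 20 raises the estimate by around nine orders of magnitude.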

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
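If you script your searches, the command line can be assembled programmatically. The sketch below only builds and prints the command; it does not run BLAST. It uses the -p, -i, -d, -e and -W options listed above, while the helper name and the query file name 'morpholino.fa' are hypothetical.

```python
import shlex


def build_blastall_command(program, query_file, database, e_value=None, word_size=None):
    """Assemble a blastall command line as a list of arguments."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if e_value is not None:
        command += ["-e", str(e_value)]  # max expected value
    if word_size is not None:
        command += ["-W", str(word_size)]  # word size
    return command


# Hypothetical query file; 'nr' stands for a locally installed, pre-indexed database.
command = build_blastall_command("blastn", "morpholino.fa", "nr", e_value=100, word_size=7)
print(shlex.join(command))
```

Any option that is left out simply keeps its default for the chosen flavour, as described above.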

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Screenshots: BLASTn results; BLASTx results]

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Screenshot: BLASTn - low complexity filtering OFF]

Results for Exercise 2A (ON)

[Screenshot: BLASTn - low complexity filtering ON]

Results for Exercise 2B

[Screenshots: filtering ON; filtering OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

    cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                      |||||||||||||||| ||||||||||||||
    Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Screenshots: filtering ON; filtering OFF]

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Screenshots: BLASTn; tBLASTx]

Results for Exercise 4 (ii)

[Screenshots: Mismatch penalty = -2 (default); Mismatch penalty = -1]

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Switch OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

How does it work?

The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good each match is.

[Figure: the query is slid along the 1st database sequence and the matching positions are marked at each offset; the best alignment, shown below, needs one gap in the query.]

query                  CCGAGCTTCTCATTGCTCTTCCTAACAGTG-TGATAGGCTAACCGTAATGGCGTTC
                            ||||||||||||||||||||||||| |||||||||||||||||||||||||
1st database sequence       CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT

This would actually be a very slow search process if implemented like this…

BLAST achieves its speed through two strategies:

- it takes a WORD based approach
- it pre-INDEXES database sequences
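To see why the 'try every alignment' approach is slow, here is a minimal sketch of it in Python. It is an illustration only, not how BLAST is implemented: it ignores gaps and simply counts matching letters at every possible offset. The function name is made up for this example.

```python
def best_ungapped_offset(query: str, subject: str) -> tuple[int, int]:
    """Slide the query along the subject and return (offset, matches)
    for the ungapped placement with the most matching letters.
    offset is the 0-based subject index lined up with query[0];
    it can be negative when the query overhangs the start."""
    best_offset, best_matches = 0, -1
    for offset in range(-(len(query) - 1), len(subject)):
        matches = 0
        for qi, qbase in enumerate(query):
            si = offset + qi
            if 0 <= si < len(subject) and subject[si] == qbase:
                matches += 1
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


result = best_ungapped_offset("ACGT", "TTACGTTT")
print(result)
```

Every database sequence costs roughly (query length x sequence length) comparisons this way, and that is before gaps are even considered, which is why a shortcut is needed for databases with hundreds of millions of letters.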

BLAST WORDS and INDEXING

Database of sequences

1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Build a position index of all words in the database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
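The indexing idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: it uses the 8-letter word itself as the lookup key rather than a word number, and the function name is invented for the example. The three sequences are the ones from the slide.

```python
from collections import defaultdict


def build_word_index(sequences: dict[int, str], word_size: int = 8) -> dict[str, list[tuple[int, int]]]:
    """Map every word in the database to the (sequence, position) pairs
    where it starts. Positions are 1-based, as on the slide."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for start in range(len(seq) - word_size + 1):
            word = seq[start:start + word_size]
            index[word].append((seq_id, start + 1))
    return dict(index)


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
index = build_word_index(database)
print(index["GACAAATC"])
print(index["ACAAATCC"])
```

The index is built once per database, so the cost is paid before anyone runs a search.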

Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words'

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
GACAAATC 33568
GACAAATG 33569
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE

position  word
1         14236
2         33658
3         07967

Index of database

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
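Putting the two tables together, the lookup step might be sketched like this. Again this is a hedged illustration rather than BLAST's real code: words are used directly as keys, only exact word matches count, and `find_seeds` is a name made up here.

```python
def find_seeds(query: str, database: dict[int, str], word_size: int = 8) -> list[tuple[int, int, int]]:
    """Return (query position, sequence number, sequence position) for every
    word shared between the query and the database. Positions are 1-based."""
    # Index the database words first (done once in a real search tool).
    index: dict[str, list[tuple[int, int]]] = {}
    for seq_id, seq in database.items():
        for start in range(len(seq) - word_size + 1):
            index.setdefault(seq[start:start + word_size], []).append((seq_id, start + 1))

    # Then each query word is a single dictionary lookup.
    seeds = []
    for q_start in range(len(query) - word_size + 1):
        word = query[q_start:q_start + word_size]
        for seq_id, position in index.get(word, []):
            seeds.append((q_start + 1, seq_id, position))
    return seeds


database = {
    1: "GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA",
    2: "TAAGCAAATTTAATTTTGTTTACATTTTC",
    3: "GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA",
}
seeds = find_seeds("AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA", database)
print(seeds[:3])
```

Query position 2 lines up with position 1 of database sequence 1, which is the relative offset the next stage starts from.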

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.
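The expansion step can be sketched as follows. This is a simplified, ungapped version under stated assumptions: the scoring values (+1 match, -3 mismatch) and the stop rule `x_drop` are illustrative choices of mine, and real BLAST also handles gaps. The function name is hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, word_size,
                match=1, mismatch=-3, x_drop=5):
    """Extend an exact word match in both directions without gaps.
    q_start and s_start are 0-based. Returns (score, q_from, q_to),
    where q_from:q_to is the aligned slice of the query."""
    def extend(qi, si, step):
        score = best_score = best_length = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best_score:
                best_score, best_length = score, length
            elif best_score - score > x_drop:
                break  # fallen too far below the best score seen so far
            qi += step
            si += step
        return best_score, best_length

    right_score, right_len = extend(q_start + word_size, s_start + word_size, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    total = word_size * match + left_score + right_score
    return total, q_start - left_len, q_start + word_size + right_len


query = "AGACAAATCCAAACC"
subject = "GACAAATCCAAACC"
result = extend_seed(query, subject, q_start=1, s_start=0, word_size=8)
print(result)
```

Only the handful of database sequences that share a word with the query ever reach this stage, which is where the speed comes from.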

But we can potentially miss alignments with no word-sized stretches in common: consider BLASTn with a default word size of 11.

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed…
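A quick check of that example, as a small Python sketch (the helper name is mine): the two sequences are highly similar overall, yet the longest run of consecutive identical letters is shorter than the word size of 11, so no seed word would be found.

```python
def longest_exact_run(top: str, bottom: str) -> int:
    """Length of the longest stretch of consecutive identical positions."""
    longest = current = 0
    for a, b in zip(top, bottom):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

identical = sum(a == b for a, b in zip(top, bottom))
run = longest_exact_run(top, bottom)
print(f"{identical}/{len(top)} identical, longest exact run = {run}")
```

With 50 of 56 positions identical this is clearly a good match, but a word size of 11 cannot seed it; a smaller word size would.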

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954378 sequences; 347895532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]
Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA
272619 sequences; 503566580 total letters (~5.0 x 10^8)

Database: NCBI nr
3329110 sequences; 14601814750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4800000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

[Figure: occurrences of each query marked under this example stretch of database sequence: 'A' occurs 14 times, 'AC' 4 times, 'ACG' once.]

Query = 'A'
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
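The same arithmetic as a short Python sketch. It assumes, as the slide does, that all four bases are equally likely and that only exact, ungapped matches count; the function name is my own.

```python
def expected_chance_matches(database_letters: float, query_length: int,
                            alphabet_size: int = 4) -> float:
    """Expected number of exact matches to a query of the given length
    in a random database of the given size."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8
for length in (1, 2, 3, 60):
    expected = expected_chance_matches(refseq_letters, length)
    print(f"{length:>2}-mer: {expected:.2g}")
```

Using the exact value of 4^60 (about 1.3 x 10^36) rather than the rounded 10^36 gives roughly 3.8 x 10^-28 for the 60-mer, the same order of magnitude as the slide's figure.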

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(theoretical value was 5.0e-28)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed

The two motifs independently have the same E-value. To allow either means we expect twice as many.
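The exercise arithmetic can be checked with a small Python sketch. The motif is described by how many bases are allowed at each position (1 for a fixed base, 2 for [TG]). Treating the optional extra base as 'any of the four bases' is my assumption for the second motif, since that is the reading under which both motifs come out equally likely; the function name is also mine.

```python
def expected_motif_hits(allowed_per_position: list[int], sequence_length: int) -> float:
    """Expected chance occurrences of a motif in a random sequence,
    given the number of allowed bases at each motif position."""
    spacing = 1.0
    for allowed in allowed_per_position:
        spacing *= 4 / allowed      # one hit every 'spacing' nt on average
    return sequence_length / spacing


# ACC[TG]TA: five fixed positions and one two-way choice
basic = expected_motif_hits([1, 1, 1, 2, 1, 1], 10_000)

# Assumed variant with one extra 'any base' position after [TG]
with_extra_base = expected_motif_hits([1, 1, 1, 2, 4, 1, 1], 10_000)

print(basic, with_extra_base, basic + with_extra_base)
```

A position that accepts any base does not change the spacing, so allowing either form roughly doubles the expected count to about 10 per 10k.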

E-values Effect of Database Size

The nr nucleotide database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

[Figure: occurrences of each query marked under this example stretch of database sequence: 'A' occurs 14 times, 'AC' 4 times, 'ACG' once.]

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

database  size                 E-value
RefSeq    5.0 x 10^8 letters   5.0e-28
nr        1.4 x 10^10 letters  1.4e-26

nr is ~30 x bigger.

The database was ~30 times bigger and so the E-value was ~30 times bigger.
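Because the expected number of chance matches is proportional to the number of letters searched, an E-value from one database can be rescaled to another. A minimal Python sketch of that proportionality (the function name is made up, and it only holds when nothing but the database size changes):

```python
def rescale_e_value(e_value: float, old_db_letters: float, new_db_letters: float) -> float:
    """Scale an E-value linearly with database size."""
    return e_value * new_db_letters / old_db_letters


refseq, nr = 5.0e8, 1.4e10

theoretical_nr = rescale_e_value(5.0e-28, refseq, nr)   # from the calculation above
observed_nr = rescale_e_value(2e-26, refseq, nr)        # from the real RefSeq search

print(f"size ratio: {nr / refseq:.0f}x")
print(f"theoretical: {theoretical_nr:.1e}, rescaled observed: {observed_nr:.1e}")
```

Rescaling the real RefSeq result (2e-26) gives about 5.6e-25, in line with the 6e-25 that the real nr search reported.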

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
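To get a feel for how many extra targets a single gap allows, here is a small Python sketch that lists every database string that would align to the query with exactly one one-base gap in the query. It covers only that one case (a single inserted base at an internal position), and the function name is mine.

```python
def one_insertion_variants(query: str, alphabet: str = "ACGT") -> set[str]:
    """All strings made by inserting one base at an internal position.
    Each one aligns to the query with a single one-base gap."""
    variants = set()
    for position in range(1, len(query)):        # between two query letters
        for base in alphabet:
            variants.add(query[:position] + base + query[position:])
    return variants


variants = one_insertion_variants("ACGTACGTACGT")
print(len(variants), "distinct gapped targets instead of 1 exact target")
print("ACGTACAGTACGT" in variants, "ACGTACCGTACGT" in variants)
```

There are 11 internal positions and 4 bases, so at most 44 candidates, fewer once identical strings are merged. Longer gaps and gaps on the other strand of the alignment add more again, which is the direction of the effect described above.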

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~5.0 x 10^-19

And Query2 only 16 nt long

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
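A short Python sketch confirms that the two alignments really do have the same percent identity, even though their E-values are wildly different. The helper name is my own; it expects two aligned rows of equal length.

```python
def percent_identity(top: str, bottom: str) -> float:
    """Percentage of aligned columns where both rows carry the same letter."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must be the same length")
    matches = sum(a == b and a != "-" for a, b in zip(top, bottom))
    return 100.0 * matches / len(top)


long_identity = percent_identity(
    "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG",
    "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG",
)
short_identity = percent_identity("ACGTACGTACGTACGT", "ACGCACCTTCGTAGGT")
print(long_identity, short_identity)
```

Identity says nothing about length: 45 matches out of 60 and 12 out of 16 give the same percentage, which is why it cannot replace the E-value on its own.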

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

[Diagram: BLAST contrasted with ORTHOLOGY. BLAST: Similarity + Probability. Other labels: biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database…]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <---------------------------------------------------------> smaller/better

E-values:  10^-5     10^-10      10^-40       10^-100      0.0
           fantasy   borderline  encouraging  pretty good  can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18 (20^20 is ~10^26)

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue: residues that can substitute for it (- means none)

A: S     C: -     F: LWY   G: -   I: LMV   L: IMFV   M: ILV   P: -   V: ILM   W: FY
N: DHS   Q: REHK  S: ANT   T: S   Y: HFW
H: NQY   K: RQE   R: QK
D: NE    E: DQK

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
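The two protein estimates can be reproduced with a few lines of Python. This is a back-of-envelope sketch only: the substitution lists are copied from the slide above, the 'effective alphabet' of 7 is the slide's rounding, and the function and variable names are mine. Real BLAST scores substitutions with a matrix rather than a yes/no list.

```python
def exact_match_e_value(database_letters: float, match_length: int,
                        effective_alphabet: float) -> float:
    """Chance matches expected when each position matches with
    probability 1 / effective_alphabet."""
    return database_letters / effective_alphabet ** match_length


# Acceptable substitutes per residue, as listed on the slide
substitutes = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV", "L": "IMFV",
    "M": "ILV", "P": "", "V": "ILM", "W": "FY", "N": "DHS", "Q": "REHK",
    "S": "ANT", "T": "S", "Y": "HFW", "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}
average_substitutes = sum(len(s) for s in substitutes.values()) / len(substitutes)

strict = exact_match_e_value(3.5e8, 20, 20)    # identical residues only
relaxed = exact_match_e_value(3.5e8, 20, 7)    # slide's ~1/7 chance per position

print(f"average substitutes per residue: {average_substitutes}")
print(f"strict: {strict:.1e}, relaxed: {relaxed:.1e}")
```

The relaxed figure lands within a few orders of magnitude of the reported 8e-04, whereas the strict one is out by more than fourteen; allowing gaps would push the estimate up further.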

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
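If you script your searches, the command line can be assembled programmatically. The Python sketch below only builds the argument list and prints it; it does not run anything. It uses just the option letters listed above, and the query file name and database name are placeholders I made up. It also assumes an installation that still provides the legacy blastall program.

```python
def blastall_command(program: str, query_file: str, database: str, **options) -> list[str]:
    """Build a blastall argument list. Extra options are passed by their
    single-letter names, e.g. e=100 becomes '-e 100'."""
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


# Hypothetical file and database names, for illustration only
cmd = blastall_command("blastn", "query.fa", "refseq_rna", e=100, W=7)
print(" ".join(cmd))
```

Anything not given explicitly is left to the program's own defaults for that BLAST flavour.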

BLAST Parameters Exercises
1 BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Screenshots: BLASTn results; BLASTx results]

BLAST Parameters Exercises
2 Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.

Results for Exercise 2A (OFF)

[Screenshot: BLASTn, low complexity filtering OFF]

Results for Exercise 2A (ON)

[Screenshot: BLASTn, low complexity filtering ON]

Results for Exercise 2B

[Screenshots: filter ON; filter OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Screenshots: filter ON; filter OFF]

BLAST Parameters Exercises
3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
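If you need to build such a query for several organisms, a small helper keeps the syntax consistent. This Python sketch only produces the text to paste into the box; the function name is my own, the organisms are just examples, and it uses only the [ORGN] tag and the OR/NOT operators mentioned above.

```python
def organism_query(include: list[str], exclude: tuple[str, ...] = ()) -> str:
    """Combine organism names into a single Entrez limit string."""
    query = " OR ".join(f"{name}[ORGN]" for name in include)
    if len(include) > 1:
        query = f"({query})"          # keep the OR group together
    for name in exclude:
        query += f" NOT {name}[ORGN]"
    return query


either_species = organism_query(["Xenopus tropicalis", "Danio rerio"])
print(either_species)
print(organism_query(["Drosophila melanogaster"]))
```

The first call gives a query matching either species; the second reproduces the single-organism form from the text.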

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Screenshots: BLASTn; tBLASTx]

Results for Exercise 4 (ii)

[Screenshots: mismatch penalty = -2 (default); mismatch penalty = -1]

BLAST Parameters Exercises
5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section (BLASTp) and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect 100

What difference does this make?

BLAST Parameters Exercises
6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section (BLASTn) and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity OFF
Expect 100
Word Size 7
Other advanced -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So Whatrsquos in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • lsquoOtherrsquo-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 46: Bioinformatics Workshop 1 Sequences and Similarity Searches

BLAST WORDS and INDEXING1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

2 TAAGCAAATTTAATTTTGTTTACATTTTC

3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Database of sequences

Numbered list of all possible lsquowordsrsquo

Build a position index of all words in the database

Analyse the Query Sequence gtquery AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

AAAAAAAA 00001 AAAAAAAC 00002 AAAAAAAG 00003 ACAAATCC 07967 ACAAATCC 07968 ACAAATCC 07979 GACAAATC 33568 GACAAATG 33569 TCCAAACC 64321 TCCAAACC 64322

QUERY SEQUENCE

Numbered list of all possible lsquowordsrsquo

position word

1 14236

2 33658

3 07967

Analyse QUERY SEQUENCE

sequence position word

1 1 33658

1 2 07967

1 3 16210

3 15 33568

3 16 07967

Index of database

Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position

Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker

The highest scoring alignments are reported

But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC||||||||| ||||| |||||||||| |||||||||| |||| ||||| ||||||| TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes neededhellip

BLAST ndashTypical OutputINPUT

gtpartial cDNA sequence Xenopus tropicalisCGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUTQuery= (311 letters) Database NCBI Protein Reference Sequences 954378 sequences 347895532 total letters

gtgi|41055060|ref|NP_9574201| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335)Expect = 6e-31 Identities = 7698 (77) Positives = 8298 (83) Gaps = 498 (4) Frame = +2

Query 26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196 MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK + VKT EKPVNK + Sbjct 1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59

Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310 SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97

When is a match significant

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a lsquotypicalrsquo weak alignment from BLASTp

In fact the sequences were randomly generated so there is no biologically significant alignmenthellip

E-values

The number of matches like the discovered match that I would expect to find by chance

An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip

An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip

Also ldquoexpect valueldquo or ldquoexpectationrdquo

E-values From First Principles

Some database statistics (23rd July 2005)

Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)

Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)

Notation

12e-35 = 12 x 10-35

48 x 106 = 4800000

We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above

Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do

Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (50 x 108) 4 = ~12 x 108

Expected number of matches = (50 x 108) (4x 4) = ~31 x 107

Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28

E-value = 50 x 10-28

E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 2e-26 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 6e-25 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 50e-28 -

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (N = any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If ACC[TG]NTA is also allowed:

The two motifs independently have the same E-value (the extra base can be anything, so it does not make a chance match any less likely). To allow either means we expect twice as many.
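
The same answer can be checked with a short Python sketch. It assumes equal base frequencies and independent positions, and it reads the optional extra base as 'N' (any base); the helper names and the tiny motif parser are illustrative only.

```python
def allowed_per_position(motif):
    """Number of allowed bases at each motif position.
    '[TG]' allows 2 bases, 'N' allows any of the 4, a plain letter allows 1."""
    counts = []
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            j = motif.index("]", i)
            counts.append(len(motif[i + 1:j]))
            i = j + 1
        else:
            counts.append(4 if motif[i] == "N" else 1)
            i += 1
    return counts


def expected_motif_hits(motif, seq_len):
    """Expected chance occurrences of the motif in seq_len random bases."""
    probability = 1.0
    for allowed in allowed_per_position(motif):
        probability *= allowed / 4
    return seq_len * probability


base = expected_motif_hits("ACC[TG]TA", 10_000)
with_extra = expected_motif_hits("ACC[TG]NTA", 10_000)
print(f"ACC[TG]TA alone:    {base:.1f} per 10k")
print(f"either motif:       {base + with_extra:.1f} per 10k")
```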

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Query 'A':   Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query 'AC':  Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query 'ACG': Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq   5.0 x 10^8 letters                    E-value = 5.0e-28
nr       1.4 x 10^10 letters (~30 x bigger)    E-value = 1.4e-26

The database was ~30 times bigger and so the E-value was ~30 times bigger.
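
As a quick check, the ratio of the two database sizes can be compared with the ratio of the two Expect values that BLAST actually reported. The letter counts are the figures quoted in the database statistics slide (23rd July 2005); the variable names are just for illustration.

```python
REFSEQ_LETTERS = 503_566_580      # NCBI RefSeq mRNA, total letters
NR_LETTERS = 14_601_814_750       # NCBI nr, total letters

REFSEQ_EXPECT = 2e-26             # Expect reported for the 60-mer vs RefSeq
NR_EXPECT = 6e-25                 # Expect reported for the same 60-mer vs nr

size_ratio = NR_LETTERS / REFSEQ_LETTERS
expect_ratio = NR_EXPECT / REFSEQ_EXPECT

print(f"database size ratio: {size_ratio:.1f}")
print(f"Expect value ratio:  {expect_ratio:.1f}")
```

Both ratios come out close to 30, which is the linear scaling with database size described above.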

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
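
To get a feel for how quickly the number of acceptable targets grows, the sketch below lists every distinct database string that would align to the 12-mer with exactly one single-base gap in the query. It is a toy illustration with a made-up function name; real BLAST scores gaps with penalties rather than simply allowing or forbidding them.

```python
def single_insertion_variants(query, alphabet="ACGT"):
    """All distinct strings made by inserting one base at an internal
    position of the query (each aligns to the query with one gap)."""
    return {
        query[:i] + base + query[i:]
        for i in range(1, len(query))
        for base in alphabet
    }


variants = single_insertion_variants("ACGTACGTACGT")
print(len(variants), "extra target strings with one gap allowed")
print("ACGTACAGTACGT" in variants, "ACGTACCGTACGT" in variants)
```

So instead of one exact target there are dozens of acceptable ones, before even considering longer gaps or gaps on the other strand of the alignment.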

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

BLAST: similarity + probability
ORTHOLOGY: biological knowledge

[Diagram labels: nature of query sequence; phylogenetic relationship; match length, PI, size of database...]

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-values (larger = worse, smaller = better):

10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         -
F         L W Y
G         -
I         L M V
L         I M F V
M         I L V
P         -
V         I L M
W         F Y
N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W
H         N Q Y
K         R Q E
R         Q K
D         N E
E         D Q K

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
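
The two protein estimates can be reproduced with the same naive model as for nucleotides. This is a sketch only: the "1 in 7" figure is the rough rule of thumb from the slide above, and the function name is hypothetical. Real BLAST derives its statistics from the scoring matrix, not from a flat per-position probability.

```python
def naive_evalue(db_letters, match_len, choices_per_position):
    """Database size divided by the number of equally likely
    alternatives at each of match_len positions."""
    return db_letters / (choices_per_position ** match_len)


PROTEIN_DB_LETTERS = 3.5e8   # ~347,895,532 letters in RefSeq protein

strict = naive_evalue(PROTEIN_DB_LETTERS, 20, 20)   # exact residues only
relaxed = naive_evalue(PROTEIN_DB_LETTERS, 20, 7)   # ~3 acceptable residues of 20

print(f"exact matching:        {strict:.1e}")
print(f"with substitutability: {relaxed:.1e}")
```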

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
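
If you script your searches, a command line like the one above can be assembled programmatically. The sketch below only builds and prints the command; it does not run it. The file name 'query.fa' and database name 'refseq_mrna' are placeholders, and the helper function is hypothetical. Only the -p, -i, -d, -e and -W flags listed above are used.

```python
import shlex


def build_blastall_command(program, query_file, database, **options):
    """Assemble a blastall command line as a list of arguments.
    Extra keyword arguments become single-letter flags, e.g. W=7 -> -W 7."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Placeholder file and database names, for illustration only.
command = build_blastall_command("blastn", "query.fa", "refseq_mrna", e=100, W=7)
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run on a machine where blastall and a pre-indexed database are actually installed.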

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON

Results for Exercise 2B
ON / OFF

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)   AGAAAAGAAGAAACATGGCAATGGATCAGAA
             |||||||||||||||| ||||||||||||||
Genomic      AGAAAAGAAGAAACAT-GCAATGGATCAGAA

ON / OFF

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


Analyse the Query Sequence

QUERY SEQUENCE

>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA

Numbered list of all possible 'words':

AAAAAAAA 00001
AAAAAAAC 00002
AAAAAAAG 00003
...
ACAAATCC 07967
ACAAATCC 07968
ACAAATCC 07979
...
GACAAATC 33568
GACAAATG 33569
...
TCCAAACC 64321
TCCAAACC 64322

Analyse QUERY SEQUENCE:

position  word
1         14236
2         33658
3         07967

Index of database:

sequence  position  word
1         1         33658
1         2         07967
1         3         16210
3         15        33568
3         16        07967
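
The idea of a pre-built word index can be sketched in Python as follows. This is a toy version: it stores the words themselves as dictionary keys rather than numbering them, and the two database sequences are invented so that the hits fall at positions like those in the table above. It is not how the real BLAST index is laid out.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence id, position)
    pairs where it occurs. Positions are 1-based, as on the slide."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for start in range(len(seq) - word_size + 1):
            index[seq[start:start + word_size]].append((seq_id, start + 1))
    return index


def word_hits(query, index, word_size):
    """Look up every query word in the index; returns a list of
    (query position, sequence id, sequence position) tuples."""
    hits = []
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        for seq_id, seq_pos in index.get(word, []):
            hits.append((start + 1, seq_id, seq_pos))
    return hits


# Invented toy database; 'seq1' and 'seq3' are hypothetical entries.
database = {
    "seq1": "GACAAATCCAAA",
    "seq3": "T" * 14 + "GACAAATCC",
}
query = "AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA"

index = build_word_index(database, word_size=8)
hits = word_hits(query, index, word_size=8)
for hit in hits:
    print(hit)
```

The index is built once per database, so each new query only costs a handful of dictionary lookups before any alignment is attempted.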

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
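
A few lines of Python make the point about the alignment above concrete: the two sequences are highly similar, yet their longest run of consecutive identical bases is shorter than a word size of 11, so no word would seed the alignment. The function name is made up for illustration.

```python
def longest_exact_run(seq_a, seq_b):
    """Length of the longest stretch of consecutive identical positions
    in two equal-length, ungapped aligned sequences."""
    best = run = 0
    for a, b in zip(seq_a, seq_b):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


top = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
bottom = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

identical = sum(a == b for a, b in zip(top, bottom))
print(f"identity: {identical}/{len(top)}")
print(f"longest exact run: {longest_exact_run(top, bottom)} (word size is 11)")
```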

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences
954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA


(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

Expand from Word Based Matches

We 'instantly' know which sequences in the database have at least one word-length match with our query sequence, and at what relative position.

Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.

The highest scoring alignments are reported.

But we can potentially miss alignments that have no word-sized stretch in common. Consider BLASTn with a default word size of 11:

TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC

Care is sometimes needed...
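To see why a word size of 11 would miss the pair of sequences above, here is a minimal Python sketch that measures the runs of consecutive identical bases when the two sequences are compared position by position. It only looks at this one in-register comparison, which is an assumption made for illustration; real BLAST indexes words at every offset, and the function name `match_runs` is hypothetical.

```python
def match_runs(seq_a, seq_b):
    """Return the lengths of consecutive runs of identical bases,
    comparing the two sequences position by position (no gaps)."""
    runs, current = [], 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == base_b:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# The two 56 nt sequences from the slide.
query = "TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC"
subject = "TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC"

WORD_SIZE = 11  # BLASTn default word size quoted on the slide

runs = match_runs(query, subject)
identities = sum(runs)
longest_run = max(runs)

print(f"identities: {identities}/{len(query)}")
print(f"longest exact run: {longest_run} (word size {WORD_SIZE})")
print("seeded by a word match:", longest_run >= WORD_SIZE)
```

Although most positions are identical, no run of matches reaches 11 bases, so no word would seed this alignment.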

BLAST - Typical Output

INPUT

>partial cDNA sequence Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAAGAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA

OUTPUT

Query= (311 letters)
Database: NCBI Protein Reference Sequences; 954,378 sequences; 347,895,532 total letters

>gi|41055060|ref|NP_957420.1| similar to guanine nucleotide-releasing factor 2 (specific for crk proto-oncogene) [Danio rerio]

Length=691

Score = 133 bits (335), Expect = 6e-31
Identities = 76/98 (77%), Positives = 82/98 (83%), Gaps = 4/98 (4%)
Frame = +2

Query  26   MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL  196
            MSGKIE K +SQ+SHLSSFTMKL  KFHSPKIKRTPSKKGK    +  VKT EKPVNK +
Sbjct  1    MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV  59

Query  197  SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA  310
            SRLEEQEK+VV+ALRYFKTIVDKM VD  VL MLPGSA
Sbjct  60   SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA  97

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

Here is a 'typical' weak alignment from BLASTp.

In fact, the sequences were randomly generated, so there is no biologically significant alignment...

E-values

The number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...

Also called "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28

E-value = ~5.0 x 10^-28
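The arithmetic on this slide can be sketched in a few lines of Python. This is only the first-principles estimate described above (database letters divided by 4 for each query base), not the statistics that BLAST itself uses; the function name `expected_chance_matches` is hypothetical. The 68 nt example sequence from the slide is used as a small sanity check of observed against expected counts.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """First-principles estimate: every extra query letter divides the
    expected number of exact chance matches by the alphabet size."""
    return db_letters / alphabet_size ** query_length


# 68 nt example sequence from the slide.
EXAMPLE = "CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG"

for query in ("A", "AC", "ACG"):
    observed = EXAMPLE.count(query)
    expected = expected_chance_matches(len(EXAMPLE), len(query))
    print(f"{query!r}: observed {observed}, expected by chance ~{expected:.1f}")

REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, as quoted on the slide

for length in (1, 2, 3, 60):
    estimate = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length}-mer against RefSeq mRNA: ~{estimate:.1e}")

e_value_60mer = expected_chance_matches(REFSEQ_MRNA_LETTERS, 60)
```

The 60-mer figure comes out a little below the slide's 5.0 x 10^-28 because 4^60 is closer to 1.3 x 10^36 than to the rounded 10^36 used above.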

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT

>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA with one extra base after the [TG].

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
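The same reasoning can be written as a short Python sketch. It assumes all four bases are equally likely and independent, as the answer above does, and it treats a bracketed class such as [TG] as "either base allowed"; the helper name `motif_probability` is hypothetical.

```python
import re


def motif_probability(motif):
    """Chance that a random position matches the motif, assuming the four
    bases are equally likely. A class like [TG] allows several bases."""
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        allowed_bases = len(token.strip("[]"))
        probability *= allowed_bases / 4
    return probability


MOTIF = "ACC[TG]TA"
PROMOTER_LENGTH = 10_000  # the '10k promoter sequence' of the exercise

p = motif_probability(MOTIF)
spacing = 1 / p                        # one chance site every this many nt
expected_sites = PROMOTER_LENGTH * p   # chance sites in 10k
expected_either = 2 * expected_sites   # a second, equally likely motif doubles it

print(f"one site expected every {spacing:.0f} nt")
print(f"expected in {PROMOTER_LENGTH} nt: ~{expected_sites:.1f}")
print(f"if an equally likely second motif is allowed: ~{expected_either:.1f}")
```

The doubling in the last line ignores any overlap between the two motifs, which is a simplification in the spirit of the slide.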

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (a 60-mer)

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26

E-value = ~1.4 x 10^-26

(was E-value = ~5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26   (~30 x bigger)

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
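As a quick check of the "~30 times" claim, this small Python sketch compares the ratio of the two database sizes quoted earlier (23rd July 2005 statistics) with the ratio of the two Expect values reported by the actual BLAST searches. The variable names are illustrative only.

```python
# Database sizes in letters, as quoted in the database statistics.
REFSEQ_MRNA_LETTERS = 503_566_580
NR_LETTERS = 14_601_814_750

# Expect values reported by BLAST for the same 60 nt query.
OBSERVED_E_REFSEQ = 2e-26
OBSERVED_E_NR = 6e-25

size_ratio = NR_LETTERS / REFSEQ_MRNA_LETTERS
observed_e_ratio = OBSERVED_E_NR / OBSERVED_E_REFSEQ

print(f"nr is ~{size_ratio:.0f} x bigger than RefSeq mRNA")
print(f"the reported E-value is ~{observed_e_ratio:.0f} x bigger")
```

Both ratios land close to 30, which is what we would expect if the E-value scales linearly with the number of letters searched.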

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT    ACGTACAGTACGT    ACGTACCGTACGT    etc.
||||||||||||    |||||| ||||||    |||||| ||||||
ACGTACGTACGT    ACGTAC-GTACGT    ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
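To get a feel for how many more database sequences could 'match' once a gap is allowed, the following Python sketch enumerates every distinct sequence that aligns to the 12-mer with exactly one single-base gap in the query (i.e. one extra base somewhere inside the database sequence). This is only a counting illustration under that single-gap assumption, not how BLAST computes gapped statistics; the function name is hypothetical.

```python
def single_insertion_variants(query, alphabet="ACGT"):
    """All distinct sequences made by inserting one base at an internal
    position of the query. Each one aligns to the query with a single gap."""
    variants = set()
    for position in range(1, len(query)):      # internal positions only
        for base in alphabet:
            variants.add(query[:position] + base + query[position:])
    return variants


QUERY = "ACGTACGTACGT"
variants = single_insertion_variants(QUERY)

print("exact target sequences: 1")
print(f"extra targets with one single-base gap: {len(variants)}")
```

Even this most restricted form of gapping multiplies the number of acceptable target sequences many times over, which is why the expected number of chance matches goes up.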

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
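The point that percent identity alone hides the length of the match can be checked with a small Python sketch. It computes identity for the two ungapped alignments above; the function name `percent_identity` is hypothetical and it assumes the two strings are already aligned and of equal length.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned positions carrying the same letter
    (the two strings must already be aligned and equally long)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Query1 (60 nt) and its match, from the slide.
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

# Query2 (16 nt) and its match, from the slide.
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)
identity2 = percent_identity(query2, match2)

print(f"Query1: {len(query1)} nt, {identity1:.0f}% identity")
print(f"Query2: {len(query2)} nt, {identity2:.0f}% identity")
```

Both alignments score the same identity, yet one is 60 nt long and the other only 16 nt, so identity by itself cannot tell them apart.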

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because

ORTHOLOGY
biological knowledge
nature of query sequence
phylogenetic relationship

BLAST Similarity + Probability
match length, PI, size of database...

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-values, from larger/worse to smaller/better:

10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~(3.5 x 10^8) / 10^26 = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue   Substitutes
A         S
C
F         L W Y
G
I         L M V
L         I M F V
M         I L V
P
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
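The two protein estimates can be reproduced with a short Python sketch. The substitution lists below are transcribed from the slide's table as read here, which is an assumption about how the flattened table should be split; the 1/7 figure is the slide's own rounding, and none of this is the BLOSUM or PAM scoring that BLAST really applies.

```python
# Residue -> acceptable substitutes, as read from the slide's table
# (an empty string means no substitutes are listed).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

mean_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)
print(f"average substitutes per residue: {mean_substitutes:.1f}")

DB_LETTERS = 347_895_532  # RefSeq protein letters quoted on the slide
QUERY_LENGTH = 20         # the 20 amino acid exact match

# Exact identity only: each position has a 1/20 chance of matching.
e_exact_only = DB_LETTERS / 20 ** QUERY_LENGTH

# Allowing substitutions: roughly a 1/7 chance per position (slide's figure).
e_with_substitutions = DB_LETTERS / 7 ** QUERY_LENGTH

print(f"identity only:      ~{e_exact_only:.1e}")
print(f"with substitutions: ~{e_with_substitutions:.1e}")
print("reported by BLAST:   8e-04")
```

Allowing substitutions closes most of the gap to the reported Expect value; the remainder is left to gapped alignments and the real scoring matrices, as the slide notes.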

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
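If you want to script searches rather than type them, a command line like the one above can be assembled in Python. This is a minimal sketch that only builds and prints the command; it does not run BLAST. The file name query.fa and database name refseq_rna are hypothetical placeholders, and the flags are the legacy blastall ones listed above (the newer BLAST+ programs use different option names).

```python
import shlex


def build_blastall_command(program, query_file, database, **options):
    """Assemble a legacy 'blastall' command line as a list of arguments.
    Extra single-letter options are passed as keywords, e.g. e=100, W=7."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical file and database names, for illustration only.
command = build_blastall_command(
    "blastn", "query.fa", "refseq_rna",
    e=100,   # max expected value
    W=7,     # word size
    q=-1,    # nucleotide mismatch penalty
)

print(shlex.join(command))
```

Keeping the command as a list of arguments, rather than one long string, makes it straightforward to hand to `subprocess.run` later if a local BLAST installation is available.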

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1: BLASTn and BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B: filtering ON and OFF

Results for Exercise 2C: filtering ON and OFF

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
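Building such a query string is simple enough to sketch in Python. The helper below is hypothetical and only joins organism names with the [ORGN] tag and a logical OR, following the fruit fly example above; the organism names passed in are whatever you choose to search.

```python
def organism_query(*organisms):
    """Build an Entrez 'limit by organism' string: each name is tagged
    with [ORGN] and several names are combined with a logical OR."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


single = organism_query("Drosophila melanogaster")
combined = organism_query("Mus musculus", "Rattus norvegicus")

print(single)
print(combined)
```

The resulting text is what would be pasted into the Limit by entrez query box on the BLAST web page.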

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison (it should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i): BLASTn and tBLASTx

Results for Exercise 4 (ii): Mismatch penalty = -2 (default) and Mismatch penalty = -1

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


ACGTACGTACGT

This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc|||||||||||| |||||| |||||| |||||| ||||||ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identityAt some levels this a good question

But consider two very different searches both of which give a 75 identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip

So whatrsquos the real problemBasically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

Are there any useful guidelines though at least for biological meaningfulness

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length PI size of databasehellip

Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog

largerworse smallerbetter

E-values 10-5 10-10 10-40 10-100 00

fantasy borderline encouraging

pretty good canrsquot get

better

But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip

Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18

But this is what we get if we run the blast at NCBI

Score = 431 bits (100) Expect = 8e-04 Identities = 2020 (100) Positives = 2020 (100) Gaps = 020 (0) Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCISbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand wavinghellip

Amino Acid Substitutions

A SC F LWYG I LMVL IMFVM ILVP V ILMW FY

N DHSQ REHKS ANTT SY HFW

H NQYK RQER QK

D NEE DQK

In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th

So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices

ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database

Did you find any lsquosignificantrsquo hits

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg

promptgt blastall ndashp ltblast typegt ndashi ltinput sequencegt ndashd ltdatabasegt

This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt

Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers

BLAST Parameters Exercises1 BLASTn vs BLASTx

Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)

Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises2 Low complexity filtering

Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box

Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database

What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches

Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA

Genomic sequence

ON

OFF

BLAST Parameters Exercises3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT

Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip

BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequenceshtml

Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section

There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

Page 50: Bioinformatics Workshop 1 Sequences and Similarity Searches

When is a match significant?

RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV

NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS

Here is a 'typical' weak alignment from BLASTp:

Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS

In fact the sequences were randomly generated, so there is no biologically significant alignment…

E-values

The E-value is the number of matches like the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies that I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also called the "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005):

Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation:

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example sequence (68 nt):

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (14 matches in the example sequence)

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC' (4 matches in the example sequence)

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG' (1 match in the example sequence)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' (a 60-mer)

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
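The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the first-principles model used here (database letters divided by 4 per query position); the function name is made up for illustration, and this is not how BLAST itself computes E-values.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of chance exact matches under the simple slide model."""
    return database_letters / alphabet_size ** query_length


# ~5.0 x 10^8 letters in RefSeq mRNA, rounded as on the slide
REFSEQ_MRNA_LETTERS = 5.0e8

for length in (1, 2, 3, 60):
    expected = expected_exact_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:>2}: ~{expected:.1e} expected matches")
```

Note that 4^60 is about 1.3 x 10^36, so the exact figure for the 60-mer comes out near 3.8 x 10^-28; the slide rounds 4^60 down to 10^36 to get 5.0 x 10^-28.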

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value was 5.0e-28.)

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (N = any base)

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt

Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10 kb by chance.

If the variant with the additional base (ACC[TG]NTA, N = any base) is also allowed:

The two motifs independently have the same E-value. Allowing either means we expect twice as many.
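The same answer can be worked out in code. The sketch below assumes equal base frequencies, scans one strand only and ignores edge effects; the helper names are invented for this example, and writing the optional extra base as `N` (any nucleotide) is an assumption about what the exercise intends.

```python
def allowed_bases_per_position(motif):
    """Number of acceptable bases at each motif position.

    Plain letters allow 1 base, [..] classes allow their members, N allows 4.
    """
    counts = []
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            close = motif.index("]", i)
            counts.append(len(set(motif[i + 1:close])))
            i = close + 1
        else:
            counts.append(4 if motif[i] == "N" else 1)
            i += 1
    return counts


def chance_spacing(motif):
    """Average number of nucleotides between chance occurrences."""
    spacing = 1.0
    for allowed in allowed_bases_per_position(motif):
        spacing *= 4 / allowed
    return spacing


def expected_sites(motif, sequence_length):
    return sequence_length / chance_spacing(motif)


basic = expected_sites("ACC[TG]TA", 10_000)
with_extra_base = expected_sites("ACC[TG]NTA", 10_000)
print(f"ACC[TG]TA : one site every {chance_spacing('ACC[TG]TA'):.0f} nt, ~{basic:.1f} per 10 kb")
print(f"either    : ~{basic + with_extra_base:.1f} per 10 kb")
```

Because `N` matches any base, the longer motif has the same chance spacing as the original, which is why allowing either form simply doubles the expected count.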

E-values Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had ~5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example sequence (68 nt):

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (14 matches in the example sequence)

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC' (4 matches in the example sequence)

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG' (1 match in the example sequence)

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' (a 60-mer)

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
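A quick numerical check of this scaling, using the database sizes quoted in the statistics slide and the two Expect values reported by the real BLAST runs. The function name is illustrative only; the linear scaling is the simple model from these slides, not a statement about BLAST's internal statistics.

```python
REFSEQ_LETTERS = 503_566_580   # NCBI RefSeq mRNA, 23rd July 2005
NR_LETTERS = 14_601_814_750    # NCBI nr, 23rd July 2005


def rescale_evalue(evalue, old_letters, new_letters):
    """Scale an E-value linearly with database size (simple slide model)."""
    return evalue * new_letters / old_letters


size_ratio = NR_LETTERS / REFSEQ_LETTERS
print(f"nr is ~{size_ratio:.0f}x larger than RefSeq mRNA")
print(f"theoretical 5.0e-28 rescaled to nr: {rescale_evalue(5.0e-28, REFSEQ_LETTERS, NR_LETTERS):.1e}")
print(f"observed BLAST ratio (6e-25 / 2e-26): ~{6e-25 / 2e-26:.0f}x")
```

Both the theoretical values and the values BLAST actually reported grow by roughly the same factor as the database.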

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.

E-values Effect of Query Length

BLAST a 500 nt sequence against a database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database with BLASTn:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
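To see why the exponent of the E-value tracks match length, the first-principles model from earlier can be written in log form. This is a sketch under that simple model only (exact matches, equal base frequencies); the function name is hypothetical and the absolute numbers will not reproduce the illustrative values quoted on this slide.

```python
import math


def log10_expected_matches(database_letters, match_length, alphabet_size=4):
    """log10 of the expected number of chance exact matches."""
    return math.log10(database_letters) - match_length * math.log10(alphabet_size)


for length in (500, 250):
    exponent = log10_expected_matches(5.0e8, length)
    print(f"match length {length}: E ~ 10^{exponent:.0f}")
```

Each extra matched base subtracts log10(4), about 0.6, from the exponent, so halving the match length roughly halves the exponent even though the biology of the match has not changed.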

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value of ~5.0 x 10^-19.

And Query2 was only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value of ~30.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
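The identity figures for the two example alignments are easy to confirm. The short sketch below counts matching columns in two ungapped, equal-length sequences; the function name is just for illustration.

```python
def percent_identity(query, subject):
    """Percentage of identical positions in an ungapped, equal-length alignment."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100 * matches / len(query)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Both alignments score 75% (45 of 60 and 12 of 16 positions), yet only the longer one is unlikely to turn up by chance; identity alone hides that difference.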

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

[Diagram: BLAST (similarity + probability) versus ORTHOLOGY; labels: biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database…]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

(larger = worse, smaller = better)

E-value    Verdict
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y

N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W

H         N Q Y
K         R Q E
R         Q K

D         N E
E         D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching', rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
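The two protein estimates can be reproduced directly. The sketch below uses the same simple model as the slides, with the number of effective choices per position as a parameter (20 for exact matching, about 7 once substitutions count as matches); the function and variable names are invented for this example.

```python
def expected_matches(database_letters, match_length, choices_per_position):
    """Expected chance matches when each position has 1/choices chance of matching."""
    return database_letters / choices_per_position ** match_length


REFSEQ_PROTEIN_LETTERS = 347_895_532

exact_only = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matching (1/20 per position): {exact_only:.1e}")
print(f"with substitutions (1/7 per position): {with_substitutions:.1e}")
print("BLAST reported Expect = 8e-04")
```

Allowing substitutions closes most of the gap to the reported value; the slide attributes the remainder to gapped alignments.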

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
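If you drive BLAST from a script rather than typing at the prompt, the same command can be assembled as an argument list. The sketch below only builds and prints the command; it does not run BLAST. The file name `query.fasta` and database name `refseq_rna` are placeholders, not names taken from the workshop.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the simplest blastall call: flavour (-p), query (-i), database (-d)."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# Placeholder file and database names, for illustration only
command = blastall_command("blastn", "query.fasta", "refseq_rna")
print(shlex.join(command))
```

Keeping the command as a list, rather than one long string, means it could later be handed to `subprocess.run` without worrying about shell quoting.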

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries, or pick from an organism list, to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Screenshots: BLASTn results; BLASTx results]

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Screenshot: BLASTn, low complexity filtering OFF]

Results for Exercise 2A (ON)

[Screenshot: BLASTn, low complexity filtering ON]

Results for Exercise 2B

[Screenshots: filter ON; filter OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Screenshots: filter ON; filter OFF]

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
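When several organisms are involved it can be less error-prone to assemble the query text programmatically and paste the result into the box. This is a small sketch of that idea; the helper name is hypothetical, and the species names chosen here for rat and mouse are an illustration rather than part of the exercise sheet.

```python
def organism_query(*organisms, operator="OR"):
    """Join organism names into an Entrez query using the [ORGN] field tag."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(f"{name}[ORGN]" for name in organisms)


print(organism_query("Drosophila melanogaster"))
print(organism_query("Mus musculus", "Rattus norvegicus"))
```

The second call prints `Mus musculus[ORGN] OR Rattus norvegicus[ORGN]`, which restricts a search to either species.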

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Screenshots: BLASTn; tBLASTx]

Results for Exercise 4 (ii)

[Screenshots: Mismatch penalty = -2 (default); Mismatch penalty = -1]

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 51: Bioinformatics Workshop 1 Sequences and Similarity Searches

E-values

The E-value is the number of matches as good as the discovered match that I would expect to find by chance.

An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…

An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…

Also called the "expect value" or "expectation".

E-values From First Principles

Some database statistics (23rd July 2005)

Database: NCBI RefSeq mRNA. 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)

Database: NCBI nr. 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)

Notation

1.2e-35 = 1.2 x 10^-35

4.8 x 10^6 = 4,800,000

We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.

Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.

Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
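The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the slide's naive model (every database position is an independent draw from 4 equally likely letters), not of what BLAST itself computes; the function name is made up for illustration. Note that 4^60 is really about 1.3 x 10^36, which the slide rounds to 10^36, so the code gives a slightly smaller number than 5.0 x 10^-28.

```python
# Naive chance-match model: each database letter is an independent,
# uniformly random nucleotide, so a k-mer matches at a given position
# with probability (1/4)**k.

REFSEQ_MRNA_LETTERS = 503_566_580  # from the database statistics slide


def expected_exact_matches(query_length: int, database_letters: int,
                           alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches for a query of this length."""
    return database_letters / alphabet_size ** query_length


for k in (1, 2, 3, 60):
    matches = expected_exact_matches(k, REFSEQ_MRNA_LETTERS)
    print(f"{k:>2}-mer: about {matches:.1e} chance matches")
```

Each extra query letter divides the expected count by 4, which is the whole argument of the slide.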

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 2e-26
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds

Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%)
Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

(The theoretical value for RefSeq mRNA was 5.0e-28.)
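For comparison, BLAST does not count letters the way we did; it converts the alignment score into the Expect value. The sketch below assumes the standard bit-score relation E = m x n x 2^-S (m = query length, n = database letters, S = bit score) and uses the raw lengths rather than the "effective" lengths NCBI applies, so it should only be expected to land within a small factor of the Expect values reported above. The function name is hypothetical.

```python
def expect_from_bit_score(bit_score: float, query_length: int,
                          database_letters: int) -> float:
    """Approximate Expect value, E = m * n * 2**(-S), using raw lengths."""
    return query_length * database_letters * 2.0 ** (-bit_score)


# Score = 119 bits for the 60 nt query, against the two databases on the slides.
refseq_expect = expect_from_bit_score(119, 60, 503_566_580)
nr_expect = expect_from_bit_score(119, 60, 14_601_814_750)

print(f"RefSeq mRNA: {refseq_expect:.1e}")
print(f"nr:          {nr_expect:.1e}")
```

Both results are of the same order as the 2e-26 and 6e-25 that NCBI reported, and they differ from each other only by the ratio of the database sizes.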

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. either ACC[TG]TA as written, or ACC[TG]TA with one extra base inserted after the [TG]?

E-value Exercise Answer

ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt

Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt

Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10 kb by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
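The answer above can be reproduced mechanically. This is a small sketch under the same assumptions as the slide (equal base frequencies, one strand, no edge effects). The helper name is invented, and writing the variant with the extra base as ACC[TG]NTA, with N meaning "any base", is my notation for the example rather than something taken from the slide.

```python
from fractions import Fraction


def site_probability(motif: str) -> Fraction:
    """Chance that a random position starts a match to a motif such as 'ACC[TG]TA'."""
    probability = Fraction(1)
    i = 0
    while i < len(motif):
        if motif[i] == "[":            # bracketed alternatives, e.g. [TG]
            close = motif.index("]", i)
            allowed = len(set(motif[i + 1:close]))
            i = close + 1
        elif motif[i] == "N":          # any base (assumed notation)
            allowed = 4
            i += 1
        else:                          # one fixed base
            allowed = 1
            i += 1
        probability *= Fraction(allowed, 4)
    return probability


promoter_length = 10_000
basic = promoter_length * site_probability("ACC[TG]TA")
with_extra_base = promoter_length * site_probability("ACC[TG]NTA")

print(f"ACC[TG]TA alone: about {float(basic):.1f} per 10 kb")
print(f"either form:     about {float(basic + with_extra_base):.1f} per 10 kb")
```

An unconstrained extra position costs nothing in probability, which is why the two forms are equally likely and allowing either doubles the expected count.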

E-values: Effect of Database Size

The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Query = 'A'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
     A   A     A   A       A           AA A     A A    AA    AA

Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         AC    AC                       AC              AC

Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
         ACG

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA…CTGATTCG' - 60-mer

Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value depends directly on database size.

BLAST the same sequence against each:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

(nr is ~30 x bigger)

The database was ~30 times bigger and so the E-value was ~30 times bigger.
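Because the expected count is proportional to the number of letters searched, an E-value from one database can be rescaled to another. A minimal sketch of that proportionality, with a made-up function name; it says nothing about how scores themselves behave, only about the database-size factor (which is also what the -z "effective database size" option listed in Part 4 is there to pin down).

```python
def rescale_e_value(e_value: float, old_db_letters: float,
                    new_db_letters: float) -> float:
    """Scale an E-value linearly with database size (same query, same match)."""
    return e_value * new_db_letters / old_db_letters


refseq_letters = 5.0e8
nr_letters = 1.4e10

e_in_nr = rescale_e_value(5.0e-28, refseq_letters, nr_letters)
print(f"size ratio: {nr_letters / refseq_letters:.0f}x")
print(f"E-value in nr: {e_in_nr:.1e}")
```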

Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
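To get a feel for how much "many more" might be, the sketch below enumerates every database string that would align to the 12-mer with exactly one extra base in the database sequence, as in the two gapped examples above. It is a deliberately crude illustration: it treats the gap as free, whereas real BLAST charges gap penalties, and it also counts an extra base at either end, which is not really a gap. The function name is made up.

```python
def single_insertion_variants(query: str, alphabet: str = "ACGT") -> set[str]:
    """All distinct strings formed by inserting one base anywhere in the query."""
    return {
        query[:i] + base + query[i:]
        for i in range(len(query) + 1)
        for base in alphabet
    }


query = "ACGTACGTACGT"
variants = single_insertion_variants(query)

# Chance of seeing the exact 12-mer at a given database position,
# versus the chance of seeing any one of the 13-letter variants there.
exact_chance = 0.25 ** len(query)
one_gap_chance = len(variants) * 0.25 ** (len(query) + 1)

print(f"{len(variants)} distinct one-insertion targets")
print(f"about {one_gap_chance / exact_chance:.0f}x as likely as the exact match")
```

Even one unpenalised gap opens up many more ways to "match", which is the direction of the effect the slide describes.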

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
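The point that identity on its own hides the length of the match is easy to check. A minimal sketch, with a made-up helper, that computes percent identity for the two ungapped alignments shown above:

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned positions that are identical (ungapped, equal length)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be the same length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Both alignments come out at the same percent identity, 45 of 60 positions and 12 of 16, yet their E-values differ by many orders of magnitude.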

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because (diagram: BLAST vs ORTHOLOGY)

BLAST: similarity + probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database…

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-value    Rule of thumb
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

(larger = worse, smaller = better)

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Residue   Can be substituted by
A         S
C         (none)
F         L W Y
G         (none)
I         L M V
L         I M F V
M         I L V
P         (none)
V         I L M
W         F Y
N         D H S
Q         R E H K
S         A N T
T         S
Y         H F W
H         N Q Y
K         R Q E
R         Q K
D         N E
E         D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
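The two protein estimates can be put side by side in code. This is again just the naive counting model from the slides, with an "effective alphabet" of 20 for exact matching and 7 for the slide's rough allowance for substitutions; it is not how BLOSUM or PAM scores are actually used, and the function name is invented. The exact-match figure comes out a little below the slide's ~3.5 x 10^-18 because 20^20 is slightly more than 10^26.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # from the Protein BLAST slide


def naive_protein_e_value(match_length: int, database_letters: int,
                          effective_alphabet: float) -> float:
    """Expected chance matches if each position 'matches' 1 time in effective_alphabet."""
    return database_letters / effective_alphabet ** match_length


exact_only = naive_protein_e_value(20, REFSEQ_PROTEIN_LETTERS, 20)
with_substitutions = naive_protein_e_value(20, REFSEQ_PROTEIN_LETTERS, 7)

print(f"exact residues only:   {exact_only:.1e}")
print(f"allowing substitutes:  {with_substitutions:.1e}")
print(f"difference:            {with_substitutions / exact_only:.1e}x")
```

Loosening what counts as a match at each of 20 positions moves the estimate by about nine orders of magnitude, which is why substitutability matters so much more than it first appears.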

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
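If you script your searches, it helps to assemble the command line from named settings rather than typing it out each time. The sketch below only builds and prints a blastall command using flags from the list above; it does not run anything. The query file name and the particular values are placeholders chosen for illustration, and writing the filter switch as "-F F" assumes the legacy blastall convention of T/F for yes/no options.

```python
import shlex


def blastall_command(program: str, query_file: str, database: str,
                     **options: object) -> list[str]:
    """Build a blastall argument list; keyword names are the single-letter flags."""
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


# Hypothetical short-oligo search: relaxed E-value cut-off, small word size,
# gentler mismatch penalty, low complexity filter switched off.
argv = blastall_command(
    "blastn", "query.fasta", "nr",
    e=100, W=7, q=-1, F="F",
)
print(shlex.join(argv))
```

Keeping the arguments as a list rather than one long string means the same list could later be handed to a process launcher without any quoting problems.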

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1: BLASTn and BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B: filtering ON and OFF

Results for Exercise 2C: filtering ON and OFF

There is a sequence error, an extra G at position 117 in the cDNA sequence:

cDNA     AGAAAAGAAGAAACATGGCAATGGATCAGAA
         |||||||||||||||| ||||||||||||||
Genomic  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
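If you build these limits often, a couple of tiny helpers keep the syntax straight. This sketch only assembles the text you would paste into the Entrez query box; the helper names are made up, and the second organism is just an example picked from elsewhere in this workshop.

```python
def organism_term(name: str) -> str:
    """Format an organism name as an Entrez organism limit, e.g. 'X y[ORGN]'."""
    return f"{name}[ORGN]"


def combine(operator: str, *terms: str) -> str:
    """Join Entrez terms with AND, OR or NOT, wrapped in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(terms) + ")"


fly_only = organism_term("Drosophila melanogaster")
fly_or_frog = combine("OR", fly_only, organism_term("Xenopus tropicalis"))

print(fly_only)
print(fly_or_frog)
```

Wrapping each combination in parentheses means the result can be nested inside a larger query without worrying about operator precedence.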

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i): BLASTn and tBLASTx

Results for Exercise 4 (ii): Mismatch penalty = -2 (default) and Mismatch penalty = -1

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


BLAST Parameters Exercises1 BLASTn vs BLASTx

Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)

Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises2 Low complexity filtering

Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box

Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database

What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches

Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA

Genomic sequence

ON

OFF

BLAST Parameters Exercises3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT

Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip

BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequenceshtml

Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section

There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END


Calculating an E-value

The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G and T. How many matches do we expect to find by chance?

Example stretch of sequence (68 nt):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’ (14 matches in the stretch above)
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8

Query = ‘AC’ (4 matches in the stretch above)
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7

Query = ‘ACG’ (1 match in the stretch above)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6

Query = ‘ACGTCGA…CTGATTCG’, a 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28

E-value = 5.0 x 10^-28
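The chance-match arithmetic above can be sketched in a few lines of Python. This is only an illustration of the first-principles argument, not how BLAST itself computes E-values: it assumes all four letters are equally likely and counts exact, ungapped matches only. The function names are made up for this sketch. Because 4^60 is nearer 1.3 x 10^36 than 10^36, the 60-mer figure comes out slightly below the rounded value on the slide.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact matches to a query in random sequence."""
    return database_letters / alphabet_size ** query_length


def count_occurrences(sequence, query):
    """Count exact (possibly overlapping) occurrences of query in sequence."""
    width = len(query)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == query
    )


# The 68 nt example stretch shown on the slide.
STRETCH = "CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG"
REFSEQ_MRNA_LETTERS = 5.0e8  # approximate size quoted on the slide

for query in ("A", "AC", "ACG"):
    observed = count_occurrences(STRETCH, query)
    expected_in_stretch = expected_chance_matches(len(STRETCH), len(query))
    expected_in_database = expected_chance_matches(REFSEQ_MRNA_LETTERS, len(query))
    print(f"{query}: observed {observed}, expected in stretch "
          f"{expected_in_stretch:.2f}, expected in database {expected_in_database:.1e}")

print(f"60-mer: {expected_chance_matches(REFSEQ_MRNA_LETTERS, 60):.1e}")
```

Even in a stretch this short, the observed counts sit close to the expected ones, which is the intuition the E-value builds on.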

E-values In Practice

So if I take a 60 nt sequence

>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA

and actually BLAST it against the RefSeq mRNA database, I get:

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

 Score = 119 bits (60), Expect = 2e-26
 Identities = 60/60 (100%), Gaps = 0/60 (0%)
 Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

What do I get if I BLAST it against the larger nr database?

BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060

 Score = 119 bits (60), Expect = 6e-25
 Identities = 60/60 (100%), Gaps = 0/60 (0%)
 Strand=Plus/Plus

Query  1     ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2977  ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA  3036

The theoretical value was 5.0e-28.

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

how many would you expect to find by chance in a 10 kb promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer

ACC[TG]TA

Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4 x 4 x 4 = 64 nt
Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64 x 2 = 128 nt
Expect ‘TA’ every 4 x 4 = 16 nt
Expect ‘ACC[TG]TA’ every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)

We would expect ~5 of these promoter sites every 10 kb by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
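The answer above can be checked with a small Python sketch. It assumes equal base frequencies and a simplified motif syntax in which square brackets list the bases allowed at one position; the function names are illustrative. The second motif in the usage lines, with any base allowed at the extra position, is an assumption made here so that both motifs have the same chance spacing, as the answer states.

```python
def parse_motif(motif):
    """Split a motif such as 'ACC[TG]TA' into one string of allowed bases per position."""
    positions = []
    index = 0
    while index < len(motif):
        if motif[index] == "[":
            close = motif.index("]", index)
            positions.append(motif[index + 1:close])
            index = close + 1
        else:
            positions.append(motif[index])
            index += 1
    return positions


def chance_spacing(motif, alphabet_size=4):
    """Average number of nucleotides between chance occurrences of the motif."""
    spacing = 1.0
    for allowed in parse_motif(motif):
        spacing *= alphabet_size / len(allowed)
    return spacing


def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of the given length."""
    return sequence_length / chance_spacing(motif)


site = "ACC[TG]TA"
site_with_extra_base = "ACC[TG][ACGT]TA"  # assumption: the optional base can be anything

print(chance_spacing(site))                 # one chance site per this many nt
print(round(expected_sites(site, 10_000), 1))
print(round(expected_sites(site, 10_000) + expected_sites(site_with_extra_base, 10_000), 1))
```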

E-values Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (it was RefSeq, with 5.0 x 10^8). There are 4 possible nucleotides: A, C, G and T. How many matches do we expect to find by chance?

Example stretch of sequence (68 nt):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = ‘A’ (14 matches in the stretch above)
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = ‘AC’ (4 matches in the stretch above)
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = ‘ACG’ (1 match in the stretch above)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = ‘ACGTCGA…CTGATTCG’, a 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
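The proportionality can be written down directly. The sketch below is a minimal illustration under the same simple model (E-value proportional to the number of letters searched); the function name is made up here, and the database sizes are the approximate figures quoted on the slides.

```python
REFSEQ_LETTERS = 5.0e8   # approximate, from the slides
NR_LETTERS = 1.4e10      # approximate, from the slides


def rescale_e_value(e_value, old_database_letters, new_database_letters):
    """Scale an E-value in proportion to the size of the database searched."""
    return e_value * new_database_letters / old_database_letters


# Theoretical value for RefSeq, rescaled to nr.
print(f"{rescale_e_value(5.0e-28, REFSEQ_LETTERS, NR_LETTERS):.1e}")

# Observed RefSeq value from the BLAST output, rescaled to nr.
print(f"{rescale_e_value(2e-26, REFSEQ_LETTERS, NR_LETTERS):.1e}")
```

Rescaling the observed RefSeq value of 2e-26 in the same way lands close to the 6e-25 that BLAST reported against nr.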

Why were the values different

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.

ACGTACGTACGT

This would now give us additional possible alignments that would meet our ‘match’ criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
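To get a feel for how quickly the number of acceptable targets grows, the sketch below enumerates every database sequence that would align to the 12-mer with a single one-base gap in the query. It is only an illustration: the function name is invented here, it considers one inserted base at internal positions only, and it ignores the gap penalties that real BLAST scoring applies.

```python
def single_insertion_variants(query, alphabet="ACGT"):
    """Return every distinct sequence made by inserting one base inside the query."""
    variants = set()
    for position in range(1, len(query)):  # internal positions only
        for base in alphabet:
            variants.add(query[:position] + base + query[position:])
    return variants


QUERY = "ACGTACGTACGT"
variants = single_insertion_variants(QUERY)

print(len(variants), "additional sequences align with one single-base gap")
print("ACGTACAGTACGT" in variants, "ACGTACCGTACGT" in variants)
```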

E-values Effect of Query Length

BLAST a 500 nt sequence against a database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

Get a full length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database (BLASTn):

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.

Why not just use identity

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
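Percent identity for the two alignments above can be computed as follows. This is a small sketch with illustrative function names; it treats the two rows of each alignment as equal-length strings and counts identical, non-gap columns.

```python
def match_line(aligned_a, aligned_b):
    """Build the '|' line shown between two aligned sequences."""
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(aligned_a, aligned_b)
    )


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both rows carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return 100.0 * match_line(aligned_a, aligned_b).count("|") / len(aligned_a)


QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
SUBJECT1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
SUBJECT2 = "ACGCACCTTCGTAGGT"

print(len(QUERY1), percent_identity(QUERY1, SUBJECT1))
print(len(QUERY2), percent_identity(QUERY2, SUBJECT2))
```

Both pairs come out at 75% identity, even though one alignment is 60 nt long and the other only 16 nt, which is why identity alone cannot stand in for the E-value.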

So what’s the real problem

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

BLAST: similarity + probability (match length, PI, size of database…)
ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-values, from larger/worse to smaller/better:

10^-5: fantasy
10^-10: borderline
10^-40: encouraging
10^-100: pretty good
0.0: can’t get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

 Score = 43.1 bits (100), Expect = 8e-04
 Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
 Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

Amino Acid Substitutions

Each residue, followed by the residues that can substitute for it:

A: S
C: none
F: L, W, Y
G: none
I: L, M, V
L: I, M, F, V
M: I, L, V
P: none
V: I, L, M
W: F, Y

N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W

H: N, Q, Y
K: R, Q, E
R: Q, K

D: N, E
E: D, Q, K

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ rather than 1/20th (roughly 3 acceptable residues out of 20).

So now we get

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
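The effect of substitutability on the estimate can be reproduced with a short sketch. The dictionary is a transcription of the substitution groups on the slide (residues with no listed substitute are given an empty string, which is an assumption of this sketch), and the ‘effective alphabet’ of 7 is the slide’s own rounding of 20 / 3. Real BLAST scores substitutions with a matrix such as BLOSUM62 rather than a yes/no table.

```python
# Residue -> residues that can substitute for it, as read from the slide.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

REFSEQ_PROTEIN_LETTERS = 347895532  # figure quoted on the slide


def exact_match_e_value(database_letters, match_length, effective_alphabet):
    """First-principles E-value for a match of the given length."""
    return database_letters / effective_alphabet ** match_length


average_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)
print(f"average substitutes per residue: {average_substitutes:.1f}")

print(f"identity only:      {exact_match_e_value(REFSEQ_PROTEIN_LETTERS, 20, 20):.1e}")
print(f"with substitutions: {exact_match_e_value(REFSEQ_PROTEIN_LETTERS, 20, 7):.1e}")
```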

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any ‘significant’ hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
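If you want to script such runs, the command line can be assembled programmatically. The sketch below only builds and prints the command; it does not run anything. It uses the ‘blastall’ program and the -p, -i, -d, -e and -W options named in this workshop; the helper name, the query file name and the database name are hypothetical placeholders.

```python
import shlex


def build_blastall_command(program, query_file, database, extra_options=None):
    """Assemble a blastall command line as a list of arguments."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command


# Hypothetical file and database names, for illustration only.
command = build_blastall_command(
    "blastn", "query.fasta", "my_database", {"-e": 1e-5, "-W": 11}
)
print(shlex.join(command))
```

Keeping the command as a list of arguments, rather than one long string, avoids quoting problems if it is later handed to a process runner.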

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 (figure panels: BLASTn, BLASTx)

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF) (figure: BLASTn – low complexity filtering OFF)

Results for Exercise 2A (ON) (figure: BLASTn – low complexity filtering ON)

Results for Exercise 2B (figure panels: ON, OFF)

Results for Exercise 2C (figure panels: ON, OFF)

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
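If you build such limits often, the text for the query box can be composed with a small helper. This is only a convenience sketch: the function name is made up, the second organism is an arbitrary example, and the [ORGN] tag and the AND/OR/NOT operators are the ones described above.

```python
def organism_limit(organisms, operator="OR"):
    """Combine organism names into one Entrez limit using a logical operator."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    terms = [f"{name}[ORGN]" for name in organisms]
    return f" {operator} ".join(terms)


print(organism_limit(["Drosophila melanogaster"]))
print(organism_limit(["Drosophila melanogaster", "Caenorhabditis elegans"]))
```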

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i) (figure panels: BLASTn, tBLASTx)

Results for Exercise 4 (ii) (figure panels: Mismatch penalty = -2 (default), Mismatch penalty = -1)

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?
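One way to see why word size matters for short or imperfect matches: BLASTn starts each alignment from an exact word shared by query and database sequence, so a target with no shared word of the chosen size is never examined. The sketch below is a simplified illustration of that seeding step, not BLAST’s actual implementation; the function name is invented, and the two sequences are the 16 nt, 75% identity pair from the ‘Why not just use identity’ slide.

```python
def shared_words(query, subject, word_size):
    """Return the exact words of the given size that occur in both sequences."""
    query_words = {
        query[i:i + word_size] for i in range(len(query) - word_size + 1)
    }
    subject_words = {
        subject[i:i + word_size] for i in range(len(subject) - word_size + 1)
    }
    return query_words & subject_words


QUERY = "ACGTACGTACGTACGT"
SUBJECT = "ACGCACCTTCGTAGGT"

for word_size in (4, 7, 11):
    words = shared_words(QUERY, SUBJECT, word_size)
    print(word_size, sorted(words) if words else "no seed, match would be missed")
```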

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What’s in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • ‘Other’-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST – Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what’s the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 54: Bioinformatics Workshop 1 Sequences and Similarity Searches

E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get

BLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 2e-26 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds

Length=6060 Score = 119 bits (60) Expect = 6e-25 Identities = 6060 (100) Gaps = 060 (0) Strand=PlusPlus

Query 1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036

theoretical value was 50e-28 -

E-value Exercise

Given a transcription factor binding site

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence

How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA

E-value Exercise AnswerACC[TG]TA

Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt

Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt

Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-valueTo allow either means we expect twice as many

E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (14 x 1010) 4 = ~12 x 108

Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107

Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26

E-value = 14 x 10-26

(was E-value = 50 x 10-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

14 x 1010 letters

50 x108 letters

30 x bigger

BLAST the same sequenceagainst each E-value = 14e-26

E-value = 50e-28

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of

20 x 10-26 - about 40x larger - why is this

Gapped alignments

If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc|||||||||||| |||||| |||||| |||||| ||||||ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identity?
At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
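To check that the two alignments really do have the same identity, here is a small sketch that counts matching positions in an ungapped alignment. The helper name `percent_identity` is made up for this example; the sequences are the ones shown on the slide.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percentage of identical positions in an ungapped, equal-length alignment."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length for an ungapped alignment")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
subject1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
subject2 = "ACGCACCTTCGTAGGT"

id1 = percent_identity(query1, subject1)
id2 = percent_identity(query2, subject2)
print(f"Query1: {len(query1)} nt, {id1:.0f}% identity")
print(f"Query2: {len(query2)} nt, {id2:.0f}% identity")
```

Identity alone cannot tell these two cases apart, because it ignores how long the match is. The E-value does take length into account.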

So what's the real problem?
Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

The difficulty is because:

BLAST: Similarity + Probability (match length, PI, size of database...)
ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship

Are there any useful guidelines though, at least for biological meaningfulness?

Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?

(larger = worse, smaller = better)

E-value    Interpretation
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Each residue is followed by the residues that can substitute for it ('-' = none listed):

A  S       C  -       F  LWY     G  -       I  LMV
L  IMFV    M  ILV     P  -       V  ILM     W  FY

N  DHS     Q  REHK    S  ANT     T  S       Y  HFW

H  NQY     K  RQE     R  QK

D  NE      E  DQK

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th (the residue itself plus 2 others is 3 out of 20, roughly 1 in 7).

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.
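The two estimates can be reproduced with a few lines of arithmetic. This is only a sketch of the back-of-envelope reasoning above, with invented variable names; it is not how BLAST itself computes an E-value. The database size is the figure quoted for the RefSeq protein database on the slide.

```python
DATABASE_LETTERS = 347_895_532  # RefSeq protein database size quoted on the slide
QUERY_LENGTH = 20               # the 20-residue query

# Exact matching: each position matches 1 residue in 20.
exact_only = DATABASE_LETTERS / 20 ** QUERY_LENGTH

# Allowing substitutions: each position 'matches' about 1 residue in 7.
with_substitutions = DATABASE_LETTERS / 7 ** QUERY_LENGTH

print(f"exact matches only:     {exact_only:.1e}")
print(f"allowing substitutions: {with_substitutions:.1e}")
print(f"increase:               {with_substitutions / exact_only:.1e}x")
```

Allowing about three acceptable residues per position instead of one raises the estimate by roughly nine orders of magnitude, which accounts for most of the gap between the naive calculation and the Expect value reported by NCBI.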

These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
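If you run many searches, it can help to assemble the command line in a script rather than typing it each time. The sketch below only builds and prints a command using flags from the list above; it does not run BLAST. The helper name `blastall_command`, the query file name and the database name are hypothetical, and the option values are just examples.

```python
import shlex


def blastall_command(program: str, query_file: str, database: str, **options) -> list[str]:
    """Assemble a `blastall` command line as an argument list.

    Each keyword in `options` is assumed to be one of the single-letter
    flags listed above (e.g. e=100, W=7); nothing is validated here.
    """
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args


# Hypothetical query file and database name, for illustration only.
cmd = blastall_command("blastn", "query.fasta", "nr", e=100, W=7, q=-1)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments means it can later be handed to something like `subprocess.run` without worrying about shell quoting.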

BLAST Parameters Exercises
1 BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2 Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B: filtering ON vs OFF

Results for Exercise 2C: filtering ON vs OFF

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises
3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
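Entrez limits are just text, so they can be assembled programmatically when the same restriction is reused across searches. The helpers below are a hypothetical sketch: the function names are invented, and only the [ORGN] field tag and the AND / OR / NOT operators mentioned above are used. The organisms are examples, not the answer to the exercise.

```python
def organism(name: str) -> str:
    """Format one organism as an Entrez term, e.g. 'Drosophila melanogaster[ORGN]'."""
    return f"{name}[ORGN]"


def any_of(*terms: str) -> str:
    """Combine terms with OR, bracketed so the group can be combined further."""
    return "(" + " OR ".join(terms) + ")"


def excluding(wanted: str, unwanted: str) -> str:
    """Keep sequences matching `wanted` but drop those matching `unwanted`."""
    return f"{wanted} NOT {unwanted}"


fly = organism("Drosophila melanogaster")
frog = organism("Xenopus tropicalis")

either = any_of(fly, frog)
fly_only = excluding(either, frog)

print(either)
print(fly_only)
```

The resulting string is what would be pasted into the Limit by entrez query box.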

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5 E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6 Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


E-value Exercise

Given a transcription factor binding site:

ACC[TG]TA

How many would you expect to find by chance in a 10k promoter sequence?

How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA

E-value Exercise Answer
ACC[TG]TA

Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)

We would expect ~5 of these promoter sites every 10k by chance.

If also ACC[TG]TAA allowed:

The two motifs independently have the same E-value. To allow either means we expect twice as many.
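The same sum can be written as a short function, which makes it easy to try other motifs. This is a sketch under the same simplifying assumptions as the answer above: all four bases are equally likely, only one strand is considered, and end effects are ignored. The function name `expected_hits` and the list representation of the motif are invented for this example.

```python
def expected_hits(motif_positions: list[str], sequence_length: int) -> float:
    """Expected chance occurrences of a motif in random sequence.

    `motif_positions` holds one string per motif position, listing the
    bases allowed there, e.g. "TG" for [TG]. Each position matches with
    probability (number of allowed bases) / 4.
    """
    probability = 1.0
    for allowed in motif_positions:
        probability *= len(allowed) / 4
    return sequence_length * probability


motif = ["A", "C", "C", "TG", "T", "A"]  # ACC[TG]TA
hits = expected_hits(motif, 10_000)

print(f"one occurrence every {10_000 / hits:.0f} nt")
print(f"expected in 10k: {hits:.1f}")
```

This reproduces the figures above: one occurrence every 2048 nt, or about 5 per 10k.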

E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example stretch of database sequence:

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A'
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC'
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG'
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)



E-value Exercise AnswerACC[TG]TA

Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt

Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt

Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance

If also ACC[TG]TAA allowed

The two motifs independently have the same E-valueTo allow either means we expect twice as many

E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA

Query = lsquoArsquo

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACrsquo

AC AC AC AC

CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGQuery = lsquoACGrsquo

ACG

Expected number of matches = (14 x 1010) 4 = ~12 x 108

Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107

Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106

Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer

Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26

E-value = 14 x 10-26

(was E-value = 50 x 10-28)

E-values Effect of Database Size

The E-value is simply dependent on database size

RefSeq

nr

14 x 1010 letters

50 x108 letters

30 x bigger

BLAST the same sequenceagainst each E-value = 14e-26

E-value = 50e-28

The database was ~30 times bigger and so the E-value was ~30 times bigger

Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of

20 x 10-26 - about 40x larger - why is this

Gapped alignments

If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches

ACGTACGTACGT

This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria

ACGTACGTACGT ACGTACAGTACGT ACGTACCGTACGT etc|||||||||||| |||||| |||||| |||||| ||||||ACGTACGTACGT ACGTAC-GTACGT ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger

E-values Effect of Query Length

Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length

database

BLAST 500 nt sequence against a database

BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

database

BLAST half of the same sequence against the same database

BLASTn

gtsequence ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

Get a match with sequence XYZ again but at an E-value = 50e-80

Why not just use identityAt some levels this a good question

But consider two very different searches both of which give a 75 identity match

Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19

And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines though, at least for biological meaningfulness?

The difficulty is because:

[Diagram: BLAST (similarity + probability) and ORTHOLOGY, with the labels: biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database...]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

E-values (larger/worse to smaller/better): 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Along the same scale: fantasy, borderline, encouraging, pretty good, can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its previous value (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04, Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%), Frame = +3

Query 3    SSSSFRAYRAALSEVEPPCI 62
           SSSSFRAYRAALSEVEPPCI
Sbjct 972  SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand waving...

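Amino Acid Substitutions

Residue: acceptable substitutes

A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K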

In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
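The 'about 2 others' figure can be checked against the substitution table on this slide. The sketch below is a hand transcription of that table into a Python dictionary (the name `SUBSTITUTES` is just for this example), so any transcription slip would change the numbers.

```python
# Acceptable substitutes per residue, transcribed from the slide's table.
SUBSTITUTES = {
    "A": "S",    "C": "",     "F": "LWY", "G": "",    "I": "LMV",
    "L": "IMFV", "M": "ILV",  "P": "",    "V": "ILM", "W": "FY",
    "N": "DHS",  "Q": "REHK", "S": "ANT", "T": "S",   "Y": "HFW",
    "H": "NQY",  "K": "RQE",  "R": "QK",  "D": "NE",  "E": "DQK",
}

# Average number of other residues that count as a 'match'.
average = sum(len(subs) for subs in SUBSTITUTES.values()) / len(SUBSTITUTES)

# Chance that a random residue 'matches': itself plus its substitutes, out of 20.
match_chance = (1 + average) / 20

print(f"average substitutes per residue: {average:.1f}")
print(f"chance of a 'match' per position: about 1 in {1 / match_chance:.1f}")
```

The average comes out a little above 2, so the per-position chance is in the same ballpark as the 1/7th used on the slide.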

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
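The two back-of-the-envelope protein E-values can be reproduced directly. This is a sketch of the slide's simplified 'first principles' model only (database letters divided by the per-position odds raised to the match length); it is not the statistics that BLAST actually uses, and the variable names are chosen for this example.

```python
db_letters = 347_895_532   # letters in the RefSeq protein database (from the slide)
match_length = 20          # length of the exact amino acid match

# Every position must be one specific residue out of 20.
exact_model = db_letters / 20 ** match_length

# Every position 'matches' with odds of roughly 1 in 7 (substitutions allowed).
substitution_model = db_letters / 7 ** match_length

print(f"exact residues only:   {exact_model:.1e}")
print(f"substitutions allowed: {substitution_model:.1e}")
```

The first value lands slightly below the slide's ~3.5 x 10^-18 because the slide rounds 20^20 to 10^26; the second reproduces the ~4.4 x 10^-9 figure.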

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?
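If you want more random sequences than the 20 provided, something like the following would do. It is a sketch that assumes all four nucleotides are equally likely, which may not be how the workshop file was generated; the function name `random_dna` and the FASTA name are made up here.

```python
import random


def random_dna(length, seed=None):
    """Return a random nucleotide sequence with equal A/C/G/T probabilities."""
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible
    return "".join(rng.choice("ACGT") for _ in range(length))


sequence = random_dna(300, seed=1)
# Print in FASTA style so it can be pasted straight into the BLAST box.
print(">random-1")
print(sequence)
```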

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
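To see what the word size does to a query, the sketch below simply lists the overlapping words of a given size. It is a simplification: the helper `words` is invented for this example, and real protein searches also consider similar 'neighbourhood' words rather than exact words only.

```python
def words(sequence, word_size):
    """All overlapping words of length `word_size` in `sequence`."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


query = "ACGTACGTACGTACGT"  # a 16 nt query

default_words = words(query, 11)  # blastn default word size
short_words = words(query, 7)     # smaller word size

print(len(default_words), "words of size 11")
print(len(short_words), "words of size 7")
```

A smaller word size gives more (and shorter) seeds, so short or imperfect matches have a better chance of being picked up, at the cost of a slower search.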

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
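If you run many searches it can help to assemble the command line in a script. The sketch below only builds and prints the command, using flags from the list above; the helper name `blastall_command`, the query file `query.fa` and the database name `nt` are placeholders for this example. Note that `blastall` belongs to the older NCBI toolkit, and the newer BLAST+ programs use different option names.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list; extra options map flag letter -> value."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args


# Hypothetical file and database names; e, W and q are flags from the slide.
cmd = blastall_command("blastn", "query.fa", "nt", e=100, W=7, q=-1)
print(shlex.join(cmd))
```

The argument list could be handed to `subprocess.run(cmd)` on a machine where the legacy program is installed.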

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn - low complexity filtering ON

Results for Exercise 2B (filter ON and OFF)

Results for Exercise 2C (filter ON and OFF)

There is a sequence error: an extra G at position 117 in the cDNA sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
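Spotting a single-base gap like the one in Exercise 2C by eye is easy in a short alignment, but less so in a long one. A small sketch for finding gap columns in a pairwise alignment follows; the helper `gap_columns` is hypothetical, and the columns it reports are counted within the alignment excerpt, not within the full cDNA.

```python
def gap_columns(row_a, row_b, gap="-"):
    """1-based alignment columns where either row contains a gap character."""
    return [i + 1
            for i, (a, b) in enumerate(zip(row_a, row_b))
            if a == gap or b == gap]


# The aligned excerpt from Exercise 2C.
cdna = "AGAAAAGAAGAAACATGGCAATGGATCAGAA"
genomic = "AGAAAAGAAGAAACAT-GCAATGGATCAGAA"

for column in gap_columns(cdna, genomic):
    print(f"column {column}: cDNA has '{cdna[column - 1]}', genomic has '{genomic[column - 1]}'")
```

A lone one-base insertion in a cDNA relative to the genome shifts the reading frame downstream, which is why annotators need to beware.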

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
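Entrez organism limits are just text, so they are easy to build programmatically. The sketch below joins organism names with a logical operator using the [ORGN] tag shown above; the function name `organism_query` and the choice of Mus musculus and Rattus norvegicus for 'mouse and rat' are assumptions for this example.

```python
def organism_query(*organisms, operator="OR"):
    """Join organism names into an Entrez query using the [ORGN] field tag."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(f"{name}[ORGN]" for name in organisms)


fly_only = organism_query("Drosophila melanogaster")
rodents = organism_query("Mus musculus", "Rattus norvegicus")

print(fly_only)
print(rodents)
```

The resulting string can be pasted into the 'Limit by entrez query' box.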

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END


E-values: Effect of Database Size

The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?

Example sequence:
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG

Query = 'A' (14 matches in the example sequence)
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9

Query = 'AC' (4 matches in the example sequence)
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8

Query = 'ACG' (1 match in the example sequence)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8

Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26

E-value = 1.4 x 10^-26

(was E-value = 5.0 x 10^-28)

E-values: Effect of Database Size

The E-value is simply dependent on database size.

BLAST the same sequence against each:

RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger), E-value = 1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
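The database-size effect drops straight out of the first-principles calculation. The sketch below implements only that simplified model (database letters divided by 4 to the power of the query length); the function name `expected_matches` is made up for this example and this is not how BLAST computes its reported E-values.

```python
def expected_matches(db_letters, query_length, alphabet_size=4):
    """Chance matches expected for an exact match of `query_length` letters."""
    return db_letters / alphabet_size ** query_length


refseq = expected_matches(5.0e8, 60)   # RefSeq mRNA database size from the slide
nr = expected_matches(1.4e10, 60)      # nr database size from the slide

print(f"RefSeq: {refseq:.1e}")
print(f"nr:     {nr:.1e}")
print(f"ratio:  {nr / refseq:.0f}x")
```

The absolute values come out slightly smaller than on the slide, which rounds 4^60 to 10^36, but the ratio between the two databases is the same: the E-value scales in direct proportion to database size.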




-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers

BLAST Parameters Exercises1 BLASTn vs BLASTx

Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)

Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises2 Low complexity filtering

Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box

Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database

What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches

Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA

Genomic sequence

ON

OFF

BLAST Parameters Exercises3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT

Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip

BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequenceshtml

Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section

There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So Whatrsquos in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • lsquoOtherrsquo-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Why were the values different?

Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?

Gapped alignments

If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.

This would now give us additional possible alignments that would meet our 'match' criteria:

ACGTACGTACGT   ACGTACAGTACGT   ACGTACCGTACGT   etc.
||||||||||||   |||||| ||||||   |||||| ||||||
ACGTACGTACGT   ACGTAC-GTACGT   ACGTAC-GTACGT

We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
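To make the point concrete, here is a minimal Python sketch (not part of the original slides) that enumerates every database sequence that would align to the example query with exactly one single-base gap in the query. The function name is illustrative only.

```python
# Enumerate every sequence that aligns to the query with exactly one
# single-base insertion (i.e. one gap on the query side of the alignment).
QUERY = "ACGTACGTACGT"
BASES = "ACGT"


def one_insertion_variants(query):
    """Return the set of distinct sequences made by inserting one base."""
    variants = set()
    for pos in range(len(query) + 1):   # every possible insertion point
        for base in BASES:              # every possible inserted base
            variants.add(query[:pos] + base + query[pos:])
    return variants


variants = one_insertion_variants(QUERY)
print(len(variants), "distinct sequences now also count as a match")
print("ACGTACAGTACGT" in variants, "ACGTACCGTACGT" in variants)
```

Instead of one exact target there are now dozens of acceptable targets, which is why allowing gaps pushes the expected number of chance matches (the E-value) up.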

E-values: Effect of Query Length

BLAST a 500 nt sequence against a database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCGATCGATCGCGCATCGATCGTCTAGATCGATCGCTCGCTGTGTAGATAGATCGGCGATAGA

BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160

BLAST half of the same sequence against the same database:

>sequence
ACTAGTCTAGCTAGACATCGATCGATGATGCTACACAGATAGACGATAGATAGTAAGTCG

BLASTn: get a match with sequence XYZ again, but at an E-value = 5.0e-80

Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value ~ 5.0 x 10^-19

And Query2 only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value ~ 30

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
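As a quick check that the two alignments really do share the same percent identity, here is a small Python sketch (not from the original slides; the function name is illustrative) that counts identical positions in each ungapped alignment.

```python
def percent_identity(query, subject):
    """Percentage of aligned positions that are identical (ungapped alignment)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


# The two alignments shown on the slide.
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
sbjct1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
sbjct2 = "ACGCACCTTCGTAGGT"

for name, q, s in [("Query1", query1, sbjct1), ("Query2", query2, sbjct2)]:
    print(f"{name}: {len(q)} nt, {percent_identity(q, s):.0f}% identity")
```

Both come out at 75%, yet only the longer match is unlikely to arise by chance; identity alone carries no information about match length or database size.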

So what's the real problem?

Basically, you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

The difficulty is because:

[Diagram: ORTHOLOGY versus BLAST (similarity + probability). Labels: biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database…]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger = worse, smaller = better

E-value    Verdict
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.

Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its previous value (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
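The back-of-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the slide's simplified argument (database size divided by the number of equally likely sequences of that length), not how BLAST itself computes E-values; the function and variable names are illustrative.

```python
DB_LETTERS = 347_895_532   # letters in the RefSeq protein database (from the slide)
MATCH_LENGTH = 20          # length of the exact match, in residues
ALPHABET_SIZE = 20         # number of different amino acids


def naive_evalue(db_letters, match_length, choices_per_position):
    """Expected chance matches if every position has equally likely choices."""
    return db_letters / choices_per_position ** match_length


estimate = naive_evalue(DB_LETTERS, MATCH_LENGTH, ALPHABET_SIZE)
print(f"naive E-value estimate: {estimate:.1e}")
```

Using the exact letter count rather than the rounded 3.5 x 10^8 gives a slightly smaller figure, but still of the order of 10^-18.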

But this is what we get if we run the blast at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query 3    SSSSFRAYRAALSEVEPPCI  62
           SSSSFRAYRAALSEVEPPCI
Sbjct 972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving…

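Amino Acid Substitutions

Residue: possible substitutes (- means none listed)

A: S
C: -
F: L W Y
G: -
I: L M V
L: I M F V
M: I L V
P: -
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K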

In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (the residue itself plus about 2 substitutes: 3 in 20) rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
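The "about 2 others" and "about 1/7th" figures can be checked with a short Python sketch. The dictionary below is a reading of the substitution slide as residue to substitutes, with C, G and P taken to have none; that reading is an assumption, since the slide layout was flattened, although the fact that the relation comes out symmetric supports it. All names here are illustrative.

```python
# Substitution table as read from the slide (assumed parse; see lead-in).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

# Sanity check: if X can replace Y, then Y should be able to replace X.
symmetric = all(a in SUBSTITUTES[b] for a, subs in SUBSTITUTES.items() for b in subs)

average = sum(len(subs) for subs in SUBSTITUTES.values()) / len(SUBSTITUTES)
accepted = 1 + round(average)               # the residue itself plus ~2 others
chance = accepted / len(SUBSTITUTES)        # 3 in 20, roughly 1 in 7

adjusted = 347_895_532 / 7 ** 20            # the slide's 1/7-per-position estimate
print(f"symmetric: {symmetric}, average substitutes: {average:.1f}")
print(f"chance of a 'match' per position: {chance:.2f} (1/7 = {1/7:.2f})")
print(f"adjusted E-value estimate: {adjusted:.1e}")
```

The adjusted estimate lands around 4 x 10^-9, many orders of magnitude larger than the exact-match estimate, which is the direction of the correction the slide describes.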

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
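If you want to drive the command line from a script, the basic call above can be assembled as an argument list. This is a minimal Python sketch using only the three options named on the slide; the file name `query.fasta` and the helper name are hypothetical, and nothing is actually executed here.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the argument list for the basic blastall call shown above."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# Hypothetical inputs: a query file called query.fasta and the nr database.
cmd = blastall_command("blastx", "query.fasta", "nr")
print(shlex.join(cmd))
```

On a machine where blastall and a formatted database are installed, the same list could be handed to `subprocess.run(cmd)`.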

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
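As an illustration of how these options stack onto the basic command, here is a small Python sketch that appends any extra single-letter options to the call. The helper name and the query file name `morpholino.fasta` are hypothetical; the option letters and example values (-e 100, -W 7, -q -1) are taken from the slides, and the command is only printed, not run.

```python
def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list, appending extra single-letter options."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():   # keyword order is preserved
        cmd += [f"-{flag}", str(value)]
    return cmd


# Hypothetical query file; option values as used in the word-size exercise.
cmd = blastall_command("blastn", "morpholino.fasta", "nr", e=100, W=7, q=-1)
print(" ".join(cmd))
```

Any option left out keeps the default for the chosen blast flavour, so only the values you want to change need to be passed.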

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Screenshots: BLASTn results; BLASTx results]

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B

[Screenshots: filter ON; filter OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Screenshots: filter ON; filter OFF]

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL: Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…

Results for Exercise 4 (i)

[Screenshots: BLASTn; tBLASTx]

Results for Exercise 4 (ii)

[Screenshots: mismatch penalty = -2 (default); mismatch penalty = -1]

BLAST Parameters Exercises

5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST: Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are …
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So Whatrsquos in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • lsquoOtherrsquo-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 61: Bioinformatics Workshop 1 Sequences and Similarity Searches

Why not just use identity?

At some levels this is a good question.

But consider two very different searches, both of which give a 75% identity match.

Query1 was 60 nt long:

CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG

Which would have an E-value of ~5.0 x 10^-19.

And Query2 was only 16 nt long:

ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT

Which would have an E-value of ~3.0.

And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
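A minimal Python sketch of the point being made here: percent identity takes no account of match length, so the 60 nt and 16 nt alignments above score exactly the same. The function name percent_identity is an illustrative choice and not part of BLAST.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percentage of aligned positions at which the two sequences agree."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for q, s in zip(query, subject) if q == s)
    return 100.0 * matches / len(query)


# The two ungapped alignments from the slide.
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
subject1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
subject2 = "ACGCACCTTCGTAGGT"

for name, q, s in [("Query1", query1, subject1), ("Query2", query2, subject2)]:
    print(f"{name}: {len(q)} nt, {percent_identity(q, s):.0f}% identity")
```

Both lines report 75% identity, which is why identity alone cannot separate a long, convincing match from a short one that could easily arise by chance.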

So what's the real problem?

Basically you are usually trying to answer the question:

Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?

Are there any useful guidelines, though, at least for biological meaningfulness?

The difficulty is because:

[Diagram relating BLAST (similarity + probability) to ORTHOLOGY; labels: biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database...]

Rules of Thumb

How good does an E-value have to be before we might even think we have an ortholog?

larger/worse <----------------------------------------------> smaller/better

E-value:  10^-5     10^-10      10^-40       10^-100      0.0
          fantasy   borderline  encouraging  pretty good  can't get better

But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...

Protein BLAST

It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?

If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its previous value (there are 20 different amino acids). And as there are 347,895,532 letters in that database:

E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
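The first-principles estimate above can be checked with a few lines of Python. This is only a sketch of the slide's back-of-envelope argument (database letters divided by the number of equally likely sequences of that length), not the statistics BLAST actually uses; the function name naive_evalue is made up for illustration.

```python
def naive_evalue(db_letters: int, alphabet_size: int, match_length: int) -> float:
    """Expected number of exact chance matches, assuming every letter is
    equally likely and every database position is a possible start."""
    return db_letters / alphabet_size ** match_length


REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted on the slide

e_value = naive_evalue(REFSEQ_PROTEIN_LETTERS, alphabet_size=20, match_length=20)
print(f"naive E-value for an exact 20-residue match: {e_value:.1e}")
```

Run as written this gives a value of about 3 x 10^-18, in line with the estimate above and many orders of magnitude smaller than what BLAST itself reports.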

But this is what we get if we run the BLAST at NCBI:

Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3

Query  3    SSSSFRAYRAALSEVEPPCI  62
            SSSSFRAYRAALSEVEPPCI
Sbjct  972  SSSSFRAYRAALSEVEPPCI  991

Really too big a discrepancy to easily explain with hand waving...

Amino Acid Substitutions

Residue  Substitutes
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K

In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids count as a match at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.

So now we get:

E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9

which is much closer to the actual BLAST value.

These substitutabilities are dealt with by the BLOSUM and PAM matrices.
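The revised estimate can be reproduced from the substitution table itself. The sketch below transcribes that table into a Python dictionary (the name SUBSTITUTES is an illustrative choice), derives the "about 2 others" average, and redoes the division with 7 in place of 20. It is a rough model of the slide's reasoning only; real BLAST scores every residue pair with a matrix such as BLOSUM62 rather than a yes/no table.

```python
# Residue -> residues the slide's table treats as acceptable substitutes.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

# Average number of substitutes per residue (46 in total over 20 residues).
mean_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)

# The residue itself plus about 2 substitutes: 3 acceptable letters out of 20,
# i.e. roughly a 1-in-7 chance of a 'match' at each position.
acceptable = 1 + round(mean_substitutes)
one_in = round(len(SUBSTITUTES) / acceptable)

REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted on the slide
e_value = REFSEQ_PROTEIN_LETTERS / one_in ** 20

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"chance of a 'match' per position: 1 in {one_in}")
print(f"revised E-value for 20 residues: {e_value:.1e}")
```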

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?

Part 4 Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

Not All Parameters are ...

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
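As a rough illustration of what the word size controls, the Python sketch below cuts a query into every overlapping word of length W, the starting point for BLAST's word-based matching. The words helper and the 25-nt sequence are made up for illustration; the protein is the 20-residue example used earlier. Real BLAST also considers similar 'neighbourhood' words for proteins, which this ignores.

```python
def words(sequence: str, word_size: int) -> list:
    """Every overlapping word of length word_size in the sequence."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


nucleotide_query = "ATGGCAATGGATCAGAAAGTTCCTA"  # hypothetical 25-nt oligo
protein_query = "SSSSFRAYRAALSEVEPPCI"           # 20-residue example

print(len(words(nucleotide_query, 11)), "words at -W 11")
print(len(words(nucleotide_query, 7)), "words at -W 7")
print(len(words(protein_query, 3)), "words at -W 3")
print(words(protein_query, 3)[:4])
```

A sequence of length L yields L - W + 1 words, so a smaller word size gives a short query more, and less demanding, chances to seed a match.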

There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries, or pick from an organism list, to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
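To show how these options combine with the basic command, here is a small Python sketch that assembles a blastall command line from the three required options plus any extras. The helper blastall_command and the file name morpholino.fasta are hypothetical; the flags are the ones listed above, and the values are the settings used in the word size exercise. The sketch only builds and prints the command; it does not run BLAST.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list: required -p/-i/-d, then extra flags."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Filter off, Expect 100, word size 7, mismatch penalty -1.
command = blastall_command(
    "blastn", "morpholino.fasta", "nr",  # file name is an assumption
    F="F", e=100, W=7, q=-1,
)
print(shlex.join(command))
```

On a machine with the legacy blastall program and a formatted database installed, the same list could be handed to subprocess.run; any option left out simply takes its default.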

BLAST Parameters Exercises

1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Figure: BLASTn results; BLASTx results]

BLAST Parameters Exercises

2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Figure: BLASTn - low complexity filtering OFF]

Results for Exercise 2A (ON)

[Figure: BLASTn - low complexity filtering ON]

Results for Exercise 2B

[Figure: results with filtering ON and OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Figure: results with filtering ON and OFF]

BLAST Parameters Exercises

3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
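The query text follows a simple pattern, so it can be put together mechanically. The Python sketch below builds an organism-restricted Entrez query string of the kind described above, joining terms with OR and optionally excluding one with NOT. The helper names are invented for illustration, and the species other than fruit fly are arbitrary examples; the resulting string is simply what you would paste into the "Limit by entrez query" box.

```python
def organism_query(*organisms: str) -> str:
    """Match sequences from any of the named organisms."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


def excluding(query: str, organism: str) -> str:
    """Restrict an existing query by removing one organism."""
    return f"({query}) NOT {organism}[ORGN]"


print(organism_query("Drosophila melanogaster"))
print(organism_query("Drosophila melanogaster", "Xenopus tropicalis"))
print(excluding(organism_query("Drosophila", "Xenopus"), "Xenopus laevis"))
```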

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises

4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...

Results for Exercise 4 (i)

[Figure: BLASTn and tBLASTx results]

Results for Exercise 4 (ii)

[Figure: results with mismatch penalty = -2 (default) and mismatch penalty = -1]

BLAST Parameters Exercises

5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises

6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • Theoretical/Predicted Sequences
  • Sequences for a model organism
  • So What's in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • 'Other'-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST - Typical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So what's the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are ...
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 62: Bioinformatics Workshop 1 Sequences and Similarity Searches

So whatrsquos the real problemBasically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

Are there any useful guidelines though at least for biological meaningfulness

Basically you are usually trying to answer the question

Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism

BLAST

The difficulty is because

ORTHOLOGY

BLAST Similarity + Probability

biological knowledge

nature of query sequence

phylogenetic relationship

match length PI size of databasehellip

Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog

largerworse smallerbetter

E-values 10-5 10-10 10-40 10-100 00

fantasy borderline encouraging

pretty good canrsquot get

better

But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip

Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18

But this is what we get if we run the blast at NCBI

Score = 431 bits (100) Expect = 8e-04 Identities = 2020 (100) Positives = 2020 (100) Gaps = 020 (0) Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCISbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand wavinghellip

Amino Acid Substitutions

A SC F LWYG I LMVL IMFVM ILVP V ILMW FY

N DHSQ REHKS ANTT SY HFW

H NQYK RQER QK

D NEE DQK

In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th

So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices

ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database

Did you find any lsquosignificantrsquo hits

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg

promptgt blastall ndashp ltblast typegt ndashi ltinput sequencegt ndashd ltdatabasegt

This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt

Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers

BLAST Parameters Exercises1 BLASTn vs BLASTx

Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence

Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box

Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)

Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx

How might the different results help us view the presence of this gene in other vertebrates

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises2 Low complexity filtering

Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat

Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box

Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database

What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches

Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware

Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF

Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA

Genomic sequence

ON

OFF

BLAST Parameters Exercises3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT

Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip

BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequenceshtml

Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section

There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises5 E-Value maximum for reporting

Open the file example-sequenceshtml

Copy the sequence gtsumo-binding-motif and go to the NCBI BLAST Home PageGo to the PROTEIN BLAST section BLASTp and paste the sequence

Run the search with the default values

Now re-run the search setting the maximum E-value in the box

Expect 100

What difference does this make

BLAST Parameters Exercises6 Word Size

Open the file example-sequenceshtml

Copy the sequence gtmorpholino and go to the NCBI BLAST Home PageGo to the NUCLEOTIDE BLAST section BLASTn and paste the sequence

Check OFF the low complexity filter and then run the search

Now re-run the search setting the following parameters

Low complexity OFFExpect 100Word Size 7Other advanced -q-1 (mismatch penalty -1 instead of default -3)

What difference does this make

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So Whatrsquos in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • lsquoOtherrsquo-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 63: Bioinformatics Workshop 1 Sequences and Similarity Searches

Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog

largerworse smallerbetter

E-values 10-5 10-10 10-40 10-100 00

fantasy borderline encouraging

pretty good canrsquot get

better

But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip

Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently

If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18

But this is what we get if we run the blast at NCBI

Score = 431 bits (100) Expect = 8e-04 Identities = 2020 (100) Positives = 2020 (100) Gaps = 020 (0) Frame = +3

Query 3 SSSSFRAYRAALSEVEPPCI 62 SSSSFRAYRAALSEVEPPCISbjct 972 SSSSFRAYRAALSEVEPPCI 991

Really too big a discrepancy to easily explain with hand wavinghellip

Amino Acid Substitutions

A SC F LWYG I LMVL IMFVM ILVP V ILMW FY

N DHSQ REHKS ANTT SY HFW

H NQYK RQER QK

D NEE DQK

In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th

So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value

These substitutabilities are dealt with by the BLOSUM and PAM matrices

Exercises

Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.

Did you find any 'significant' hits?

Repeat with a second sequence.

What conclusions might you draw from this exercise?

Try the same sequence(s) against the nr nucleotide database.

Is there any general difference?
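If you would like more random sequences than the 20 provided, a sketch like the following could generate them. It assumes equal frequencies for the four bases, which may not be how the workshop file was produced; the sequence name and length are arbitrary.

```python
import random
from typing import Optional


def random_dna(length: int, seed: Optional[int] = None) -> str:
    """Return a random nucleotide string with equal base frequencies."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as FASTA text, wrapped at the given line width."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines)


print(to_fasta("random-1", random_dna(300, seed=1)))
```

The printed text can be pasted straight into the BLAST query box, definition line included.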

Part 4: Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.

prompt> blastall -p <blast type> -i <input sequence> -d <database>

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
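When searches are scripted, it can be convenient to assemble this command line in code. The sketch below only builds and prints the command; it does not run it. The file name `query.fasta` and the helper `blastall_command` are placeholders, and only the three options described above are used.

```python
import shlex


def blastall_command(program: str, query_file: str, database: str) -> list[str]:
    """Build the argument list for the simplest blastall invocation."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


cmd = blastall_command("blastx", "query.fasta", "nr")
print(shlex.join(cmd))  # blastall -p blastx -i query.fasta -d nr
```

On a machine where blastall and a formatted database are installed, the same list could be handed to `subprocess.run(cmd)`.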

Not All Parameters are …

There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
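The tabular output is the easiest to post-process. As a sketch, assuming the 12-column tab-separated layout that, to my knowledge, legacy blastall writes with -m 8, hits can be read and filtered by E-value as below. The two example rows are invented: the first borrows the coordinates, E-value and bit score from the alignment shown earlier, but the identifiers are made up.

```python
import csv
import io

# Assumed column order of blastall's tabular output (-m 8).
COLUMNS = ["query", "subject", "pct_identity", "length", "mismatches", "gap_opens",
           "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]


def read_hits(handle, max_evalue: float) -> list[dict]:
    """Return hits whose E-value is at or below max_evalue."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bit_score"] = float(hit["bit_score"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Two made-up rows, purely to exercise the parser.
example = (
    "query1\tsubjA\t100.00\t20\t0\t0\t3\t62\t972\t991\t8e-04\t43.1\n"
    "query1\tsubjB\t45.00\t20\t11\t0\t3\t62\t10\t29\t5.2\t18.5\n"
)
kept = read_hits(io.StringIO(example), max_evalue=1e-3)
print([hit["subject"] for hit in kept])
```

Filtering after the search like this is equivalent in spirit to setting -e, except that the weaker hits are still available in the file if you change your mind about the threshold.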

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.

(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1

[Screenshots: BLASTn results; BLASTx results]

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section. Then run BLASTx against the nr database.

What do you feel about these alignments?

Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!

Results for Exercise 2A (OFF)

[Screenshot: BLASTn, low complexity filtering OFF]

Results for Exercise 2A (ON)

[Screenshot: BLASTn, low complexity filtering ON]

Results for Exercise 2B

[Screenshots: filter ON; filter OFF]

Results for Exercise 2C

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA

[Screenshots: filter ON; filter OFF]
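A single-base insertion like this is easy to locate once the two sequences are aligned. The sketch below takes the two aligned strings from the Exercise 2C result, assuming the second one is the genomic sequence, and reports the alignment columns where exactly one sequence has a gap. The function name is hypothetical.

```python
def gap_columns(top: str, bottom: str) -> list[int]:
    """Return 1-based alignment columns where exactly one sequence has a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return [i for i, (a, b) in enumerate(zip(top, bottom), start=1)
            if (a == "-") != (b == "-")]


cdna = "AGAAAAGAAGAAACATGGCAATGGATCAGAA"
genomic = "AGAAAAGAAGAAACAT-GCAATGGATCAGAA"
print(gap_columns(cdna, genomic))  # [17]
```

An extra base in a cDNA shifts the reading frame from that point on, so in a translated search the protein match will typically continue in a different frame, which is a useful warning sign when annotating.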

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
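Combined queries are just text, so they can be assembled programmatically if you run many restricted searches. This is a minimal sketch; the helper names are invented, and the two organisms are simply ones mentioned elsewhere in the workshop.

```python
def organism_term(name: str) -> str:
    """Format an organism name as an Entrez organism term."""
    return f"{name}[ORGN]"


def combine(terms: list[str], operator: str = "OR") -> str:
    """Join Entrez terms with a logical operator (AND, OR or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)


query = combine([organism_term("Drosophila melanogaster"),
                 organism_term("Xenopus tropicalis")])
print(query)  # Drosophila melanogaster[ORGN] OR Xenopus tropicalis[ORGN]
```

The resulting string is what you would paste into the Entrez query box.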

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
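The effect of the mismatch penalty in part (ii) can be explored with a toy scorer. The sketch below finds the best-scoring ungapped stretch between two equal-length sequences under a given match reward and mismatch penalty. It is a simplification (no gaps, and not BLAST's actual extension procedure), and both sequences are made up.

```python
def best_segment(a: str, b: str, reward: int = 1, penalty: int = -3) -> tuple[int, int, int]:
    """Return (score, start, end) of the best ungapped local segment; end is exclusive."""
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        running += reward if x == y else penalty
        if running <= 0:
            # The stretch so far is not worth keeping; start again after this column.
            running, start = 0, i + 1
        elif running > best[0]:
            best = (running, start, i + 1)
    return best


seq_a = "ACGTACGTACGTACGTA"
seq_b = "ACGTAAGTCCGGACGTA"  # differs from seq_a at three positions

print(best_segment(seq_a, seq_b, reward=1, penalty=-3))  # (5, 0, 5)
print(best_segment(seq_a, seq_b, reward=1, penalty=-1))  # (11, 0, 17)
```

With the harsher penalty the best segment stops at the first mismatch, while the gentler penalty lets a single alignment run across all three mismatches. This is the kind of change to look for when you lower the penalty in the web form.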

Results for Exercise 4 (i)

[Screenshots: BLASTn alignment; tBLASTx alignment]

Results for Exercise 4 (ii)

[Screenshots: mismatch penalty = -2 (default); mismatch penalty = -1]

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Turn OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?
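Word size matters for short queries because BLAST only starts an alignment where the query and a database sequence share an exact word. The following sketch lists the shared words between a made-up 25-base query and a target that differs from it at two positions; both sequences and the function name are invented for illustration.

```python
def shared_words(query: str, subject: str, word_size: int) -> set[str]:
    """Exact words of length word_size that occur in both sequences."""
    query_words = {query[i:i + word_size] for i in range(len(query) - word_size + 1)}
    return {subject[i:i + word_size]
            for i in range(len(subject) - word_size + 1)
            if subject[i:i + word_size] in query_words}


oligo = "GATTACAGGCTCAATGCGTTCAGCA"   # made-up 25-base query
target = "GATTACAGTCTCAATGCATTCAGCA"  # same sequence with two mismatches

print(len(shared_words(oligo, target, 11)))  # 0: no seed, so the match is never found
print(sorted(shared_words(oligo, target, 7)))
```

With word size 11 the two mismatches leave no exact word in common, so a near-perfect target would be missed entirely; with word size 7 several seeds survive and the alignment can be extended from them.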

END

Page 66: Bioinformatics Workshop 1 Sequences and Similarity Searches

ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database

Did you find any lsquosignificantrsquo hits

Repeat with a second sequence

What conclusions might you draw from this exercise

Try the same sequence(s) against the nr nucleotide database

Is there any general difference

Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg

promptgt blastall ndashp ltblast typegt ndashi ltinput sequencegt ndashd ltdatabasegt

This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt

Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3

There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search

The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary

Here are some of the ones that I have used

-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers

BLAST Parameters Exercises
1. BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

Results for Exercise 1 BLASTn

BLASTx

BLAST Parameters Exercises
2. Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
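To get a feel for what a low-complexity filter does to a query such as sequence A, here is a toy Python sketch that masks long tandem repeats of a short unit with N. It is only an illustration under simple assumptions (an exact repeat unit and a fixed minimum copy number); the filters BLAST really uses are more sophisticated, and the function name and thresholds are invented for this example.

```python
import re


def mask_tandem_repeat(sequence, unit="AT", min_copies=6, mask_char="N"):
    """Replace every run of at least min_copies copies of unit with mask_char."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_copies))
    # Keep the sequence length unchanged so coordinates still line up.
    return pattern.sub(lambda match: mask_char * len(match.group(0)), sequence)


if __name__ == "__main__":
    query = "GGC" + "AT" * 8 + "CCA"
    print(mask_tandem_repeat(query))  # the 16-base AT run is masked, the flanks are kept
```

Masked positions cannot seed or extend a match, so hits that depend only on the repeat disappear while matches to the flanking sequence survive.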

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B: low complexity filtering ON vs OFF

Results for Exercise 2C: low complexity filtering ON vs OFF

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA

BLAST Parameters Exercises
3. Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
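Combining organism terms is just string building, as the small Python sketch below shows. The helper name organism_query is hypothetical, and which species names you join is your own choice; the result is simply the text you would paste into the limit box.

```python
def organism_query(*organisms, operator="OR"):
    """Join organism names into an Entrez limit string using the [ORGN] field tag."""
    terms = ["%s[ORGN]" % name for name in organisms]
    return (" %s " % operator).join(terms)


if __name__ == "__main__":
    print(organism_query("Drosophila melanogaster"))
    print(organism_query("Mus musculus", "Rattus norvegicus"))
```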

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page, SPECIAL section, Align two sequences.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
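Why the mismatch penalty matters in part (ii) can be seen from a simple ungapped score: each matching position earns the reward (-r) and each mismatching position costs the penalty (-q). The Python below is a minimal sketch with made-up sequences; it ignores gaps and the way BLAST trims alignments, and the function name is an assumption.

```python
def ungapped_score(seq_a, seq_b, reward, penalty):
    """Score two equal-length sequences position by position, without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Made-up 20-base stretch with 15 matches and 5 mismatches.
    first = "A" * 20
    second = "A" * 15 + "C" * 5
    for penalty in (-3, -2, -1):
        print(penalty, ungapped_score(first, second, 1, penalty))
```

With a harsh penalty a diverged stretch scores nothing and is dropped from the alignment; with a gentler penalty the same stretch contributes a positive score, so the alignment can extend through it.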

Results for Exercise 4 (i)

BLASTn tBLASTx

Results for Exercise 4 (ii)

Mismatch penalty = -2 (default) Mismatch penalty = -1

BLAST Parameters Exercises
5. E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?

BLAST Parameters Exercises
6. Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Switch OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?
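One way to think about the word size setting: a match can only be found if the query and a database sequence share at least one exact word of that length to seed the alignment. The Python sketch below lists the shared words for two short made-up sequences; it illustrates the seeding idea only and is not how BLAST indexes a database.

```python
def shared_words(query, subject, word_size):
    """Return the set of exact words of length word_size present in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(query) & words(subject)


if __name__ == "__main__":
    query = "ACGTACGA"    # made-up sequences, for illustration only
    subject = "TTGTACTT"
    print(shared_words(query, subject, 5))  # no exact 5-base word in common
    print(shared_words(query, subject, 4))  # one 4-base seed: GTAC
```

For a short oligo, a couple of mismatches can break every long word, so a smaller word size is needed before near-matches can be seeded at all.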

END


BLAST Parameters Exercises 1: BLASTn vs BLASTx

Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.

Go to the Nucleotide-nucleotide BLAST (blastn) section of the NCBI BLAST Home Page and paste your sequence into the box.

Run BLASTn against the nr nucleotide database using all default options, then hit [format] and wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)

Now repeat, but go to the TRANSLATED BLAST section and run BLASTx against the nr protein database.

How might the different results help us view the presence of this gene in other vertebrates?
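As an aside to Exercise 1, the sketch below illustrates what a translated search does with the query before any comparison is made: the nucleotide sequence is read in all six frames (three per strand) and each frame is translated to protein. It is a minimal illustration using the standard genetic code, not NCBI's implementation; the function names and the nine-base toy sequence are made up for the example.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG base order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Translate both strands in all three frames (hypothetical helper)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, not one of the workshop example sequences.
    for frame, peptide in six_frame_translations("ATGGCCTAA").items():
        print(frame, peptide)
```

Each of the six peptides is then compared with the protein database, which is why a translated search can still find matches after the nucleotide sequences have drifted apart.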

Results for Exercise 1 (panels: BLASTn, BLASTx)

BLAST Parameters Exercises 2: Low complexity filtering

Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.

Go to the TRANSLATED BLAST section (BLASTx) of the NCBI BLAST Home Page and paste your sequence into the box.

Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
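The effect of the filter in Exercise 2 can be sketched with a toy masker. NCBI BLAST uses dedicated low-complexity filters (DUST for nucleotide sequences, SEG for protein); the version below is a much simpler stand-in that masks any window whose Shannon entropy falls below a threshold. The window size, the threshold and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per symbol) of the letters in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="N"):
    """Replace every window with entropy below min_entropy by mask_char.

    Illustrative only: this is not the DUST or SEG algorithm.
    """
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("ATATATATATATATAT"))   # AT repeat: fully masked
    print(mask_low_complexity("ACGTTGCAAGCTTCGA"))   # mixed sequence: untouched
```

Masked positions cannot seed or extend an alignment, so matches that rest only on the repeat disappear when the filter is ON.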

Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF

Results for Exercise 2A (ON): BLASTn, low complexity filtering ON

Results for Exercise 2B (panels: ON, OFF)

Results for Exercise 2C (panels: ON, OFF)

There is a sequence error: an extra G at position 117 in the sequence.

cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
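One way to see why the extra G in Exercise 2C matters to annotators is to translate both versions of the segment in the same reading frame. The sketch below uses the two sequences from the alignment above, with the gap removed from the genomic copy. The reading frame is chosen arbitrarily (the true frame is not given here), so the peptides are illustrative only, and the function names are made up for the example.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG base order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons starting at the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def first_difference(a, b):
    """Index of the first position where two strings differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


CDNA = "AGAAAAGAAGAAACATGGCAATGGATCAGAA"
GENOMIC = "AGAAAAGAAGAAACAT-GCAATGGATCAGAA".replace("-", "")

if __name__ == "__main__":
    cdna_peptide = translate(CDNA)
    genomic_peptide = translate(GENOMIC)
    print(cdna_peptide)      # frame chosen arbitrarily
    print(genomic_peptide)
    print("peptides diverge at residue", first_difference(cdna_peptide, genomic_peptide) + 1)
```

Everything downstream of a single inserted base is read in a different frame, so a protein-level match can stop abruptly at that point and resume in another frame.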

BLAST Parameters Exercises 3: Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.

Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the TRANSLATED BLAST section (BLASTx) of the NCBI BLAST Home Page, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…

BLAST Parameters Exercises 4: BLASTn vs tBLASTx, and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the SPECIAL - Align two sequences section of the NCBI BLAST Home Page.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
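A rough sketch of why the mismatch penalty in part (ii) matters. With a match reward of +1 (an assumption here; the slides only state the penalties), an ungapped nucleotide alignment accumulates a positive score only while the fraction of identical bases stays above a break-even value set by the reward and penalty. The two 12-base sequences and the function names below are invented for the example.

```python
def ungapped_score(seq_a, seq_b, reward=1, penalty=-2):
    """Score two equal-length sequences position by position, with no gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


def break_even_identity(reward, penalty):
    """Fraction of identical bases at which the expected score is zero."""
    return -penalty / (reward - penalty)


if __name__ == "__main__":
    # Toy sequences: 8 identical positions, 4 mismatches.
    a = "GCTGAAGGTCTG"
    b = "GCAGAGGGCCTC"
    for penalty in (-3, -2, -1):
        print(f"penalty {penalty}: score {ungapped_score(a, b, penalty=penalty)}, "
              f"break-even identity {break_even_identity(1, penalty):.0%}")
```

Under these assumptions a penalty of -2 needs roughly two-thirds identity to keep an alignment growing, while -1 needs only half, which is why a gentler penalty lets BLASTn extend through more diverged regions.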

Results for Exercise 4 (i) (panels: BLASTn, tBLASTx)

Results for Exercise 4 (ii) (panels: Mismatch penalty = -2 (default), Mismatch penalty = -1)

BLAST Parameters Exercises 5: E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section (BLASTp) and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?
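A back-of-envelope sketch for Exercise 5, assuming the standard bit-score form of the Karlin-Altschul relation, E = m × n × 2^(-S'), where m is the query length, n the database length and S' the bit score. Real BLAST uses corrected "effective" lengths, so the numbers are indicative only; the query and database sizes below are invented, and the function names are hypothetical.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def min_bit_score(max_expect, query_len, db_len):
    """Smallest bit score that is still reported for a given Expect limit."""
    return math.log2(query_len * db_len / max_expect)


if __name__ == "__main__":
    query_len = 10        # a short motif (assumed length)
    db_len = 2 ** 30      # about a billion residues (assumed size)
    for limit in (10, 100):
        print(f"Expect {limit}: hits need at least "
              f"{min_bit_score(limit, query_len, db_len):.1f} bits")
```

Raising the limit from 10 to 100 lowers the reporting threshold by log2(10), a little over 3 bits, which can be the difference between seeing and not seeing the best possible match to a very short query.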

BLAST Parameters Exercises 6: Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section (BLASTn) and paste the sequence.

Switch OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?
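Why word size matters for a short oligo can be sketched as follows. BLAST only starts an alignment where query and subject share an exact word, so the sketch lists the words shared by two sequences at two word sizes. The 25-base sequences are invented (they are not the >morpholino sequence), 11 is assumed as the default BLASTn word size, and the function name is hypothetical.

```python
def shared_words(query, subject, word_size):
    """Return the sorted exact words of length word_size found in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return sorted(words(query) & words(subject))


if __name__ == "__main__":
    # Toy 25-mers differing at three positions (8th, 16th and 24th base).
    query = "ACGTTGCAAGCTTCGATGCCATGAC"
    subject = "ACGTTGCTAGCTTCGCTGCCATGGC"
    for word_size in (11, 7):
        print(f"word size {word_size}: {shared_words(query, subject, word_size)}")
```

With three mismatches spread across 25 bases there is no shared 11-base word, so the match would never be seeded, whereas three separate 7-base words are shared and each can start an alignment.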

END

  • Bioinformatics Workshop 1 Sequences and Similarity Searches
  • The Basic Questions
  • Part 1 Structural Genomics
  • Chromosomes and Genes
  • Gene to Protein
  • Sequence Signals
  • Genomic Signals
  • Derivative Sequences
  • Gene Models
  • Sequences and Genes (Accession Numbers and Names)
  • Gene Symbols Names Etc
  • A Gene-Centric View
  • Sequences and Accession Numbers
  • mRNA Splicing Signals
  • Gene Predictions
  • Supporting Evidence
  • TheoreticalPredicted Sequences
  • Sequences for a model organism
  • So Whatrsquos in the Databases Now
  • Part 2 Comparative Genomics
  • Speciation
  • Residual Similarity
  • Computers Can Detect Homology
  • Orthologs
  • Paralogs
  • lsquoOtherrsquo-logs
  • The Essential Paradigm
  • Function Conserved Longer than Detectable Similarity
  • Redundancy in the Genetic Code
  • Protein Similarity Persists Longer
  • Always Compare Protein Sequences
  • Exercise 1 nucleotide vs amino acid search
  • Answers Exercise 1
  • The Essential Task
  • Functional Orthologs
  • Finding Orthologs
  • Using Synteny is Better
  • Metazome
  • Metazome Exercise
  • Part 3 Finding Sequence Similarities
  • Gaps in Alignments
  • The Downside of Gaps
  • BLAST
  • Flavours of BLAST
  • How does it work
  • BLAST WORDS and INDEXING
  • Analyse the Query Sequence
  • Expand from Word Based Matches
  • BLAST ndashTypical Output
  • When is a match significant
  • E-values
  • E-values From First Principles
  • Calculating an E-value
  • E-values In Practice
  • E-value Exercise
  • E-value Exercise Answer
  • E-values Effect of Database Size
  • Slide 58
  • Why were the values different
  • E-values Effect of Query Length
  • Why not just use identity
  • So whatrsquos the real problem
  • Rules of Thumb
  • Protein BLAST
  • Amino Acid Substitutions
  • Exercises
  • Part 4 Tweaking BLAST
  • Not All Parameters are hellip
  • The Many Parameters of BLAST
  • Slide 70
  • BLAST Parameters Exercises
  • Results for Exercise 1
  • Slide 73
  • Results for Exercise 2A (OFF)
  • Results for Exercise 2A (ON)
  • Results for Exercise 2B
  • Results for Exercise 2C
  • Slide 78
  • Slide 79
  • Results for Exercise 4 (i)
  • Results for Exercise 4 (ii)
  • Slide 82
  • Slide 83
  • END
Page 75: Bioinformatics Workshop 1 Sequences and Similarity Searches

Results for Exercise 2B

ON OFF

Results for Exercise 2C

There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA

Genomic sequence

ON

OFF

BLAST Parameters Exercises3 Limit by Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT

Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip

BLAST Parameters Exercises 4: BLASTn vs tBLASTx and nucleotide mismatch penalties

Open the file example-sequences.html.

Also open the NCBI BLAST Home Page and go to the SPECIAL section, 'Align two sequences'.

There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.

(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?

(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
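To see why the mismatch penalty in part (ii) matters, it can help to score a toy ungapped alignment by hand. The sketch below assumes a simple match reward of +1 and compares penalties of -2 and -1; the two sequences are invented for illustration and the function `ungapped_score` is hypothetical, not BLAST's own scoring code.

```python
def ungapped_score(a: str, b: str, reward: int = 1, penalty: int = -2) -> int:
    """Score two equal-length sequences column by column, without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(reward if x == y else penalty for x, y in zip(a, b))


# Invented 10-base sequences differing at two positions.
seq_a = "ACGTACGTAC"
seq_b = "ACGAACGTTC"

print(ungapped_score(seq_a, seq_b, penalty=-2))  # 4
print(ungapped_score(seq_a, seq_b, penalty=-1))  # 6
```

With 8 matches and 2 mismatches, the score rises from 4 to 6 when the penalty is relaxed, so an alignment through a diverged region is less likely to be cut short.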

Results for Exercise 4 (i)

[Figure: two result panels, labelled BLASTn and tBLASTx]

Results for Exercise 4 (ii)

[Figure: two result panels, labelled 'Mismatch penalty = -2 (default)' and 'Mismatch penalty = -1']

BLAST Parameters Exercises 5: E-value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, choose BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search, setting the maximum E-value in the box:

Expect: 100

What difference does this make?
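The Expect setting acts as a reporting cut-off: hits whose E-value exceeds it are not listed. The following sketch illustrates that idea with invented hit names and E-values. The threshold of 10 is assumed here as the long-standing BLAST default; check the value actually shown on the search page.

```python
# Invented (name, E-value) pairs standing in for hits to a short query.
hits = [
    ("hit_A", 0.002),
    ("hit_B", 4.5),
    ("hit_C", 37.0),
    ("hit_D", 88.0),
    ("hit_E", 250.0),
]


def reported(hit_list: list[tuple[str, float]], expect: float) -> list[str]:
    """Return the names of hits whose E-value does not exceed the threshold."""
    return [name for name, e_value in hit_list if e_value <= expect]


print(reported(hits, 10))   # ['hit_A', 'hit_B']
print(reported(hits, 100))  # ['hit_A', 'hit_B', 'hit_C', 'hit_D']
```

Raising the threshold only reveals more hits with high E-values; it does not change the E-value of any individual hit.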

BLAST Parameters Exercises 6: Word Size

Open the file example-sequences.html.

Copy the sequence >morpholino and go to the NCBI BLAST Home Page. Go to the NUCLEOTIDE BLAST section, choose BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of the default -3)

What difference does this make?
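Word size matters most for short queries, because a search is seeded only where query and target share an exact word. The toy model below compares words at the same position in two invented 25-base sequences that differ at two positions. It is a simplification (real BLAST indexes words rather than comparing position by position), the function `shared_word_starts` is hypothetical, and a word size of 11 is assumed as a typical BLASTn setting.

```python
def shared_word_starts(query: str, target: str, k: int) -> list[int]:
    """Return 0-based start positions where both sequences share an exact k-mer."""
    if len(query) != len(target):
        raise ValueError("this toy model compares equal-length sequences")
    return [
        i
        for i in range(len(query) - k + 1)
        if query[i:i + k] == target[i:i + k]
    ]


# Invented 25-base query and a target with mismatches at positions 8 and 16.
query = "ACGTTGCAAGCTTAGGCATCGATCA"
target = "ACGTTGCATGCTTAGGGATCGATCA"

print(shared_word_starts(query, target, 11))  # []
print(shared_word_starts(query, target, 7))   # [0, 1, 9, 17, 18]
```

With a word size of 11 no exact word survives the two mismatches, so this target would never be seeded; with a word size of 7 there are five seeds from which an alignment could be extended.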

END
