Sequence Database Searching Computing in Molecular Biology Hugues Sicotte National Center for Biotechnology Information [email protected]



Sequence Database Searching

Alignment methods

[Figure: dot plot; axes: query sequence, subject sequence]

A sequence alignment can be represented using a dot plot.

Comparing a query of N letters against a subject sequence of M letters requires MxN comparisons.
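The MxN cost can be seen in a minimal dot-plot sketch. The function names `dot_plot` and `render` and the example sequences are illustrative assumptions, not part of the slides.

```python
def dot_plot(query: str, subject: str) -> list[list[bool]]:
    """Return an M x N grid: cell [i][j] is True when subject[i] == query[j]."""
    return [[s == q for q in query] for s in subject]


def render(grid: list[list[bool]]) -> str:
    """Draw matching cells as '*' and all others as '.'."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


if __name__ == "__main__":
    query, subject = "WASHERE", "ASHER"
    grid = dot_plot(query, subject)
    # Each cell costs one letter comparison, so the grid costs M x N in total.
    print(len(subject) * len(query), "comparisons")
    print(render(grid))
```

An alignment shows up as a run of '*' along a diagonal of the rendered grid.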

HASHING METHODS

Hashing is a common method for accelerating database searches

Query sequence: MLIIKRDELVISWASHERE

All overlapping words of size 3:

MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE

Compile a “dictionary” of words from the query sequence. Put each word in a look-up table that points to its original position in the sequence. Thus, given a word, you can tell whether it occurs in the query in a single operation.
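The dictionary described above can be sketched with a Python dict acting as the look-up table. The function name `build_word_table` is a hypothetical choice; positions are 0-based.

```python
from collections import defaultdict


def build_word_table(query: str, k: int = 3) -> dict[str, list[int]]:
    """Map every overlapping word of length k to its 0-based start positions."""
    table: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return dict(table)


if __name__ == "__main__":
    table = build_word_table("MLIIKRDELVISWASHERE")
    print(len(table), "distinct words")
    print(table.get("ASH"))  # start positions of the word, or None if absent
    print("QQQ" in table)    # membership is a single hash lookup
```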


Index lookup

Each word is assigned a unique integer.

E.g. for a word of 3 letters made up of an alphabet of 20 letters.

1. Assign a code to each letter Code(l) (0 to 19)

2. For a word of 3 letters L1 L2 L3 the code is

index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)

3. Have an array with a list of the positions that have that word.

[Figure: array indexed by word code (0, 1, 2, 3, ...); each entry holds the position in the query sequence of that word]
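The three steps above can be sketched as follows. The letter ordering in `AMINO_ACIDS` is an assumption (any fixed assignment of codes 0 to 19 works), and the function names are hypothetical.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # 20 letters; this ordering is an assumption
CODE = {letter: i for i, letter in enumerate(AMINO_ACIDS)}


def word_index(word: str) -> int:
    """Read the word as a base-20 number, first letter most significant."""
    index = 0
    for letter in word:
        index = index * len(AMINO_ACIDS) + CODE[letter]
    return index


def build_index_table(query: str, k: int = 3) -> list[list[int]]:
    """Array of 20**k slots; slot i lists the query positions of the word with index i."""
    table: list[list[int]] = [[] for _ in range(len(AMINO_ACIDS) ** k)]
    for pos in range(len(query) - k + 1):
        table[word_index(query[pos:pos + k])].append(pos)
    return table
```

For 3-letter words the array has 20^3 = 8000 slots, most of them empty for a short query; the benefit is that a lookup is a plain array access.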

HASHING METHODS

Building the dictionary for the query sequence requires (N-2) operations.

Query sequence: MLIIKRDELVISWASHERE (N = 19 letters, giving N-2 = 17 overlapping words of size 3)

A subject sequence of M letters contains (M-2) words, and it takes only one operation to see whether each word is in the query.

HASHING METHODS

[Figure: dot matrix; axes: query sequence, subject sequence]

Scan the subject, looking up words in the dictionary.

Use word hits to determine where to search for alignments.

This fills the dynamic programming matrix in (N-2)+(M-2) operations instead of MxN.
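A minimal sketch of the scan: build the query dictionary once, then walk the subject and look up each of its words. The subject sequence and function names are made up for illustration.

```python
from collections import defaultdict


def build_word_table(query: str, k: int = 3) -> dict[str, list[int]]:
    """Map every overlapping k-letter word of the query to its start positions."""
    table: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return dict(table)


def find_word_hits(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    table = build_word_table(query, k)          # N-k+1 insertions
    hits = []
    for s_pos in range(len(subject) - k + 1):   # M-k+1 lookups
        for q_pos in table.get(subject[s_pos:s_pos + k], []):
            hits.append((q_pos, s_pos))
    return hits


if __name__ == "__main__":
    for q_pos, s_pos in find_word_hits("MLIIKRDELVISWASHERE", "GGKRDELVGGWASHGG"):
        print(q_pos, s_pos, "diagonal", s_pos - q_pos)
```

In this example every hit falls on the same diagonal (subject_pos - query_pos = -2), which is the kind of cluster that suggests where an alignment lies.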

HASHING METHODS

Scan the database, looking up words in the dictionary

Use word hits to determine where to search for alignments.

FASTA searches for alignments within a band around the word hits.
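The band idea can be sketched loosely as: pick the diagonal with the most word hits, then restrict the search to cells within a fixed distance of it. This is in the spirit of FASTA only, not its actual implementation; `best_diagonal`, `band_cells` and `half_width` are hypothetical names.

```python
from collections import Counter
from typing import Iterator


def best_diagonal(hits: list[tuple[int, int]]) -> int:
    """hits are (query_pos, subject_pos); diagonal = subject_pos - query_pos."""
    counts = Counter(s - q for q, s in hits)
    return counts.most_common(1)[0][0]


def band_cells(diagonal: int, half_width: int,
               n_query: int, m_subject: int) -> Iterator[tuple[int, int]]:
    """Yield the (query_pos, subject_pos) cells within half_width of the diagonal."""
    for q in range(n_query):
        low = max(0, q + diagonal - half_width)
        high = min(m_subject, q + diagonal + half_width + 1)
        for s in range(low, high):
            yield q, s
```

Only the cells yielded by `band_cells` would need to be filled by dynamic programming, roughly N x (2 * half_width + 1) cells instead of N x M.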

HASHING METHODS

Scan the database, looking up words in the dictionary

Use word hits to determine where to search for alignments

BLAST extends from word hits
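Extension from a word hit can be sketched as an ungapped walk in both directions that stops once the running score falls too far below the best score seen. The match/mismatch scores and the `x_drop` cutoff are assumptions for illustration; real BLAST scores proteins with a substitution matrix.

```python
def extend_right(query: str, subject: str, q_start: int, s_start: int,
                 match: int = 1, mismatch: int = -1, x_drop: int = 3) -> tuple[int, int]:
    """Extend without gaps; return (best_score, length reaching that score)."""
    score = best = best_len = length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # fell too far below the best score: give up
    return best, best_len


def extend_hit(query: str, subject: str, q_pos: int, s_pos: int, k: int = 3,
               match: int = 1, mismatch: int = -1, x_drop: int = 3) -> tuple[int, int, int]:
    """Grow an exact k-letter hit both ways; return (score, query_start, query_end)."""
    right_score, right_len = extend_right(query, subject, q_pos + k, s_pos + k,
                                          match, mismatch, x_drop)
    # Extending left is extending right on the reversed prefixes.
    left_score, left_len = extend_right(query[:q_pos][::-1], subject[:s_pos][::-1],
                                        0, 0, match, mismatch, x_drop)
    score = k * match + left_score + right_score
    return score, q_pos - left_len, q_pos + k + right_len
```

With a generous `x_drop` the extension bridges short mismatched stretches; with a small one it stops at the first weak region.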

Database Search Space

[Figure: search space; axes: query sequence, concatenated database sequence]

The simplest database search can be viewed as one large dynamic programming problem, with all the database sequences concatenated one after another.
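One practical detail of a concatenated database is mapping a position in the long string back to the sequence it came from. A small sketch, using made-up sequence names and a sorted list of start offsets:

```python
from bisect import bisect_right
from itertools import accumulate


def concatenate(sequences: dict[str, str]) -> tuple[str, list[str], list[int]]:
    """Join all sequences; also return their names and start offsets."""
    names = list(sequences)
    ends = list(accumulate(len(sequences[name]) for name in names))
    starts = [0] + ends[:-1]
    return "".join(sequences[name] for name in names), names, starts


def locate(offset: int, names: list[str], starts: list[int]) -> tuple[str, int]:
    """Map an offset in the concatenation to (sequence name, position within it)."""
    i = bisect_right(starts, offset) - 1
    return names[i], offset - starts[i]
```

A real search also has to make sure that no alignment runs across the boundary between two database sequences; this sketch does not handle that.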

Database Search Space

Which alignment is more significant?

Database Search Space

Scores can be used to judge alignments, but a score's absolute value is a function of the scoring parameters.

Match=+1, Mismatch=-1, Gap_open=5, gap_extend=1

yields the same alignments as

Match=+10, Mismatch=-10, Gap_open=50, gap_extend=10

Scores are useful for relative ranking.
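A short sketch of why the two parameter sets above rank alignments identically: multiplying every parameter by 10 multiplies every alignment score by 10. The gap convention used here (a gap of length g costs gap_open + g * gap_extend) is an assumption, as are the example sequences.

```python
def alignment_score(a: str, b: str, match: int, mismatch: int,
                    gap_open: int, gap_extend: int) -> int:
    """Score two equal-length aligned strings in which '-' marks a gap.

    Assumed convention: a gap of length g costs gap_open + g * gap_extend.
    Simplification: adjacent gap columns count as one gap even if they
    switch from one sequence to the other.
    """
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                score -= gap_open
            score -= gap_extend
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


if __name__ == "__main__":
    a, b = "ACGTTGCA", "ACG--GTA"
    print(alignment_score(a, b, 1, -1, 5, 1))
    print(alignment_score(a, b, 10, -10, 50, 10))
```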

Database Search Space

To judge the relevance of an alignment, one needs to judge whether the match is significant.

E-value = Expect(S) is a function of the score, the database size and composition, and the query size.

It is the number of alignments with scores >= S expected if the query were a random sequence, given the database size and composition.

An Expect of 0.0 means a very good match that is unlikely to be random.

DATABASE SEARCHING

Compare one query sequence against an entire database

A typical search has four basic elements:

> fasta myquery swissprot -ktup 2

fasta: search program
myquery: query sequence
swissprot: sequence database
-ktup 2: optional parameters

DATABASE SEARCHING

With exponential database growth, searches keep taking more time

> fasta myquery swissprot -ktup 2

searching . . . . . .


E-value

“Hits” can be sorted according to their E-value or their score.

The E-value is better known as the EXPECT value and is a function of score, database size and query sequence length.

E-value: Number of alignments with a score >=S that you expect to find if the database was a collection of random letters.

e.g. A score of 1 requires only 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find only an insignificant number of alignments that could be due to chance.

E-values of less than 1e-6 (1 x 10^-6 in scientific notation) are usually very good, and for proteins E<1e-2 is usually considered significant. It is still possible for a hit with E>1 to be biologically meaningful, but more analysis is required to confirm that.

Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…)
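How the E-value depends on score, query length and database size can be sketched with the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), which is not stated on these slides. The default `lam` and `k` values below are assumed, illustrative constants; real programs derive them from the scoring system in use.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 lam: float = 0.318, k: float = 0.134) -> float:
    """Karlin-Altschul estimate of the number of chance alignments scoring >= score.

    lam and k are assumed, illustrative constants.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Hypothetical 150-letter query against a 30-million-letter database.
    for s in (20, 40, 60, 80, 100):
        print(s, expect_value(s, query_len=150, db_len=30_000_000))
```

The estimate falls exponentially as the score grows and rises linearly with the size of the database, which is why the same score is less significant in a larger database.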

DATABASE SEARCHING

The “hit list” gives titles and scores for matched sequences

> fasta myquery swissprot -ktup 2

The best scores are: initn init1 opt z-sc E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3

DATABASE SEARCHING

Detailed alignments are shown farther down in the output

> fasta myquery swissprot -ktup 2

>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

10 20 30 40 50
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
: X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
10 20 30 40 50 60

60 70 80 90 100 110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
70 80 90 100 110 120

120 130 140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
130 140 150 160 170 180

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

10 20 30 40
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
:.. :. .v^: :.. ..:::: ::.::::::. ::X :

Database Search Space

Some matches are not meaningful because they occur VERY often in the database, e.g.:

nucleotide AAA (from polyA)

Biological repeated elements (retroposons, ALU)

Low-complexity repeated patterns (CAGCAG, QQQ, KKK, …)

These elements should be FILTERED or MASKED to avoid generating false ‘hits’. It is ‘OK’ to align through them if they are near meaningful diagonal ‘hits’.
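Masking can be sketched with a sliding-window Shannon entropy filter: windows that use very few distinct letters are replaced by a mask character. This is a simplified stand-in for real low-complexity filters, and the `window` and `threshold` values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the letter distribution inside the window."""
    counts = Counter(window)
    return -sum((c / len(window)) * math.log2(c / len(window))
                for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace every window whose entropy is below the threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)
```

A run such as QQQQQQ... has an entropy near zero and is masked, while a window with many different residues is left alone. Residues flanking a repeat can also be masked, because windows overlap.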


Score and Statistics

Some amino acid mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.

This calls for a scoring system that does not heavily penalize mutations to similar amino acids.

PAM matrices: Point Accepted Mutations. Defined in terms of a divergence of 1 percent (1 PAM). For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations and others don't, so using the PAM100 matrix doesn't mean that the sequences compared were 100% mutated.

BLOSUM: BLOCKS substitution matrices. Started with the BLOCKS database of multiple alignments involving only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices involved alignments of more distant sequences. BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.

SCORING SYSTEMS

BLOSUM62

Figure 7.8

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V

Some amino acid substitutions are more common than others

Substitution scores come from an odds ratio based on measured substitution rates
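The odds-ratio idea can be sketched as a log-odds score: how much more often a pair of residues is observed aligned than expected by chance. BLOSUM62 scores are expressed in half-bit units, hence the default `scale` of 2; the frequencies in the usage lines are made-up numbers, not measured values.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded scale * log2(observed / expected) for one ordered residue pair.

    pair_freq is the observed frequency of the aligned pair (a, b);
    freq_a * freq_b is the frequency expected if residues paired at random.
    """
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(log_odds_score(0.04, 0.1, 0.1))    # seen 4x more often than chance
    print(log_odds_score(0.0025, 0.1, 0.1))  # seen 4x less often than chance
```

Pairs observed more often than chance get positive scores, pairs observed less often get negative scores, and a pair seen exactly as often as chance scores 0.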


Identities get positive scores, but some are better than others


Some non-identities have positive scores, but most are negative
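A small sketch of scoring an ungapped alignment with a substitution matrix. Only a handful of entries are copied from the BLOSUM62 matrix shown on these slides; the dictionary name, the helper functions and the two example peptides are illustrative assumptions.

```python
# A few entries copied from the BLOSUM62 matrix on the slides (not the full matrix).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("L", "L"): 4, ("C", "C"): 9, ("W", "W"): 11,  # identities
    ("I", "V"): 3, ("K", "R"): 2, ("D", "E"): 2,                  # similar residues
    ("L", "D"): -4, ("W", "G"): -2,                               # dissimilar residues
}


def pair_score(a: str, b: str) -> int:
    """Look up a pair in either order, since the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # raises KeyError for pairs not in the subset


def ungapped_score(seq1: str, seq2: str) -> int:
    """Sum the substitution scores of two equal-length sequences, column by column."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(ungapped_score("WAKIL", "WARVD"))
```

The W identity contributes 11 while the A identity contributes only 4, and the conservative I/V and K/R substitutions still score positively.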


BLAST and BLAST2SEQUENCES

BLAST is a database search engine that uses hashing to accelerate the search.

blastn (nucleotide query against nucleotide database)

blastp (protein query against protein database)

blastx (nucleotide query against protein database)
- translates a nucleotide query in all 6 reading frames and compares it to a protein database.

tblastn (protein query against nucleotide database)
- compares a protein against a nucleotide database translated in all 6 reading frames.

tblastx (nucleotide query against nucleotide database)
- compares a nucleotide sequence against a nucleotide database by translating the query and database in all 6 reading frames. Very slow!

A pairwise alignment implementation of this program is available at:

http://www.ncbi.nlm.nih.gov/gorf/bl2.html
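The "all 6 reading frames" step used by blastx, tblastn and tblastx can be sketched as follows, using the standard genetic code. The function names and the frame labels (+1 to +3, -1 to -3) are illustrative choices.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translate both strands at offsets 0, 1 and 2."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

tblastx has to do this for the query and for every database sequence, which is one reason it is so slow.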

Protein BLAST databases

nr All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF

month All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days.

swissprot Last major release of the SWISS-PROT protein sequence database (no updates)

Drosophila Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).

yeast Yeast (Saccharomyces cerevisiae) genomic CDS translations

ecoli Escherichia coli genomic CDS translations

pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank

kabat [kabatpro] Kabat's database of sequences of immunological interest

alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.


Nucleotide BLAST databases

nr All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".

month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.

Drosophila genome Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).

dbest Database of GenBank+EMBL+DDBJ sequences from EST Divisions

dbsts Database of GenBank+EMBL+DDBJ sequences from STS Divisions

htgs Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)

gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

yeast Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences

E. coli Escherichia coli genomic nucleotide sequences


pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank

kabat [kabatnuc] Kabat's database of sequences of immunological interest

vector Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/

mito Database of mitochondrial sequences

alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.

epd Eukaryotic Promoter Database found on the web at http://www.genome.ad.jp/dbget-bin/www_bfind?epd


BLASTN SEARCH (M29204)

Search Nucleotide sequence M29204 against nr.

http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=1

BLASTP and filtering

Search using blastp against nr with filtering ON (default), then with filtering OFF.

>GCF

MKKRVTNRERHWTHRRRRQRTRKKKKKKKRVLGRRALGPRPWLTGRKGLFGSARLIPATA


BLASTN vs BLASTX

Search blastn against nr (nucleotide) U15595

Now search using blastx against nr (protein).

Now search blastx against ALU.


TBLASTX against dbEST

Search tblastx against dbEST

Picks up homologs based on protein homology of translations.

>OCRL-selected mRNA, partial sequence
TTGAACATCATGAAACATGAGGTTGTCATTTGGTTGGGAGATTTGAATTATAGACTTTGCATGCCTGATGCCAATGAGGTGAAAAGTCTTATTAATAAGAAAGACCTTCAGAGACTCTTGAAATTCGACCAGCTAAATATTCAGCGCACACAGAAAAAAGCTTTTGTTGACTTCAATGAAGGGGAAATCAAGTTCATCCCCACTTATAAGTATGACTCTAA


Prosite search

Search prosite for

NP_000271 (Pax6a)

http://www.expasy.ch/prosite

PHI-Blast search

Search the Prosite db using NCBI’s PHI-BLAST (Pattern-Hit-Initiated BLAST) with the pattern for Pax6a.

[LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]-X-(4)-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W

-e 2e-14
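What a PROSITE-style pattern means can be sketched by converting it to a regular expression. The converter below handles only simple elements (residue, [allowed set], {forbidden set}, x, and repeat counts); the short pattern in the usage lines is hypothetical. The pattern printed on the slide contains the element "X-(4)", which this sketch rejects rather than guessing what was intended.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern (no anchors) to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            residue = "."                            # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"     # forbidden residues
        repeat = "" if low is None else "{" + low + ("," + high if high else "") + "}"
        parts.append(residue + repeat)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[LIV]-x(2)-[RK]-W")    # hypothetical pattern
    print(regex, re.search(regex, "AAIGGKWAA"))
```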

PSI-Blast search

Search AB026911 using PSI-BLAST (at NCBI).

PSI-BLAST (Position-Specific Iterated BLAST) modifies the scoring matrix as a function of conserved or unconserved residues in alignments.
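The idea of a position-specific scoring matrix can be sketched from a small ungapped alignment: each column gets its own residue scores, so conserved positions reward the conserved residue strongly. The uniform background frequency and the pseudocount are simplifying assumptions; the real program uses sequence weighting and more careful statistics.

```python
import math
from collections import Counter

ALPHABET = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned: list[str], background: float = 0.05,
               pseudocount: float = 1.0) -> list[dict[str, float]]:
    """One dict per alignment column: residue -> log2(column frequency / background)."""
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in ALPHABET})
    return pssm


def score_against_pssm(seq: str, pssm: list[dict[str, float]]) -> float:
    """Score a sequence of the same length as the matrix, column by column."""
    return sum(column[aa] for aa, column in zip(seq, pssm))
```

In an iterated search, the matrix built from the hits of one round would be used to score the database in the next round.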


ONLINE tutorials

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Details of Blast methodology.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Blast usage and Tutorial

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

Quick overview of terminology.