Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. [email protected]. Johns Hopkins

Advanced BLAST Searching

September 13, 2006

Jonathan Pevsner, Ph.D.Introduction to Bioinformatics

[email protected] Hopkins School of

Public Health (260.602.01)

Copyright notice

Many of the images in this powerpoint presentationare from Bioinformatics and Functional Genomicsby Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

These images and materials may not be usedwithout permission from the publisher. We welcomeinstructors to use these powerpoints for educationalpurposes, but please acknowledge the source.

The book has a homepage at http://www.bioinfbook.orgincluding hyperlinks to the book chapters.

http://www.bioinfbook.org/

Outline of today’s lecture

Organism-specific BLAST sites

Specialized BLAST-related algorithms

BLAST-like tools for genomic DNA searches

PSI-BLAST

PHI-BLAST

Find-a-gene

Specialized BLAST servers

Organism-specific BLAST sites

Page 128

Fig. 5.1Page 131

Ensembl BLAST output includes an ideogram

Fig. 5.2Page 131

TIGR BLAST

Fig. 5.3Page 132

Fig. 5.4Page 132

Fig. 5.5Page 133

BLAST output of ProDom server: graphical view of domains

Fig. 5.6Page 134

Specialized BLAST servers

Molecule-specific BLAST sites

Species-specific BLAST sites

Specialized algorithms (WU-BLAST 2.0)

Page 135

Fig. 5.7Page 136

Fig. 5.8Page 137

FASTA server at the University of Virginia

We will discuss the Conserved Domain Database(CDD) in chapter 10(multiple sequence alignment)

We will discuss the Conserved Domain Database(CDD) in chapter 10(multiple sequence alignment)

BLAST-related tools for genomic DNA

The analysis of genomic DNA presents special challenges:• There are exons (protein-coding sequence) andintrons (intervening sequences).

• There may be sequencing errors or polymorphisms• The comparison may between be related species(e.g. human and mouse)

Page 135

BLAST-related tools for genomic DNA

Recently developed tools include:

• MegaBLAST at NCBI.

• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror imageof the BLAST strategy. See http://genome.ucsc.edu

• SSAHA at Ensembl uses a similar strategy as BLAT.See http://www.ensembl.org

Page 136

To access BLAT, visit http://genome.ucsc.edu

“BLAT on DNA is designed to quickly find sequences of 95%and greater similarity of length 40 bases or more. It may missmore divergent or shorter sequence alignments. It will findperfect sequence matches of 33 bases, and sometimes findthem down to 20 bases. BLAT on proteins finds sequences of80% and greater similarity of length 20 amino acids or more.In practice DNA BLAT works well on primates, and proteinblat on land vertebrates.” --BLAT website

Paste DNA or protein sequencehere in the FASTA format

BLAT output includes browser and other formats

Position specific iterated BLAST: PSI-BLAST

The purpose of PSI-BLAST is to look deeperinto the database for matches to your queryprotein sequence by employing a scoringmatrix that is customized to your query.

Page 137

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

Page 138



[2] PSI-BLAST constructs a multiple sequence alignmentthen creates a “profile” or specialized position-specificscoring matrix (PSSM)

Page 138

R,I,K C D,E,T K,R,T N,L,Y,G

Fig. 5.9Page 139

A R N D C Q E G H I L K M F P S T W Y V 1 M -1 -2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3 -2 -1 -2 -1 1 2 K -1 1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1 0 -1 -3 -2 -3 3 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 4 V 0 -3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3 -2 0 -3 -1 4 5 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 6 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 7 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 8 L -1 -3 -3 -4 -1 -3 -3 -4 -3 2 2 -3 1 3 -3 -2 -1 -2 0 3 9 L -1 -3 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 2

10 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 11 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 12 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 13 W -2 -3 -4 -4 -2 -2 -3 -4 -3 1 4 -3 2 1 -3 -3 -2 7 0 0 14 A 3 -2 -1 -2 -1 -1 -2 4 -2 -2 -2 -1 -2 -3 -1 1 -1 -3 -3 -1 15 A 2 -1 0 -1 -2 2 0 2 -1 -3 -3 0 -2 -3 -1 3 0 -3 -2 -2 16 A 4 -2 -1 -2 -1 -1 -1 3 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 -1 ...37 S 2 -1 0 -1 -1 0 0 0 -1 -2 -3 0 -2 -3 -1 4 1 -3 -2 -2 38 G 0 -3 -1 -2 -3 -2 -2 6 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 39 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -3 -2 0 40 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 41 Y -2 -2 -2 -3 -3 -2 -2 -3 2 -2 -1 -2 -1 3 -3 -2 -2 2 7 -1 42 A 4 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0

Fig. 5.10Page 140

A R N D C Q E G H I L K M F P S T W Y V 1 M -1 -2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3 -2 -1 -2 -1 1 2 K -1 1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1 0 -1 -3 -2 -3 3 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 4 V 0 -3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3 -2 0 -3 -1 4 5 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 6 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 7 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 8 L -1 -3 -3 -4 -1 -3 -3 -4 -3 2 2 -3 1 3 -3 -2 -1 -2 0 3 9 L -1 -3 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 2

10 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 11 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 12 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 13 W -2 -3 -4 -4 -2 -2 -3 -4 -3 1 4 -3 2 1 -3 -3 -2 7 0 0 14 A 3 -2 -1 -2 -1 -1 -2 4 -2 -2 -2 -1 -2 -3 -1 1 -1 -3 -3 -1 15 A 2 -1 0 -1 -2 2 0 2 -1 -3 -3 0 -2 -3 -1 3 0 -3 -2 -2 16 A 4 -2 -1 -2 -1 -1 -1 3 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 -1 ...37 S 2 -1 0 -1 -1 0 0 0 -1 -2 -3 0 -2 -3 -1 4 1 -3 -2 -2 38 G 0 -3 -1 -2 -3 -2 -2 6 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 39 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -3 -2 0 40 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 41 Y -2 -2 -2 -3 -3 -2 -2 -3 2 -2 -1 -2 -1 3 -3 -2 -2 2 7 -1 42 A 4 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0

Fig. 5.10Page 140




[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

Page 138

Fig. 5.11Page 141




[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

[5] Repeat steps [3] and [4] iteratively, typically 5 times.At each new search, a new profile is used as the query.

Page 138

Results of a PSI-BLAST search

# hitsIteration # hits > threshold

1 104 492 173 963 236 1784 301 2405 344 2836 342 2987 378 3108 382 320 Table 5-4

Page 141

Score = 46.2 bits (108), Expect = 2e-04Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86V+ENFD ++ G WY + +K P + I A +S+ E G + K ++

Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC

Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163L ++D + ++ R+P LPPE

Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 1

Fig. 5.12Page 142


Score = 140 bits (353), Expect = 1e-32Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)

Query: 4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55V L+ LA A + +F V+ENFD ++ G WY + +K P +

Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112I A +S+ E G + K + D + V ++ +PAK +++++ +

Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE

Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159

Fig. 5.12Page 142



Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54V L+ LA A + S V+ENFD ++ G WY + K

Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114+ I A +S+ E G + K V + ++ +PAK +++++ +

Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC + ++ R+P LPPE

Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Fig. 5.12Page 142


Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54V L+ LA A + S V+ENFD ++ G WY + K

Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114+ I A +S+ E G + K V + ++ +PAK +++++ +

Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC + ++ R+P LPPE

Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Score = 46.2 bits (108), Expect = 2e-04Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86V+ENFD ++ G WY + +K P + I A +S+ E G + K ++

Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC

Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163L ++D + ++ R+P LPPE

Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158

1

3

Fig. 5.12Page 142

The universe of lipocalins (each dot is a protein)

retinol-binding protein

odorant-binding protein

apolipoprotein D

Fig. 5.13Page 143

Scoring matrices let you focus on the big (or small) picture


your RBP query Fig. 5.13Page 143

Scoring matrices let you focus on the big (or small) picture

retinol-binding proteinretinol-binding

protein

PAM250

PAM30

Blosum45

Blosum80

Fig. 5.13Page 143

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM



Fig. 5.13Page 143

PSI-BLAST: performance assessment

Evaluate PSI-BLAST results using a database in which protein structures have been solved and allproteins in a group share < 40% amino acid identity.

Page 143

PSI-BLAST: the problem of corruption

PSI-BLAST is useful to detect weak but biologicallymeaningful relationships between proteins.

The main source of false positives is the spuriousamplification of sequences not related to the query.For instance, a query with a coiled-coil motif maydetect thousands of other proteins with this motifthat are not homologous.

Once even a single spurious protein is includedin a PSI-BLAST search above threshold, it will notgo away.

Page 144

PSI-BLAST: the problem of corruption

Corruption is defined as the presence of at least onefalse positive alignment with an E value < 10-4

after five iterations.

Three approaches to stopping corruption:

[1] Apply filtering of biased composition regions

[2] Adjust E value from 0.001 (default) to a lowervalue such as E = 0.0001.

[3] Visually inspect the output from each iteration.Remove suspicious hits by unchecking the box.

Page 144

Fig. 5.11Page 141

Fig. 5.21Page 152

See problem 5-1 (page 153): create an artificial proteinconsisting of RBP4 and protein kinase C. How doesPSI-BLAST perform?

Fig. 5.21Page 152

PHI-BLAST: Pattern hit initiated BLAST

Launches from the same page as PSI-BLAST

Combines matching of regular expressions with local alignments surrounding the match.

Page 145

PHI-BLAST: Pattern hit intiated BLAST

Launches from the same page as PSI-BLAST

Combines matching of regular expressions with local alignments surrounding the match.

Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.

Page 145

1 50ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD

vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLDhsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

Align three lipocalins (RBP and two bacterial lipocalins)

Fig. 5.14Page 145



GTWYEIK AV

M

Pick a small, conserved region and see which amino acidresidues are used

Fig. 5.14Page 145



GTWYEIK AV

M

GXW[YF][EA][IVLM]

Create a pattern using the appropriate syntax

Fig. 5.14Page 145

Fig. 5.15Page 146

Fig. 5.16Page 147

Syntax rules for PHI-BLAST

The syntax for patterns in PHI-BLAST follows the conventions of PROSITE (protein lecture, Chapter 8).

When using the stand-alone program, it is permissible to have multiple patterns. When using the Web-page only one pattern is allowed per query.

[ ] means any one of the characters enclosed in the bracketse.g., [LFYT] means one occurrence of L or F or Y or T

- means nothing (spacer character)

x(5) means 5 positions in which any residue is allowed

x(2,4) means 2 to 4 positions where any residue is allowed

BLAST for gene discovery

You can use BLAST to find a “novel” gene

Page 147



Note to students taking this class for credit:You will need to do this for 40% of your grade.

In the first five years of this course,everyone has succeeded at this exercise.

Page 147

Start with the sequence of a known protein

Search a DNA database(e.g. HTGS, dbEST, or genomic sequencefrom a specific organism)

tblastn

Search your DNA or protein against a protein database (nr) to confirm you haveidentified a novel gene

blastxor

blastpnr

Find matches…[1] to DNA encoding

known proteins[2] to DNA encoding

related (novel!) proteins[3] to false positives

inspect

Fig. 5.17Page 148

Page 149

Page 149

(Page 150)

this is a good candidatefor a novel gene/protein

A blastp nr search confirms thatthe Salmonella query is closelyrelated to other lipocalins

(Page 150)



Note to students taking this class for credit:You will need to do this for 30% of your grade.

Ideally, try to find a new gene this week. Youcan discuss it anytime with me or Jen, Hugh and Meg. You should have your novel proteinby October 18 (for the first phylogeny lecture)so you can put your novel protein into a tree.

I will provide sample projects from last year.

Documents

Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. [email protected]. Johns Hopkins