64
Advanced BLAST Searching September 13, 2006 Jonathan Pevsner, Ph.D. Introduction to Bioinformatics [email protected] Johns Hopkins School of Public Health (260.602.01)

Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. [email protected]. Johns Hopkins

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Advanced BLAST Searching

September 13, 2006

Jonathan Pevsner, Ph.D.Introduction to Bioinformatics

[email protected] Hopkins School of

Public Health (260.602.01)

Page 2: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Copyright notice

Many of the images in this powerpoint presentationare from Bioinformatics and Functional Genomicsby Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

These images and materials may not be usedwithout permission from the publisher. We welcomeinstructors to use these powerpoints for educationalpurposes, but please acknowledge the source.

The book has a homepage at http://www.bioinfbook.orgincluding hyperlinks to the book chapters.

Page 3: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Outline of today’s lecture

Organism-specific BLAST sites

Specialized BLAST-related algorithms

BLAST-like tools for genomic DNA searches

PSI-BLAST

PHI-BLAST

Find-a-gene

Page 4: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Specialized BLAST servers

Organism-specific BLAST sites

Page 128

Page 5: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.1Page 131

Page 6: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Ensembl BLAST output includes an ideogram

Fig. 5.2Page 131

Page 7: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

TIGR BLAST

Fig. 5.3Page 132

Page 8: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.4Page 132

Page 9: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.5Page 133

Page 10: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

BLAST output of ProDom server: graphical view of domains

Fig. 5.6Page 134

Page 11: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Specialized BLAST servers

Molecule-specific BLAST sites

Species-specific BLAST sites

Specialized algorithms (WU-BLAST 2.0)

Page 135

Page 12: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.7Page 136

Page 13: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.8Page 137

Page 14: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

FASTA server at the University of Virginia

Page 15: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

We will discuss the Conserved Domain Database(CDD) in chapter 10(multiple sequence alignment)

Page 16: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

We will discuss the Conserved Domain Database(CDD) in chapter 10(multiple sequence alignment)

Page 17: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins
Page 18: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins
Page 19: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

BLAST-related tools for genomic DNA

The analysis of genomic DNA presents special challenges:• There are exons (protein-coding sequence) andintrons (intervening sequences).

• There may be sequencing errors or polymorphisms• The comparison may between be related species(e.g. human and mouse)

Page 135

Page 20: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

BLAST-related tools for genomic DNA

Recently developed tools include:

• MegaBLAST at NCBI.

• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror imageof the BLAST strategy. See http://genome.ucsc.edu

• SSAHA at Ensembl uses a similar strategy as BLAT.See http://www.ensembl.org

Page 136

Page 21: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

To access BLAT, visit http://genome.ucsc.edu

“BLAT on DNA is designed to quickly find sequences of 95%and greater similarity of length 40 bases or more. It may missmore divergent or shorter sequence alignments. It will findperfect sequence matches of 33 bases, and sometimes findthem down to 20 bases. BLAT on proteins finds sequences of80% and greater similarity of length 20 amino acids or more.In practice DNA BLAT works well on primates, and proteinblat on land vertebrates.” --BLAT website

Page 22: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Paste DNA or protein sequencehere in the FASTA format

Page 23: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

BLAT output includes browser and other formats

Page 24: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Position specific iterated BLAST: PSI-BLAST

The purpose of PSI-BLAST is to look deeperinto the database for matches to your queryprotein sequence by employing a scoringmatrix that is customized to your query.

Page 137

Page 25: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

Page 138

Page 26: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignmentthen creates a “profile” or specialized position-specificscoring matrix (PSSM)

Page 138

Page 27: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

R,I,K C D,E,T K,R,T N,L,Y,G

Fig. 5.9Page 139

Page 28: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

A R N D C Q E G H I L K M F P S T W Y V 1 M -1 -2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3 -2 -1 -2 -1 1 2 K -1 1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1 0 -1 -3 -2 -3 3 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 4 V 0 -3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3 -2 0 -3 -1 4 5 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 6 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 7 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 8 L -1 -3 -3 -4 -1 -3 -3 -4 -3 2 2 -3 1 3 -3 -2 -1 -2 0 3 9 L -1 -3 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 2

10 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 11 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 12 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 13 W -2 -3 -4 -4 -2 -2 -3 -4 -3 1 4 -3 2 1 -3 -3 -2 7 0 0 14 A 3 -2 -1 -2 -1 -1 -2 4 -2 -2 -2 -1 -2 -3 -1 1 -1 -3 -3 -1 15 A 2 -1 0 -1 -2 2 0 2 -1 -3 -3 0 -2 -3 -1 3 0 -3 -2 -2 16 A 4 -2 -1 -2 -1 -1 -1 3 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 -1 ...37 S 2 -1 0 -1 -1 0 0 0 -1 -2 -3 0 -2 -3 -1 4 1 -3 -2 -2 38 G 0 -3 -1 -2 -3 -2 -2 6 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 39 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -3 -2 0 40 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 41 Y -2 -2 -2 -3 -3 -2 -2 -3 2 -2 -1 -2 -1 3 -3 -2 -2 2 7 -1 42 A 4 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0

Fig. 5.10Page 140

Page 29: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

A R N D C Q E G H I L K M F P S T W Y V 1 M -1 -2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3 -2 -1 -2 -1 1 2 K -1 1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1 0 -1 -3 -2 -3 3 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 4 V 0 -3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3 -2 0 -3 -1 4 5 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 6 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 7 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 8 L -1 -3 -3 -4 -1 -3 -3 -4 -3 2 2 -3 1 3 -3 -2 -1 -2 0 3 9 L -1 -3 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 2

10 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 11 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 12 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 13 W -2 -3 -4 -4 -2 -2 -3 -4 -3 1 4 -3 2 1 -3 -3 -2 7 0 0 14 A 3 -2 -1 -2 -1 -1 -2 4 -2 -2 -2 -1 -2 -3 -1 1 -1 -3 -3 -1 15 A 2 -1 0 -1 -2 2 0 2 -1 -3 -3 0 -2 -3 -1 3 0 -3 -2 -2 16 A 4 -2 -1 -2 -1 -1 -1 3 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 -1 ...37 S 2 -1 0 -1 -1 0 0 0 -1 -2 -3 0 -2 -3 -1 4 1 -3 -2 -2 38 G 0 -3 -1 -2 -3 -2 -2 6 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 39 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -3 -2 0 40 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 41 Y -2 -2 -2 -3 -3 -2 -2 -3 2 -2 -1 -2 -1 3 -3 -2 -2 2 7 -1 42 A 4 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0

Fig. 5.10Page 140

Page 30: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignmentthen creates a “profile” or specialized position-specificscoring matrix (PSSM)

[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

Page 138

Page 31: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.11Page 141

Page 32: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST is performed in five steps

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignmentthen creates a “profile” or specialized position-specificscoring matrix (PSSM)

[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

[5] Repeat steps [3] and [4] iteratively, typically 5 times.At each new search, a new profile is used as the query.

Page 138

Page 33: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Results of a PSI-BLAST search

# hitsIteration # hits > threshold

1 104 492 173 963 236 1784 301 2405 344 2836 342 2987 378 3108 382 320 Table 5-4

Page 141

Page 34: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Score = 46.2 bits (108), Expect = 2e-04Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86V+ENFD ++ G WY + +K P + I A +S+ E G + K ++

Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC

Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163L ++D + ++ R+P LPPE

Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 1

Fig. 5.12Page 142

Page 35: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2

Score = 140 bits (353), Expect = 1e-32Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)

Query: 4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55V L+ LA A + +F V+ENFD ++ G WY + +K P +

Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112I A +S+ E G + K + D + V ++ +PAK +++++ +

Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE

Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159

Fig. 5.12Page 142

Page 36: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3

Score = 159 bits (404), Expect = 1e-38Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54V L+ LA A + S V+ENFD ++ G WY + K

Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114+ I A +S+ E G + K V + ++ +PAK +++++ +

Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC + ++ R+P LPPE

Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Fig. 5.12Page 142

Page 37: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Score = 159 bits (404), Expect = 1e-38Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54V L+ LA A + S V+ENFD ++ G WY + K

Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114+ I A +S+ E G + K V + ++ +PAK +++++ +

Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC + ++ R+P LPPE

Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Score = 46.2 bits (108), Expect = 2e-04Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86V+ENFD ++ G WY + +K P + I A +S+ E G + K ++

Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC

Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163L ++D + ++ R+P LPPE

Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158

1

3

Fig. 5.12Page 142

Page 38: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

The universe of lipocalins (each dot is a protein)

retinol-binding protein

odorant-binding protein

apolipoprotein D

Fig. 5.13Page 143

Page 39: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Scoring matrices let you focus on the big (or small) picture

retinol-binding protein

your RBP query Fig. 5.13Page 143

Page 40: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Scoring matrices let you focus on the big (or small) picture

retinol-binding proteinretinol-binding

protein

PAM250

PAM30

Blosum45

Blosum80

Fig. 5.13Page 143

Page 41: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM

retinol-binding protein

retinol-binding protein

Fig. 5.13Page 143

Page 42: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST: performance assessment

Evaluate PSI-BLAST results using a database in which protein structures have been solved and allproteins in a group share < 40% amino acid identity.

Page 143

Page 43: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST: the problem of corruption

PSI-BLAST is useful to detect weak but biologicallymeaningful relationships between proteins.

The main source of false positives is the spuriousamplification of sequences not related to the query.For instance, a query with a coiled-coil motif maydetect thousands of other proteins with this motifthat are not homologous.

Once even a single spurious protein is includedin a PSI-BLAST search above threshold, it will notgo away.

Page 144

Page 44: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PSI-BLAST: the problem of corruption

Corruption is defined as the presence of at least onefalse positive alignment with an E value < 10-4

after five iterations.

Three approaches to stopping corruption:

[1] Apply filtering of biased composition regions

[2] Adjust E value from 0.001 (default) to a lowervalue such as E = 0.0001.

[3] Visually inspect the output from each iteration.Remove suspicious hits by unchecking the box.

Page 144

Page 45: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.11Page 141

Page 46: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.21Page 152

See problem 5-1 (page 153): create an artificial proteinconsisting of RBP4 and protein kinase C. How doesPSI-BLAST perform?

Page 47: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.21Page 152

Page 48: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PHI-BLAST: Pattern hit initiated BLAST

Launches from the same page as PSI-BLAST

Combines matching of regular expressions with local alignments surrounding the match.

Page 145

Page 49: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

PHI-BLAST: Pattern hit intiated BLAST

Launches from the same page as PSI-BLAST

Combines matching of regular expressions with local alignments surrounding the match.

Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.

Page 145

Page 50: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

1 50ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD

vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLDhsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

Align three lipocalins (RBP and two bacterial lipocalins)

Fig. 5.14Page 145

Page 51: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

1 50ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD

vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLDhsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

GTWYEIK AV

M

Pick a small, conserved region and see which amino acidresidues are used

Fig. 5.14Page 145

Page 52: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

1 50ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD

vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLDhsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

GTWYEIK AV

M

GXW[YF][EA][IVLM]

Create a pattern using the appropriate syntax

Fig. 5.14Page 145

Page 53: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.15Page 146

Page 54: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Fig. 5.16Page 147

Page 55: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Syntax rules for PHI-BLAST

The syntax for patterns in PHI-BLAST follows the conventions of PROSITE (protein lecture, Chapter 8).

When using the stand-alone program, it is permissible to have multiple patterns. When using the Web-page only one pattern is allowed per query.

[ ] means any one of the characters enclosed in the bracketse.g., [LFYT] means one occurrence of L or F or Y or T

- means nothing (spacer character)

x(5) means 5 positions in which any residue is allowed

x(2,4) means 2 to 4 positions where any residue is allowed

Page 56: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

BLAST for gene discovery

You can use BLAST to find a “novel” gene

Page 147

Page 57: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

BLAST for gene discovery

You can use BLAST to find a “novel” gene

Note to students taking this class for credit:You will need to do this for 40% of your grade.

In the first five years of this course,everyone has succeeded at this exercise.

Page 147

Page 58: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Start with the sequence of a known protein

Search a DNA database(e.g. HTGS, dbEST, or genomic sequencefrom a specific organism)

tblastn

Search your DNA or protein against a protein database (nr) to confirm you haveidentified a novel gene

blastxor

blastpnr

Find matches…[1] to DNA encoding

known proteins[2] to DNA encoding

related (novel!) proteins[3] to false positives

inspect

Fig. 5.17Page 148

Page 59: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Page 149

Page 60: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

Page 149

Page 61: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

(Page 150)

Page 62: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

this is a good candidatefor a novel gene/protein

Page 63: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

A blastp nr search confirms thatthe Salmonella query is closelyrelated to other lipocalins

(Page 150)

Page 64: Advanced BLAST Searching · 2014. 5. 26. · Advanced BLAST Searching. September 13, 2006. Jonathan Pevsner, Ph.D. Introduction to Bioinformatics. pevsner@jhmi.edu. Johns Hopkins

BLAST for gene discovery

You can use BLAST to find a “novel” gene

Note to students taking this class for credit:You will need to do this for 30% of your grade.

Ideally, try to find a new gene this week. Youcan discuss it anytime with me or Jen, Hugh and Meg. You should have your novel proteinby October 18 (for the first phylogeny lecture)so you can put your novel protein into a tree.

I will provide sample projects from last year.