Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Advanced BLAST Searching
September 13, 2006
Jonathan Pevsner, Ph.D.Introduction to Bioinformatics
[email protected] Hopkins School of
Public Health (260.602.01)
Copyright notice
Many of the images in this powerpoint presentationare from Bioinformatics and Functional Genomicsby Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
These images and materials may not be usedwithout permission from the publisher. We welcomeinstructors to use these powerpoints for educationalpurposes, but please acknowledge the source.
The book has a homepage at http://www.bioinfbook.orgincluding hyperlinks to the book chapters.
Outline of today’s lecture
Organism-specific BLAST sites
Specialized BLAST-related algorithms
BLAST-like tools for genomic DNA searches
PSI-BLAST
PHI-BLAST
Find-a-gene
Specialized BLAST servers
Organism-specific BLAST sites
Page 128
Fig. 5.1Page 131
Ensembl BLAST output includes an ideogram
Fig. 5.2Page 131
TIGR BLAST
Fig. 5.3Page 132
Fig. 5.4Page 132
Fig. 5.5Page 133
BLAST output of ProDom server: graphical view of domains
Fig. 5.6Page 134
Specialized BLAST servers
Molecule-specific BLAST sites
Species-specific BLAST sites
Specialized algorithms (WU-BLAST 2.0)
Page 135
Fig. 5.7Page 136
Fig. 5.8Page 137
FASTA server at the University of Virginia
We will discuss the Conserved Domain Database(CDD) in chapter 10(multiple sequence alignment)
We will discuss the Conserved Domain Database(CDD) in chapter 10(multiple sequence alignment)
BLAST-related tools for genomic DNA
The analysis of genomic DNA presents special challenges:• There are exons (protein-coding sequence) andintrons (intervening sequences).
• There may be sequencing errors or polymorphisms• The comparison may between be related species(e.g. human and mouse)
Page 135
BLAST-related tools for genomic DNA
Recently developed tools include:
• MegaBLAST at NCBI.
• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror imageof the BLAST strategy. See http://genome.ucsc.edu
• SSAHA at Ensembl uses a similar strategy as BLAT.See http://www.ensembl.org
Page 136
To access BLAT, visit http://genome.ucsc.edu
“BLAT on DNA is designed to quickly find sequences of 95%and greater similarity of length 40 bases or more. It may missmore divergent or shorter sequence alignments. It will findperfect sequence matches of 33 bases, and sometimes findthem down to 20 bases. BLAT on proteins finds sequences of80% and greater similarity of length 20 amino acids or more.In practice DNA BLAT works well on primates, and proteinblat on land vertebrates.” --BLAT website
Paste DNA or protein sequencehere in the FASTA format
BLAT output includes browser and other formats
Position specific iterated BLAST: PSI-BLAST
The purpose of PSI-BLAST is to look deeperinto the database for matches to your queryprotein sequence by employing a scoringmatrix that is customized to your query.
Page 137
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
Page 138
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignmentthen creates a “profile” or specialized position-specificscoring matrix (PSSM)
Page 138
R,I,K C D,E,T K,R,T N,L,Y,G
Fig. 5.9Page 139
A R N D C Q E G H I L K M F P S T W Y V 1 M -1 -2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3 -2 -1 -2 -1 1 2 K -1 1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1 0 -1 -3 -2 -3 3 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 4 V 0 -3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3 -2 0 -3 -1 4 5 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 6 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 7 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 8 L -1 -3 -3 -4 -1 -3 -3 -4 -3 2 2 -3 1 3 -3 -2 -1 -2 0 3 9 L -1 -3 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 2
10 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 11 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 12 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 13 W -2 -3 -4 -4 -2 -2 -3 -4 -3 1 4 -3 2 1 -3 -3 -2 7 0 0 14 A 3 -2 -1 -2 -1 -1 -2 4 -2 -2 -2 -1 -2 -3 -1 1 -1 -3 -3 -1 15 A 2 -1 0 -1 -2 2 0 2 -1 -3 -3 0 -2 -3 -1 3 0 -3 -2 -2 16 A 4 -2 -1 -2 -1 -1 -1 3 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 -1 ...37 S 2 -1 0 -1 -1 0 0 0 -1 -2 -3 0 -2 -3 -1 4 1 -3 -2 -2 38 G 0 -3 -1 -2 -3 -2 -2 6 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 39 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -3 -2 0 40 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 41 Y -2 -2 -2 -3 -3 -2 -2 -3 2 -2 -1 -2 -1 3 -3 -2 -2 2 7 -1 42 A 4 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0
Fig. 5.10Page 140
A R N D C Q E G H I L K M F P S T W Y V 1 M -1 -2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3 -2 -1 -2 -1 1 2 K -1 1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1 0 -1 -3 -2 -3 3 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 4 V 0 -3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3 -2 0 -3 -1 4 5 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 6 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 7 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 8 L -1 -3 -3 -4 -1 -3 -3 -4 -3 2 2 -3 1 3 -3 -2 -1 -2 0 3 9 L -1 -3 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 2
10 L -2 -2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3 -3 -1 -2 -1 1 11 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 12 A 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0 13 W -2 -3 -4 -4 -2 -2 -3 -4 -3 1 4 -3 2 1 -3 -3 -2 7 0 0 14 A 3 -2 -1 -2 -1 -1 -2 4 -2 -2 -2 -1 -2 -3 -1 1 -1 -3 -3 -1 15 A 2 -1 0 -1 -2 2 0 2 -1 -3 -3 0 -2 -3 -1 3 0 -3 -2 -2 16 A 4 -2 -1 -2 -1 -1 -1 3 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 -1 ...37 S 2 -1 0 -1 -1 0 0 0 -1 -2 -3 0 -2 -3 -1 4 1 -3 -2 -2 38 G 0 -3 -1 -2 -3 -2 -2 6 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4 39 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -3 -2 0 40 W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3 41 Y -2 -2 -2 -3 -3 -2 -2 -3 2 -2 -1 -2 -1 3 -3 -2 -2 2 7 -1 42 A 4 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 -2 0
Fig. 5.10Page 140
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignmentthen creates a “profile” or specialized position-specificscoring matrix (PSSM)
[3] The PSSM is used as a query against the database
[4] PSI-BLAST estimates statistical significance (E values)
Page 138
Fig. 5.11Page 141
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignmentthen creates a “profile” or specialized position-specificscoring matrix (PSSM)
[3] The PSSM is used as a query against the database
[4] PSI-BLAST estimates statistical significance (E values)
[5] Repeat steps [3] and [4] iteratively, typically 5 times.At each new search, a new profile is used as the query.
Page 138
Results of a PSI-BLAST search
# hitsIteration # hits > threshold
1 104 492 173 963 236 1784 301 2405 344 2836 342 2987 378 3108 382 320 Table 5-4
Page 141
Score = 46.2 bits (108), Expect = 2e-04Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82
Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135
Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163L ++D + ++ R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 1
Fig. 5.12Page 142
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)
Query: 4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55V L+ LA A + +F V+ENFD ++ G WY + +K P +
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60
Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112I A +S+ E G + K + D + V ++ +PAK +++++ +
Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112
Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE
Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159
Fig. 5.12Page 142
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114+ I A +S+ E G + K V + ++ +PAK +++++ +
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC + ++ R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
Fig. 5.12Page 142
Score = 159 bits (404), Expect = 1e-38Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114+ I A +S+ E G + K V + ++ +PAK +++++ +
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164+WI+ TDY+ YA+ YSC + ++ R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
Score = 46.2 bits (108), Expect = 2e-04Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82
Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135
Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163L ++D + ++ R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
1
3
Fig. 5.12Page 142
The universe of lipocalins (each dot is a protein)
retinol-binding protein
odorant-binding protein
apolipoprotein D
Fig. 5.13Page 143
Scoring matrices let you focus on the big (or small) picture
retinol-binding protein
your RBP query Fig. 5.13Page 143
Scoring matrices let you focus on the big (or small) picture
retinol-binding proteinretinol-binding
protein
PAM250
PAM30
Blosum45
Blosum80
Fig. 5.13Page 143
PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM
retinol-binding protein
retinol-binding protein
Fig. 5.13Page 143
PSI-BLAST: performance assessment
Evaluate PSI-BLAST results using a database in which protein structures have been solved and allproteins in a group share < 40% amino acid identity.
Page 143
PSI-BLAST: the problem of corruption
PSI-BLAST is useful to detect weak but biologicallymeaningful relationships between proteins.
The main source of false positives is the spuriousamplification of sequences not related to the query.For instance, a query with a coiled-coil motif maydetect thousands of other proteins with this motifthat are not homologous.
Once even a single spurious protein is includedin a PSI-BLAST search above threshold, it will notgo away.
Page 144
PSI-BLAST: the problem of corruption
Corruption is defined as the presence of at least onefalse positive alignment with an E value < 10-4
after five iterations.
Three approaches to stopping corruption:
[1] Apply filtering of biased composition regions
[2] Adjust E value from 0.001 (default) to a lowervalue such as E = 0.0001.
[3] Visually inspect the output from each iteration.Remove suspicious hits by unchecking the box.
Page 144
Fig. 5.11Page 141
Fig. 5.21Page 152
See problem 5-1 (page 153): create an artificial proteinconsisting of RBP4 and protein kinase C. How doesPSI-BLAST perform?
Fig. 5.21Page 152
PHI-BLAST: Pattern hit initiated BLAST
Launches from the same page as PSI-BLAST
Combines matching of regular expressions with local alignments surrounding the match.
Page 145
PHI-BLAST: Pattern hit intiated BLAST
Launches from the same page as PSI-BLAST
Combines matching of regular expressions with local alignments surrounding the match.
Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.
Page 145
1 50ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLDhsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
Align three lipocalins (RBP and two bacterial lipocalins)
Fig. 5.14Page 145
1 50ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLDhsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
GTWYEIK AV
M
Pick a small, conserved region and see which amino acidresidues are used
Fig. 5.14Page 145
1 50ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLDhsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
GTWYEIK AV
M
GXW[YF][EA][IVLM]
Create a pattern using the appropriate syntax
Fig. 5.14Page 145
Fig. 5.15Page 146
Fig. 5.16Page 147
Syntax rules for PHI-BLAST
The syntax for patterns in PHI-BLAST follows the conventions of PROSITE (protein lecture, Chapter 8).
When using the stand-alone program, it is permissible to have multiple patterns. When using the Web-page only one pattern is allowed per query.
[ ] means any one of the characters enclosed in the bracketse.g., [LFYT] means one occurrence of L or F or Y or T
- means nothing (spacer character)
x(5) means 5 positions in which any residue is allowed
x(2,4) means 2 to 4 positions where any residue is allowed
BLAST for gene discovery
You can use BLAST to find a “novel” gene
Page 147
BLAST for gene discovery
You can use BLAST to find a “novel” gene
Note to students taking this class for credit:You will need to do this for 40% of your grade.
In the first five years of this course,everyone has succeeded at this exercise.
Page 147
Start with the sequence of a known protein
Search a DNA database(e.g. HTGS, dbEST, or genomic sequencefrom a specific organism)
tblastn
Search your DNA or protein against a protein database (nr) to confirm you haveidentified a novel gene
blastxor
blastpnr
Find matches…[1] to DNA encoding
known proteins[2] to DNA encoding
related (novel!) proteins[3] to false positives
inspect
Fig. 5.17Page 148
Page 149
Page 149
(Page 150)
this is a good candidatefor a novel gene/protein
A blastp nr search confirms thatthe Salmonella query is closelyrelated to other lipocalins
(Page 150)
BLAST for gene discovery
You can use BLAST to find a “novel” gene
Note to students taking this class for credit:You will need to do this for 30% of your grade.
Ideally, try to find a new gene this week. Youcan discuss it anytime with me or Jen, Hugh and Meg. You should have your novel proteinby October 18 (for the first phylogeny lecture)so you can put your novel protein into a tree.
I will provide sample projects from last year.