National Center for Biotechnology Information. A Field Guide part 2. UT-Health Science Center. February 14, 2006.
NCBI FieldGuide
A Field Guide part 2
February 14, 2006 UT-Health Science Center
National Center for Biotechnology Information
GenBank Records
Header
Feature Table
Sequence
The Flatfile Format
A Typical GenBank Record
LOCUS       NM_019570   4279 bp   mRNA   linear   INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .
= Title
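As a rough illustration of how the header fields line up, the LOCUS line from the record above can be split on whitespace. This is only a minimal sketch, assuming the whitespace-separated layout shown on the slide; the `parse_locus` helper and its field names are hypothetical and not part of any NCBI tool.

```python
def parse_locus(line):
    """Split a GenBank LOCUS line into named fields (whitespace layout assumed)."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout")
    return {
        "name": parts[1],            # locus name (here identical to the accession)
        "length_bp": int(parts[2]),  # sequence length in base pairs
        "molecule": parts[4],        # molecule type, e.g. mRNA
        "topology": parts[5],        # linear or circular
        "division": parts[6],        # GenBank division code
        "date": parts[7],            # last modification date
    }


record = parse_locus("LOCUS NM_019570 4279 bp mRNA linear INV 28-OCT-2004")
print(record["name"], record["length_bp"], record["molecule"], record["date"])
```

Real flatfiles use fixed column positions, so a production parser would not rely on a simple split.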
GenBank Record: Feature Table
GenPept identifier
GenBank Record: Feature Table, cont.
GenBank Record: Sequence
Indexing for Nucleotide UID 59958365
Field                  Indexed Terms
[primary accession]    NM_001012399
[title]                Bos taurus hemochromatosis (hfe), mRNA.
[organism]             Bos taurus
[sequence length]      1168
[modification date]    2005/02/19
[properties]           biomol mrna; gbdiv mam; srcdb refseq
Field abbreviations: [accn], [orgn], [mdat], [prop]
Global Entrez Search: HFE
HFE
Entrez Nucleotide: HFE 137 records
Not HFE [Title]
Smarter Query
hfe[title] AND human[orgn]
42 records
Curated HFE splice variants (11 total)
hfe[title] AND human[orgn] (cont.)
Primary data
Preview/Index
Gateway to Advanced Searches
Preview/Index: Properties, srcdb
srcdb
Properties
…AND srcdb refseq[Properties]
…AND srcdb ddbj/embl/genbank[Properties]
Database Queries
#1 hfe 137
#2 hfe[title] AND human[orgn] 42
#3 #2 AND srcdb refseq[prop] 11
#4 #2 AND srcdb ddbj/embl/genbank[prop] 31
#5 #4 AND gbdiv pri[prop] 29
#6 #4 AND gbdiv est[prop] 2
Primate division: gbdiv pri[prop]
EST division: gbdiv est[prop]
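The way these queries build on one another can be sketched in a few lines of Python. The slides run the searches through the Entrez web pages; the E-utilities `esearch` endpoint, the `nucleotide` database name, and the helper functions below are assumptions added for illustration only. The sketch builds the query string and URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities esearch endpoint (not mentioned on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_terms(*terms):
    """Join Entrez terms with AND, wrapping each one in parentheses."""
    return " AND ".join(f"({t})" for t in terms)


def esearch_url(db, query):
    """Build (but do not send) an esearch request URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


base = and_terms("hfe[title]", "human[orgn]")        # corresponds to query #2
refseq_only = and_terms(base, "srcdb refseq[prop]")  # corresponds to query #3
print(refseq_only)
print(esearch_url("nucleotide", refseq_only))
```

Wrapping each term in parentheses keeps the refinement unambiguous when an earlier query is reused inside a later one, which is what the `#2 AND …` history syntax does on the web page.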
Molecule Queries
#1 hfe 116
#2 hfe[title] AND human[orgn] 42
#3 #2 AND biomol mrna[prop] 29
#4 #2 AND biomol genomic[prop] 13
Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]
More Queries…
Fields are database-specific
Entrez Nucleotide
Reviewed RefSeqs with transcript variants:
srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]
Topoisomerase genes from Archaea:
topoisomerase[gene name] AND archaea[organism]
Entrez Gene
Genes on human chromosome 2 with OMIM links
2[chromosome] AND human[organism] AND “gene omim”[filter]
Membrane proteins linked to cancer:
“integral to plasma membrane”[gene ontology] AND cancer[dis]
Genome Resources
UniGene
Trace Archive
Map Viewer
Genomic Biology
E-PCR
Genomic Biology
Gen Biol: Gen Resources
Map Viewer – Genome Annotation Updates
Genome Projects: microb
13 Eukaryotic Genome Sequencing Projects Selected: Complete - 0, Assembly - 2, In Progress - 11
UniGene
Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping reagents
A Cluster of ESTs
query
5’ EST hits
3’ EST hits
UniGene Collections
Species UniGene
UniGene Hs build 188
UniGene Cluster Hs.95351
Lipase, hormone-sensitive (LIPE)
UniGene Cluster Hs.95351: expression
UniGene Cluster Hs.95351: seqs
Get Sequences
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
E-PCR
Genomic sequence here
Options
Results
Reverse e-PCR
Gene STS
LY6G6D: lymphocyte antigen 6 complex, locus G6D
List View
Human MapViewer
adar
MapViewer: Human ADAR
MV Hs ADAR
3’ UTR
5’ UTR
Maps & Options
Sequence maps: Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST, Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation (= SNP)
Cytogenetic maps: Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
Genetic maps: deCODE, Genethon, Marshfield
RH maps: GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
MapViewer
UniGene
Component
Repeats
Gene
Gene
Phenotype Variation
Maps & Options
Human ADAR
Chimp ADAR
Mouse ADAR
Trace Archive Page
Ciona savignyi Traces
Trace Archive BLAST Page
Potential access to sequences NOT yet in GenBank
Basic Local Alignment Search Tool
BLAST Web Searches, 2005
200,000
Precomputed BLAST Services
Nucleotide or protein: Related Sequences
BLAST link: BLink
Transcript clusters: UniGene
Protein homologs: HomoloGene
Link to Related Sequences
Related Sequences
Most similar
Least similar
BLink (BLAST Link)
BLink Output
Best hits
3D structures
CDD-Search
Why Is BLAST So Popular?
Fast
- heuristic approach based on Smith-Waterman
Local alignments
Statistical significance
- Expect value
Versatile
- blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast
- www, standalone, and network clients
Global vs Local Alignment
Seq 1
Seq 2
Seq 1
Seq 2
Global alignment
Local alignment
Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)
Global
Seq1: 1    W--HEREISWALTERNOW 16
           W  HERE
Seq2: 1  HEWASHEREBUTNOWISHERE 21
Local
Seq1: 1  W--HERE 5        Seq1: 1  W--HERE 5
         W  HERE                   W  HERE
Seq2: 3  WASHERE 9        Seq2: 15 WISHERE 21
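The difference between the two alignment modes can be made concrete with a small dynamic-programming sketch. The code below scores the two example sequences with Needleman-Wunsch (global) and Smith-Waterman (local) recurrences. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not the scores used on the slide or by BLAST.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Every residue must be aligned or gapped, so there is no floor.
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("local :", local_score(seq1, seq2))
print("global:", global_score(seq1, seq2))
```

The only structural difference is the floor at zero in the local recurrence, which lets a short high-scoring region such as HERE stand on its own instead of being penalized for the unrelated flanks.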
How BLAST Works
1. Make lookup table of “words” for query
2. Scan database for hits
3. Extend alignment both directions
– Ungapped extensions of hits (initial HSPs)
– Gapped extensions (no traceback)
– Gapped extensions (traceback - alignment details)
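Steps 1 and 2 above can be sketched as follows. This is a simplified illustration that uses exact word matches only; the function names are hypothetical, and the real BLAST implementation differs in many details (neighborhood words for proteins, two-hit seeding, compressed databases).

```python
from collections import defaultdict


def build_lookup(query, w):
    """Step 1: map every word of length w in the query to its start offsets."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def scan(subject, table, w):
    """Step 2: report (query_offset, subject_offset) for every exact word hit."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


table = build_lookup("ACGTAC", 3)
print(scan("TTACGT", table, 3))
```

Each hit is a seed; step 3 would then try to extend it in both directions.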
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Make a lookup table of words
Word size = 3 (default)
Word size can only be 2 or 3
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Neighborhood Words
VTV 12
LTV 11
VSV 8
etc.
Neighborhood score threshold
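The neighborhood scores listed above can be reproduced by summing substitution scores position by position. The sketch below assumes the three neighborhood words belong to the query word ITV (the slide does not say so explicitly) and hard-codes only the handful of BLOSUM62 entries this example needs. The helper names and the threshold of 11 are illustrative assumptions.

```python
# Only the BLOSUM62 entries needed for this example; the full matrix has 20 x 20.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("I", "V"): 3, ("I", "L"): 2,
    ("T", "T"): 5, ("T", "S"): 1,
    ("V", "V"): 4,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Sum the substitution scores of two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidate words scoring at least `threshold` against the query word."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


print(neighborhood("ITV", ["VTV", "LTV", "VSV"], threshold=11))
```

With a threshold of 11, VTV and LTV would be added to the lookup table for ITV, while VSV would be dropped.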
BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words
YLS 15, YLT 12, YVS 12, YIT 10, etc …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc …
Neighborhood score threshold T (-f) = 11
                           YLS                   HFL
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI  47
          +E  YA YL K    F+   L +SP+ +DVNVHP+K  V   +++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Drop-off score = Highest score – current score
-X  X dropoff value for gapped alignment (in bits): blastn 30, megablast 20, tblastx 0, all others 15
BLASTP Summary, cont.
High-scoring pair (HSP)
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI  47
          +E  YA YL K    F+   L +SP+ +DVNVHP+K  V   +++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Gapped extension with trace back
Final HSP
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV…  50
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
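The drop-off rule (highest score minus current score) can be illustrated with a one-directional ungapped extension. This is a simplified sketch, not the BLAST implementation: it works on raw scores rather than bits, extends only to the right, and the `extend_right` and `simple` names, the +1/-3 scoring, and the drop-off of 5 are assumptions for illustration.

```python
def extend_right(query, subject, qi, si, score_fn, x_drop):
    """Extend to the right without gaps; stop once the running score falls
    more than x_drop below the best score seen so far.
    Returns (best_score, length_of_best_extension)."""
    best = running = 0
    best_len = 0
    k = 0
    while qi + k < len(query) and si + k < len(subject):
        running += score_fn(query[qi + k], subject[si + k])
        k += 1
        if running > best:
            best, best_len = running, k
        elif best - running > x_drop:
            break  # drop-off exceeded: give up on this direction
    return best, best_len


def simple(a, b):
    """Illustrative scoring: +1 for a match, -3 for a mismatch."""
    return 1 if a == b else -3


print(extend_right("AAAATTTT", "AAAACCCC", 0, 0, simple, x_drop=5))
```

The extension keeps going briefly after the first mismatch, because a later run of matches could recover the score. It is abandoned only when the deficit exceeds the drop-off value, and the reported alignment is trimmed back to the best-scoring point.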
Scoring Systems - Nucleotides
      A    G    C    T
A    +1   -3   -3   -3
G    -3   +1   -3   -3
C    -3   -3   +1   -3
T    -3   -3   -3   +1
Identity matrix
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||   raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
[ -r 1 -q -3 ]
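The raw-score arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the `raw_score` helper is hypothetical, and it charges the single gap column the same -3 as a mismatch, which is an assumption made to reproduce the slide's 19 - 9 = 10 (BLAST itself scores gaps with separate gap costs).

```python
def raw_score(top, bottom, reward=1, penalty=-3):
    """Score an alignment column by column.

    Identical residues earn `reward`; mismatches and gap columns are both
    charged `penalty` (assumption made to mirror the slide's arithmetic)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    return sum(
        reward if a == b and a != "-" else penalty
        for a, b in zip(top, bottom)
    )


score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
print(score)
```

The alignment has 19 identical columns and 3 non-identical ones (two mismatches and one gap), giving 19 × 1 + 3 × (-3).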
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
BLOSUM62
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1
  A R N D C Q E G H I L K M F P S T W Y V X
D/F (-3): Negative for less likely substitutions
Y/F (+3): Positive for more likely substitutions
Position-Specific Score Matrix
DAF-1
Serine/Threonine protein kinases catalytic loop
1 7 4PSSM scores 5 4
Position-Specific Score Matrix: catalytic loop
      A R N D C Q E G H I L K M F P S T W Y V
435 K -1 0 0 -1 -2 3 0 3 0 -2 -2 1 -1 -1 -1 -1 -1 -1 -1 -2
436 E 0 1 0 2 -1 0 2 -1 0 -1 -1 0 0 0 -1 0 0 -1 -1 -1
437 S 0 0 -1 0 1 1 0 1 1 0 -1 0 0 0 2 0 -1 -1 0 -1
438 N -1 0 -1 -1 1 0 -1 3 3 -1 -1 1 -1 0 0 -1 -1 1 1 -1
439 K -2 1 1 -1 -2 0 -1 -2 -2 -1 -2 5 1 -2 -2 -1 -1 -2 -2 -1
440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1 0 -3 7 -1 -2 -3 -1 -1
441 A 3 -2 1 -2 0 -1 0 1 -2 -2 -2 0 -1 -2 3 1 0 -3 -3 0
442 M -3 -4 -4 -4 -3 -4 -4 -5 -4 7 0 -4 1 0 -4 -4 -2 -4 -1 2
443 A 4 -4 -4 -4 0 -4 -4 -3 -4 4 -1 -4 -2 -3 -4 -1 -2 -4 -3 4
444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5 0 -5
445 R -4 8 -3 -4 0 -1 -2 -3 -2 -5 -4 0 -3 -2 -4 -3 -3 0 -4 -5
446 D -4 -4 -1 8 -6 -2 0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5
447 I -4 -5 -6 -6 -3 -4 -5 -6 -5 3 5 -5 1 1 -5 -5 -3 -4 -3 1
448 K 0 0 1 -3 -5 -1 -1 -3 -3 -5 -5 7 -4 -5 -3 -1 -2 -5 -4 -4
449 S 0 -3 -2 -3 0 -2 -2 -3 -3 -4 -4 -2 -4 -5 2 6 2 -5 -4 -4
450 K 0 3 0 1 -5 0 0 -4 -1 -4 -3 4 -3 -2 2 1 -1 -5 -4 -4
451 N -4 -3 8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5
452 I -3 -5 -5 -6 0 -5 -5 -6 -5 6 2 -5 2 -2 -5 -4 -3 -5 -3 3
453 M -4 -4 -6 -6 -3 -4 -5 -6 -5 0 6 -5 1 0 -5 -4 -3 -4 -3 0
454 V -3 -3 -5 -6 -3 -4 -5 -6 -5 3 3 -4 2 -2 -5 -4 -3 -5 -3 5
455 K -2 1 1 4 -5 0 -1 -2 1 -4 -2 4 -3 -2 -3 0 -1 -5 -2 -3
456 N 1 1 3 0 -4 -1 1 0 -3 -4 -4 3 -2 -5 -2 2 -2 -5 -4 -4
457 D -3 -2 5 5 -1 -1 1 -1 0 -5 -4 0 -2 -5 -1 0 -2 -6 -4 -5
458 L -3 -1 0 -3 0 -3 -2 3 -4 -2 3 0 1 1 -2 -2 -3 5 -1 -3
Local Alignment Statistics
(applies to ungapped alignments)
$E = K m n\, e^{-\lambda S}$ or $E = m n\, 2^{-S'}$
K = scale for search space
λ = scale for scoring system
S' = bitscore = $(\lambda S - \ln K)/\ln 2$
Expect Value
E = number of database hits you expect to find by chance with score ≥ S
More info: The Statistics of Sequence Similarity Scores
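The two forms of the Expect value are algebraically equivalent, which a short numerical check makes visible. In the sketch below the function names are hypothetical, and the λ and K values are assumptions for illustration (values commonly quoted for BLOSUM62 with gap costs 11/1); the slide itself gives no numbers, and the search-space sizes m and n are arbitrary.

```python
import math


def bit_score(raw, lam, k):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


LAM, K = 0.267, 0.041   # assumed example parameters, not taken from the slides
M, N = 300, 1_000_000   # arbitrary query length and database length

raw = 100
bits = bit_score(raw, LAM, K)
print(bits, expect_from_raw(raw, M, N, LAM, K), expect_from_bits(bits, M, N))
```

Because the bit score already folds in λ and K, bit scores from searches with different scoring systems can be compared directly, whereas raw scores cannot.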
An Alignment BLAST Cannot Make
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Reason: no contiguous exact match of 7 bp.
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
Other BLAST Algorithms
• Megablast
• Discontiguous Megablast
• PSI-BLAST
• PHI-BLAST
Megablast: NCBI’s Genome Annotator
• Long alignments of similar DNA sequences
• Greedy algorithm
• Concatenation of query sequences
• Faster than blastn; less sensitive
MegaBLAST & Word Size
Trade-off: sensitivity vs speed
WORD SIZE    default   minimum
blastn       11        7
megablast    28        8
blastp       3         2
Discontiguous Megablast
• Uses discontiguous word matches
• Better for cross-species comparisons
Templates for Discontiguous Words
W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
W = word size; # matches in template
t = template length
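How a template is applied can be sketched briefly: only positions marked 1 have to match, while positions marked 0 are ignored. The helper names below are hypothetical and the two 16-base windows are made-up examples. The coding template skips every third position, which matches the idea that the third codon base is the most variable.

```python
def template_positions(template):
    """Indices of the '1' (must-match) positions in a discontiguous template."""
    return [i for i, c in enumerate(template) if c == "1"]


def template_match(a, b, template):
    """True if windows a and b agree at every '1' position of the template."""
    if not (len(a) == len(b) == len(template)):
        raise ValueError("windows and template must have equal length")
    return all(a[i] == b[i] for i in template_positions(template))


CODING_11_16 = "1101101101101101"  # W = 11, t = 16, coding

window_a = "ACGACGACGACGACGA"  # made-up example sequences
window_b = "ACTACTACTACTACTA"  # differs from window_a at every third base only
print(len(template_positions(CODING_11_16)), template_match(window_a, window_b, CODING_11_16))
```

A contiguous 11-base word would find no seed between these two windows, whereas the template still reports a hit.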
Discontiguous (Cross-species) MegaBLAST
Discontiguous Word Options
Discontiguous Megablast Example . . .
Query: NM_078651
Drosophila melanogaster CG18582-PA (mbt) mRNA, (3244 bp)
/note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2
Database: nr (nt), Mammalia[orgn]
MegaBLAST = “No significant similarity found.”
Discontiguous megaBLAST = numerous hits . . .
Ex: Discontiguous MegaBLAST
Ex: BLASTN
PSI-BLAST
Position-specific Iterated BLAST
Example: Confirming relationships of purine nucleotide metabolism proteins
PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
0.005 E value cutoff for PSSM
RESULTS: Initial BLASTP
Same results as protein-protein BLAST; different format
Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Tenth PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Check to add to PSSM
PHI-BLAST
>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASELIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEIASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK
[GA]xxxxGK[ST]
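The pattern reads as: G or A, any four residues, then G, K, and finally S or T. As a rough illustration, it can be translated into a regular expression and matched against an excerpt of the CED4 sequence above. This sketch only locates the pattern occurrence; it does not perform the alignment step that PHI-BLAST adds. The `pattern_to_regex` helper is hypothetical and handles only the `x` wildcard used here.

```python
import re


def pattern_to_regex(pattern):
    """Translate the slide's pattern syntax: 'x' stands for any residue."""
    return pattern.replace("x", ".")


regex = re.compile(pattern_to_regex("[GA]xxxxGK[ST]"))

# Short excerpt of the CED4_CAEEL sequence shown above.
fragment = "FLFLHGRAGSGKSVIAS"
match = regex.search(fragment)
print(match.group(0), match.start())
```

PHI-BLAST uses such a hit as an anchor and then evaluates the similarity around it, so the pattern restricts the search rather than replacing the alignment.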
What’s New?
BLAST Databases
Nucleotide
• refseq_rna = NM_*, XM_*
• refseq_genomic = NC_*, NG_*
• env_nt
– environmental sample[filter], e.g., 16S rRNA
Protein
• refseq = NP_*, XP_*
• env_nr
nr = nr
New Formatter
Select lower case
Select red
BLAST Output: Alignments & Filter
low complexity sequence filtered
BLAST Output: CDS Feature
Advanced Options
Limit to Organism
all[filter] NOT ma
Example Entrez Queries
all[Filter] NOT mammalia[Organism]
ray finned fishes[Organism]
srcdb refseq[Properties]
Nucleotide only:
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
-e 10000 -v 2000
Genome BLAST
Genome BLAST via Map Viewer
Example: Human Genome BLAST
TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC
Human EST
Human Genome BLAST: Results
Human Genome BLAST: MapViewer
Entrez Gene
Example: Mapping Oligos Onto a Genome
>forward
CCATGGCGACCCTGGAAAAGC
>reverse
CAGCAGCGGCTGTGCCTGCGG
Map Oligos Onto Genome
>CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG
forward primer + NNNNNNNNNN + reverse primer
-W 7 -e 1000
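The combined query above is simply the two primers joined by a run of N's, so that one search reports both primer sites. A minimal sketch of building that string is shown below; the `join_primers` helper is hypothetical, and the spacer length of 10 is taken from the example on the slide.

```python
def join_primers(forward, reverse, spacer_len=10):
    """Concatenate two primers with a run of N's between them."""
    return forward + "N" * spacer_len + reverse


query = join_primers("CCATGGCGACCCTGGAAAAGC", "CAGCAGCGGCTGTGCCTGCGG")
print(query)
print(len(query))
```

Each primer is only 21 bases long, which is presumably why the slide lowers the word size to 7 and raises the expect threshold to 1000 so that such short matches are still reported.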
Genome BLAST Results
Primer Alignments
forward primer
reverse primer
MapViewer
Sequence View (sv)
forward
reverse
Service Addresses
• BLAST [email protected]
• General Help [email protected]
• Wayne Matten [email protected]