NCBI FieldGuide A Field Guide part 2 February 14, 2006 UT-Health Science Center National Center for Biotechnology Information


Page 1: A Field Guide part 2

A Field Guide part 2

February 14, 2006 UT-Health Science Center

National Center for Biotechnology Information

Page 2: A Field Guide part 2

GenBank Records

Header

Feature Table

Sequence

The Flatfile Format

Page 3: A Field Guide part 2

A Typical GenBank Record

LOCUS       NM_019570    4279 bp    mRNA    linear    INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .

= Title
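The LOCUS line of the record above packs several fields into one whitespace-separated line. A minimal sketch of pulling them apart follows; the function name `parse_locus` and the dictionary keys are illustrative assumptions, not part of any NCBI tool, and real flatfiles have variants this sketch does not handle.

```python
def parse_locus(line: str) -> dict:
    """Split a GenBank LOCUS line into named fields.

    Assumes the eight whitespace-separated tokens shown on the slide:
    LOCUS, name, length, 'bp', molecule type, topology, division, date.
    """
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    return {
        "name": parts[1],
        "length": int(parts[2]),   # sequence length in base pairs
        "molecule": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "date": parts[7],
    }


record = parse_locus("LOCUS NM_019570 4279 bp mRNA linear INV 28-OCT-2004")
```

The `length`, `molecule` and `date` values correspond to what Entrez indexes as sequence length, biomol and modification date.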

Page 4: A Field Guide part 2

GenBank Record: Feature Table

Page 5: A Field Guide part 2

GenBank Record: Feature Table, cont.

GenPept identifier

Page 6: A Field Guide part 2

GenBank Record: sequence

skip

Page 7: A Field Guide part 2

Indexing for Nucleotide UID 59958365

Field                Indexed Terms
[primary accession]  NM_001012399
[title]              Bos taurus hemochromatosis (hfe), mRNA.
[organism]           Bos taurus
[sequence length]    1168
[modification date]  2005/02/19
[properties]         biomol mrna
                     gbdiv mam
                     srcdb refseq

Field abbreviations: [accn] (primary accession), [orgn] (organism), [mdat] (modification date), [prop] (properties)

Page 8: A Field Guide part 2

Global Entrez Search: HFE

HFE

Page 9: A Field Guide part 2


Entrez Nucleotide: HFE 137 records

Not HFE [Title]

Page 10: A Field Guide part 2

Smarter Query

hfe[title] AND human[orgn]

42 records

Curated HFE splice variants (11 total)

Page 11: A Field Guide part 2

hfe[title] AND human[orgn] (cont.)

Primary data

Page 12: A Field Guide part 2

Preview/Index

Gateway to Advanced Searches

Page 13: A Field Guide part 2

Preview/Index

Page 14: A Field Guide part 2

Preview/Index: Properties, srcdb

srcdb

Properties

Page 15: A Field Guide part 2

Preview/Index: Properties, srcdb

… AND srcdb refseq[Properties]

Page 16: A Field Guide part 2

Preview/Index: Properties, srcdb

… AND srcdb ddbj/embl/genbank[Properties]

Page 17: A Field Guide part 2

Database Queries

#1  hfe                                     137
#2  hfe[title] AND human[orgn]               42
#3  #2 AND srcdb refseq[prop]                11
#4  #2 AND srcdb ddbj/embl/genbank[prop]     31
#5  #4 AND gbdiv pri[prop]                   29
#6  #4 AND gbdiv est[prop]                    2

Primate division: gbdiv pri[prop]
EST division: gbdiv est[prop]
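The query history above is built by tagging each term with a field and chaining terms with AND. A small sketch of composing such query strings is shown below; the helper names `tagged` and `entrez_and` are hypothetical conveniences for illustration and are not part of Entrez itself.

```python
def tagged(term: str, field: str) -> str:
    """Attach an Entrez field tag, e.g. tagged('hfe', 'title') -> 'hfe[title]'."""
    return f"{term}[{field}]"


def entrez_and(*terms: str) -> str:
    """Join already-formed terms with the Entrez boolean operator AND."""
    return " AND ".join(terms)


# Query #2 from the slide: title term limited to one organism.
base = entrez_and(tagged("hfe", "title"), tagged("human", "orgn"))

# Queries #3 and #4: the same base split by source database.
refseq_only = entrez_and(f"({base})", tagged("srcdb refseq", "prop"))
primary_only = entrez_and(f"({base})", tagged("srcdb ddbj/embl/genbank", "prop"))
```

The parentheses play the role of the `#2` history reference: they keep the base query grouped when more limits are appended.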

Page 18: A Field Guide part 2

Molecule Queries

#1  hfe                              116
#2  hfe[title] AND human[orgn]        42
#3  #2 AND biomol mrna[prop]          29
#4  #2 AND biomol genomic[prop]       13

Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]

Page 19: A Field Guide part 2

More Queries…

Fields are database-specific

Entrez Nucleotide

Reviewed RefSeqs with transcript variants:

srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]

Page 20: A Field Guide part 2

More Queries…

Fields are database-specific

Entrez Nucleotide

Reviewed RefSeqs with transcript variants:

srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]

Topoisomerase genes from Archaea:

topoisomerase[gene name] AND archaea[organism]

Entrez Gene

Genes on human chromosome 2 with OMIM links

2[chromosome] AND human[organism] AND “gene omim”[filter]

Membrane proteins linked to cancer:

“integral to plasma membrane”[gene ontology] AND cancer[dis]

Page 21: A Field Guide part 2

Genome Resources

UniGene

Trace Archive

Map Viewer

Genomic Biology

E-PCR

Page 22: A Field Guide part 2


Genomic Biology

Page 23: A Field Guide part 2

Gen Biol: Gen Resources

Page 24: A Field Guide part 2

Map Viewer – Genome Annotation Updates

Page 25: A Field Guide part 2

Gen Biol: Gen Resources

Page 26: A Field Guide part 2


Genome Projects: microb

Page 27: A Field Guide part 2

Genome Projects: microb

13 Eukaryotic Genome Sequencing Projects Selected: Complete – 0, Assembly – 2, In Progress – 11

Page 28: A Field Guide part 2

Gen Biol: Gen Resources

Page 29: A Field Guide part 2

Gen Biol: Gen Resources

Page 30: A Field Guide part 2

Gen Biol: Gen Resources

Page 31: A Field Guide part 2

Gen Biol: Gen Resources

Page 32: A Field Guide part 2

Gen Biol: Gen Resources

Page 33: A Field Guide part 2

Genome Resources

UniGene

Trace Archive

Map Viewer

Genomic Biology

E-PCR

Page 34: A Field Guide part 2

UniGene

Gene-oriented clusters of expressed sequences

• Automatic clustering using MegaBlast

• Each cluster represents a unique gene

• Informed by genome hits

• Information on tissue types and map locations

• Useful for gene discovery and selection of mapping reagents

Page 35: A Field Guide part 2

A Cluster of ESTs

query

5’ EST hits

3’ EST hits

Page 36: A Field Guide part 2

UniGene Collections

Page 37: A Field Guide part 2

UniGene Collections

Species UniGene

Page 38: A Field Guide part 2

UniGene Hs build 188

Page 39: A Field Guide part 2

UniGene Cluster Hs.95351

Lipase, hormone-sensitive (LIPE)

Page 40: A Field Guide part 2

UniGene Cluster Hs.95351

Page 41: A Field Guide part 2


UniGene Cluster Hs.95351: expression

Page 42: A Field Guide part 2

UniGene Cluster Hs.95351: seqs

Page 43: A Field Guide part 2

Get Sequences

web page

ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Page 44: A Field Guide part 2

Genome Resources

UniGene

Trace Archive

Map Viewer

Genomic Biology

E-PCR

Page 45: A Field Guide part 2

E-PCR

Genomic sequence here

Page 46: A Field Guide part 2


Options

Page 47: A Field Guide part 2


Results

Page 48: A Field Guide part 2


reverse e-pcr

Page 49: A Field Guide part 2


reverse e-pcr

Page 50: A Field Guide part 2


reverse e-pcr

Page 51: A Field Guide part 2


reverse e-pcr

Gene STS

LY6G6D: lymphocyte antigen 6 complex, locus G6D

Page 52: A Field Guide part 2

Genome Resources

UniGene

Trace Archive

Genomic Biology

Map Viewer

E-PCR

Page 53: A Field Guide part 2


List View

Page 54: A Field Guide part 2

Human MapViewer

adar

Page 55: A Field Guide part 2

MapViewer: Human ADAR

Page 56: A Field Guide part 2

MV Hs ADAR

3’ UTR

5’ UTR

Page 57: A Field Guide part 2

Maps & Options

--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST, Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation (= SNP)

--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease

--Genetic Maps--
deCODE, Genethon, Marshfield

--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC

Page 58: A Field Guide part 2

MapViewer

UniGene

Component

Repeats

Gene

Page 59: A Field Guide part 2

Gene

Phenotype Variation

Page 60: A Field Guide part 2

Maps & Options

Page 61: A Field Guide part 2

Human ADAR

Chimp ADAR

Mouse ADAR

Page 62: A Field Guide part 2

Genome Resources

UniGene

Map Viewer

Genomic Biology

Trace Archive

E-PCR

Page 63: A Field Guide part 2


Trace Archive Page

Page 64: A Field Guide part 2


Ciona savignyi Traces

Page 65: A Field Guide part 2


Page 66: A Field Guide part 2

Trace Archive BLAST Page

Potential access to sequences NOT yet in GenBank

Page 67: A Field Guide part 2


Basic Local Alignment Search Tool

Page 68: A Field Guide part 2

BLAST Web Searches, 2005

200,000

Page 69: A Field Guide part 2

Precomputed BLAST Services

Nucleotide or protein: Related Sequences

BLAST link: BLink

Transcript clusters: UniGene

Protein homologs: HomoloGene

Page 70: A Field Guide part 2


Link to Related Sequences

Page 71: A Field Guide part 2

Related Sequences

Most similar

Least similar

Page 72: A Field Guide part 2


BLink (BLAST Link)

Page 73: A Field Guide part 2

BLink Output

Best hits

3D structures

CDD-Search

Page 74: A Field Guide part 2

Why Is BLAST So Popular?

Fast
- heuristic approach based on Smith-Waterman

Local alignments

Statistical significance
- Expect value

Versatile
- blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast
- www, standalone, and network clients

Page 75: A Field Guide part 2

Global vs Local Alignment

Seq 1

Seq 2

Seq 1

Seq 2

Global alignment

Local alignment

Page 76: A Field Guide part 2

Global vs Local Alignment

Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)

Global

Seq1:  1   W--HEREISWALTERNOW 16
           W  HERE
Seq2:  1 HEWASHEREBUTNOWISHERE 21

Local

Seq1:  1 W--HERE  5
         W  HERE
Seq2:  3 WASHERE  9

Seq1:  1 W--HERE  5
         W  HERE
Seq2: 15 WISHERE 21
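A local alignment scores only the best-matching region of two sequences rather than their full lengths. The sketch below computes a Smith-Waterman local alignment score for the two example strings; the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration and are not the scores BLAST uses.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]                     # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A cell never drops below 0: that is what makes the alignment local.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
score = local_alignment_score(seq1, seq2)
```

Because cell scores are clamped at zero, unrelated flanking text does not drag the score down, which is why the shared word HERE can be found at two different places in Seq2.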

Page 77: A Field Guide part 2

How BLAST Works

1. Make lookup table of “words” for query

2. Scan database for hits

3. Extend alignment both directions

– Ungapped extensions of hits (initial HSPs)

– Gapped extensions (no traceback)

– Gapped extensions (traceback - alignment details)

Page 78: A Field Guide part 2

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Make a lookup table of words

Word size = 3 (default)

Word size can only be 2 or 3

GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...

Neighborhood Words

VTV 12
LTV 11
VSV 8
etc.

Neighborhood score threshold
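The word list above is every overlapping 3-letter window of the query, and the neighborhood words are scored against a query word with the substitution matrix. The sketch below reproduces both steps; the listed scores 12, 11 and 8 are consistent with scoring the query word ITV under BLOSUM62. The function names and the five-entry matrix subset are illustrative assumptions, with the values taken from the BLOSUM62 table shown later in this deck.

```python
def query_words(query: str, w: int = 3) -> list:
    """All overlapping words of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# Only the BLOSUM62 entries needed for this example (symmetric pairs).
BLOSUM62_SUBSET = {
    ("I", "V"): 3, ("I", "L"): 2,
    ("T", "T"): 5, ("T", "S"): 1,
    ("V", "V"): 4,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair in either order; KeyError if not in the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word: str, neighbor: str) -> int:
    """Sum of pairwise substitution scores for two equal-length words."""
    return sum(pair_score(x, y) for x, y in zip(word, neighbor))


words = query_words("GTQITVEDLFYNIATRRKALKN")
neighbors = {n: word_score("ITV", n) for n in ("VTV", "LTV", "VSV")}
```

A neighbor is kept in the lookup table only if its score reaches the neighborhood score threshold, so a higher threshold gives fewer seeds and a faster but less sensitive search.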

Page 79: A Field Guide part 2

BLASTP Summary

Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…

Example query words: YLS, HFL

Neighborhood words

YLS 15, YLT 12, YVS 12, YIT 10, etc. …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc. …

Neighborhood score threshold T (-f) = 11

Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI  47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333

Drop-off score = Highest score – current score

-X  X dropoff value for gapped alignment (in bits): blastn 30, megablast 20, tblastx 0, all others 15

Page 80: A Field Guide part 2

BLASTP Summary

Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…

Example query words: YLS, HFL

Neighborhood words

YLS 15, YLT 12, YVS 12, YIT 10, etc. …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc. …

Neighborhood score threshold T (-f) = 11

High-scoring pair (HSP)

Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI  47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333

Gapped extension with trace back

Final HSP

Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337

Page 81: A Field Guide part 2

Scoring Systems - Nucleotides

Identity matrix [ -r 1 -q -3 ]

     A   G   C   T
A   +1  –3  –3  –3
G   –3  +1  –3  –3
C   –3  –3  +1  –3
T   –3  –3  –3  +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||     raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
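The raw score above is the number of matching columns times the reward, plus the penalty for every other column. The sketch below reproduces the slide's arithmetic; note the assumption, taken from the "19-9" on the slide, that the single gap column is charged the same -3 as a mismatch, whereas BLAST itself uses separate gap opening and extension costs.

```python
def raw_score(top: str, bottom: str, reward: int = 1, penalty: int = -3) -> int:
    """Column-by-column score of an already-aligned pair of sequences.

    Identical non-gap columns earn `reward`; every other column (mismatch
    or gap) is charged `penalty`, mirroring the slide's 19 - 9 arithmetic.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        reward if a == b and a != "-" else penalty
        for a, b in zip(top, bottom)
    )


score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
```

With 19 matching columns and 3 non-matching ones, the result is 19 x 1 + 3 x (-3) = 10.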

Page 82: A Field Guide part 2

Scoring Systems - Proteins

Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST

Page 83: A Field Guide part 2

BLOSUM62

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

D

F

Negative for less likely substitutions

D

Y

F

Positive for more likely substitutions

Page 84: A Field Guide part 2


Position-Specific Score Matrix

DAF-1

Serine/Threonine protein kinases catalytic loop

PSSM scores: 1, 7, 4, 5, 4

Page 85: A Field Guide part 2

Position-Specific Score Matrix

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
435 K  -1   0   0  -1  -2   3   0   3   0  -2  -2   1  -1  -1  -1  -1  -1  -1  -1  -2
436 E   0   1   0   2  -1   0   2  -1   0  -1  -1   0   0   0  -1   0   0  -1  -1  -1
437 S   0   0  -1   0   1   1   0   1   1   0  -1   0   0   0   2   0  -1  -1   0  -1
438 N  -1   0  -1  -1   1   0  -1   3   3  -1  -1   1  -1   0   0  -1  -1   1   1  -1
439 K  -2   1   1  -1  -2   0  -1  -2  -2  -1  -2   5   1  -2  -2  -1  -1  -2  -2  -1
440 P  -2  -2  -2  -2  -3  -2  -2  -2  -2  -1  -2  -1   0  -3   7  -1  -2  -3  -1  -1
441 A   3  -2   1  -2   0  -1   0   1  -2  -2  -2   0  -1  -2   3   1   0  -3  -3   0
442 M  -3  -4  -4  -4  -3  -4  -4  -5  -4   7   0  -4   1   0  -4  -4  -2  -4  -1   2
443 A   4  -4  -4  -4   0  -4  -4  -3  -4   4  -1  -4  -2  -3  -4  -1  -2  -4  -3   4
444 H  -4  -2  -1  -3  -5  -2  -2  -4  10  -6  -5  -3  -4  -3  -2  -3  -4  -5   0  -5
445 R  -4   8  -3  -4   0  -1  -2  -3  -2  -5  -4   0  -3  -2  -4  -3  -3   0  -4  -5
446 D  -4  -4  -1   8  -6  -2   0  -3  -3  -5  -6  -3  -5  -6  -4  -2  -3  -7  -5  -5
447 I  -4  -5  -6  -6  -3  -4  -5  -6  -5   3   5  -5   1   1  -5  -5  -3  -4  -3   1
448 K   0   0   1  -3  -5  -1  -1  -3  -3  -5  -5   7  -4  -5  -3  -1  -2  -5  -4  -4
449 S   0  -3  -2  -3   0  -2  -2  -3  -3  -4  -4  -2  -4  -5   2   6   2  -5  -4  -4
450 K   0   3   0   1  -5   0   0  -4  -1  -4  -3   4  -3  -2   2   1  -1  -5  -4  -4
451 N  -4  -3   8  -1  -5  -2  -2  -3  -1  -6  -6  -2  -4  -5  -4  -1  -2  -6  -4  -5
452 I  -3  -5  -5  -6   0  -5  -5  -6  -5   6   2  -5   2  -2  -5  -4  -3  -5  -3   3
453 M  -4  -4  -6  -6  -3  -4  -5  -6  -5   0   6  -5   1   0  -5  -4  -3  -4  -3   0
454 V  -3  -3  -5  -6  -3  -4  -5  -6  -5   3   3  -4   2  -2  -5  -4  -3  -5  -3   5
455 K  -2   1   1   4  -5   0  -1  -2   1  -4  -2   4  -3  -2  -3   0  -1  -5  -2  -3
456 N   1   1   3   0  -4  -1   1   0  -3  -4  -4   3  -2  -5  -2   2  -2  -5  -4  -4
457 D  -3  -2   5   5  -1  -1   1  -1   0  -5  -4   0  -2  -5  -1   0  -2  -6  -4  -5
458 L  -3  -1   0  -3   0  -3  -2   3  -4  -2   3   0   1   1  -2  -2  -3   5  -1  -3

catalytic loop

Page 86: A Field Guide part 2

Local Alignment Statistics

(applies to ungapped alignments)

Expect Value
E = number of database hits you expect to find by chance, with score ≥ S

$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$

K = scale for search space
λ = scale for scoring system
S’ = bitscore = $(\lambda S - \ln K)/\ln 2$

More info: The Statistics of Sequence Similarity Scores
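The two forms of the Expect value formula are equivalent once the raw score is converted to a bit score. The sketch below checks that numerically; it assumes the usual reading of m and n as the query length and the database length, and the function names are illustrative rather than taken from any BLAST library.

```python
import math


def bit_score(raw: float, lam: float, k: float) -> float:
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(m: int, n: int, raw: float, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(m: int, n: int, bits: float) -> float:
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)
```

Each extra bit of score halves the number of chance hits expected, and doubling the search space m x n doubles it, which is why the same alignment has a larger E value against a larger database.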

Page 87: A Field Guide part 2

An Alignment BLAST Cannot Make

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Reason: no contiguous exact match of 7 bp.
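The reason given above can be checked directly: walk the two sequences column by column and track the longest run of identical bases. The sketch below does that for the ungapped alignment on this slide; the function name is an illustrative assumption.

```python
def longest_exact_run(a: str, b: str) -> int:
    """Length of the longest stretch of identical columns in an ungapped alignment."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0   # a mismatch resets the current run
        best = max(best, run)
    return best


seq1 = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
seq2 = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)
longest = longest_exact_run(seq1, seq2)
```

The longest run here is 6 bases, one short of the minimum blastn word size of 7, so no seed is ever found and the alignment is never extended.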

Page 88: A Field Guide part 2

An Alignment BLAST Can Make

Solution: compare protein sequences; BLASTX

BLAST 2 Sequences (blastx) output:

Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

Page 89: A Field Guide part 2


Other BLAST Algorithms

• Megablast

• Discontiguous Megablast

• PSI-BLAST

• PHI-BLAST

Page 90: A Field Guide part 2


Megablast: NCBI’s Genome Annotator

• Long alignments of similar DNA sequences

• Greedy algorithm

• Concatenation of query sequences

• Faster than blastn; less sensitive

Page 91: A Field Guide part 2

MegaBLAST & Word Size

WORD SIZE    default   minimum
blastn          11         7
megablast       28         8
blastp           3         2

Trade-off: sensitivity vs speed

Page 92: A Field Guide part 2


Discontiguous Megablast

• Uses discontiguous word matches

• Better for cross-species comparisons

Page 93: A Field Guide part 2

Templates for Discontiguous Words

W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111

W = word size; # matches in template

t = template length

Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
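In a template, a 1 marks a position that must match and a 0 marks a position that is ignored, so W is the count of 1s and t is the total length. The sketch below tests whether two equal-length windows form a hit under one of the templates listed on this slide; the function and constant names are illustrative assumptions.

```python
def template_hit(template: str, a: str, b: str) -> bool:
    """True if windows a and b agree at every '1' position of the template."""
    if not (len(template) == len(a) == len(b)):
        raise ValueError("template and both windows must have the same length")
    return all(x == y for t, x, y in zip(template, a, b) if t == "1")


# W = 11, t = 16, coding template from the slide.
CODING_11_16 = "1101101101101101"

word_size = CODING_11_16.count("1")   # number of required matches (W)
template_length = len(CODING_11_16)   # window length (t)
```

In the coding templates every third position is a 0, which tolerates differences at codon third positions; that tolerance is what makes discontiguous words more sensitive across species than a contiguous word of the same size.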

Page 94: A Field Guide part 2


Page 95: A Field Guide part 2

Discontiguous (Cross-species) MegaBLAST

Page 96: A Field Guide part 2

Discontiguous Word Options

Page 97: A Field Guide part 2


Disco. Megablast Example . . .

Discontiguous megaBLAST = numerous hits . . .

Query: NM_078651

Drosophila melanogaster CG18582-PA (mbt) mRNA, (3244 bp)

/note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2

MegaBLAST = “No significant similarity found.”

Database: nr (nt), Mammalia[orgn]

Page 98: A Field Guide part 2

Ex: Discontiguous MegaBLAST

Page 99: A Field Guide part 2

Ex: BLASTN

Page 100: A Field Guide part 2

PSI-BLAST

Position-specific Iterated BLAST

Example: Confirming relationships of purine nucleotide metabolism proteins

Page 101: A Field Guide part 2

PSI-BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

0.005 E value cutoff for PSSM

Page 102: A Field Guide part 2

RESULTS: Initial BLASTP

Same results as protein-protein BLAST; different format

Page 103: A Field Guide part 2

Results of First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Page 104: A Field Guide part 2

Tenth PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Check to add to PSSM

Page 105: A Field Guide part 2

PHI-BLAST

>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASELIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEIASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK

[GA]xxxxGK[ST]
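The pattern above reads as: G or A, any four residues, then G, K, and finally S or T. A quick way to see where it hits is to translate it into a regular expression; the sketch below does this for the simple syntax used here (brackets plus lowercase x) and is not a full PROSITE pattern parser. The function name is an illustrative assumption.

```python
import re


def phi_pattern_to_regex(pattern: str) -> str:
    """Translate a simple PHI-BLAST pattern to a regex.

    Handles only bracketed residue sets and lowercase 'x' (any residue);
    residue letters themselves are uppercase, so replacing 'x' is safe.
    """
    return pattern.replace("x", ".")


regex = phi_pattern_to_regex("[GA]xxxxGK[ST]")

# Fragment copied from the CED4 sequence on this slide.
fragment = "FLFLHGRAGSGKSVIASQALSK"
hit = re.search(regex, fragment)
matched = hit.group() if hit else None
```

In this fragment the pattern matches GRAGSGKS.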

Page 106: A Field Guide part 2


What’s New?

Page 107: A Field Guide part 2

BLAST Databases

Nucleotide

• refseq_rna = NM_*, XM_*

• refseq_genomic = NC_*, NG_*

• env_nt – environmental sample[filter], e.g., 16S rRNA

Protein

• refseq = NP_*, XP_*

• env_nr

nr = nr

Page 108: A Field Guide part 2

New Formatter

Select lower case

Select red

Page 109: A Field Guide part 2


BLAST Output: Alignments & Filter

low complexity sequence filtered

Page 110: A Field Guide part 2


BLAST Output: CDS Feature

Page 111: A Field Guide part 2

Advanced Options

Limit to Organism

all[filter] NOT ma

Example Entrez Queries
all[Filter] NOT mammalia[Organism]
ray finned fishes[Organism]
srcdb refseq[Properties]

Nucleotide only:
biomol mrna[Properties]
biomol genomic[Properties]

Other Advanced
-e 10000  expect value
-v 2000   descriptions
-b 2000   alignments

-e 10000 -v 2000

Page 112: A Field Guide part 2

Genome BLAST

Page 113: A Field Guide part 2

Genome BLAST via Map Viewer

Page 114: A Field Guide part 2

Example: Human Genome BLAST

TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC

Human EST

Page 115: A Field Guide part 2

Human Genome BLAST: Results

Page 116: A Field Guide part 2

Human Genome BLAST: MapViewer

Entrez Gene

Page 117: A Field Guide part 2

Example: Mapping Oligos Onto a Genome

>forward
CCATGGCGACCCTGGAAAAGC

>reverse
CAGCAGCGGCTGTGCCTGCGG

Page 118: A Field Guide part 2

Map Oligos Onto Genome

>CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG

-W 7 -e 1000

forward primer reverse primer
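The query above is simply the forward primer, a run of N characters, and the reverse primer joined into one sequence, so that both primers are searched in a single BLAST request. A minimal sketch of building that query follows; the function name and the default spacer length are illustrative assumptions.

```python
def join_primers(forward: str, reverse: str, spacer_len: int = 10) -> str:
    """Concatenate two primers with a run of N so each aligns as its own hit."""
    return forward + "N" * spacer_len + reverse


forward = "CCATGGCGACCCTGGAAAAGC"
reverse = "CAGCAGCGGCTGTGCCTGCGG"
query = join_primers(forward, reverse)
```

Short oligos need the permissive settings shown on the slide: a small word size (-W 7) so a 21-mer can seed a hit, and a high expect threshold (-e 1000) so short, low-scoring alignments are still reported.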

Page 119: A Field Guide part 2

Genome BLAST Results

Page 120: A Field Guide part 2


Primer Alignments

forward primer

reverse primer

Page 121: A Field Guide part 2


MapViewer

Page 122: A Field Guide part 2


MapViewer

Page 123: A Field Guide part 2

Sequence View (sv)

forward

reverse

Page 124: A Field Guide part 2

Service Addresses

• BLAST [email protected]

• General Help [email protected]

• Wayne Matten [email protected]