imogene-cobb
Pattern databases – definition
Secondary databases derived from conserved regions obtained by multiple sequence alignment of sequences from primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.
Primary databases (SWISS-PROT - protein; GenBank - DNA)
Millions of sequences
Pattern databases
Pattern Extraction - Multiple sequence alignment
Thousands of patterns
Pattern Databases - Applications
Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%).
Useful for classification of protein sequences into families.
Searching a pattern database takes less time than searching a primary database, since patterns are a compact representation of the features of many sequences.
Multiple Sequence Alignment (MSA)
Family-based databases – consider the full MSA
Motif-based databases – consider local regions in the MSA (e.g. Motif-1, Motif-3)
Pattern Databases – Protein
Motif based: PROSITE, PRINTS, BLOCKS
Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS
InterPro – integrated resource of protein families and sites: PROSITE, PRINTS, BLOCKS, Pfam, ProDom
Pattern databases
Definition
Applications
Classifications
Common databases
– PROSITE, PRINTS, BLOCKS & SMART (motif based)
– MetaFam, InterPro (Integrated databases)
Conclusions
Databases – General Tips
1. Source
2. Input formats & parameters
3. Output formats
4. Quality of the data
5. Other details – updates, coverage, speed, download, reference, methods etc.
Focus
To search pattern databases for the enzyme “alkaline phosphatase” using their text or keyword search options.
To analyze the quality of the results from each of these databases – sensitivity, specificity.
Sequence and pattern searches – in the afternoon’s practical.
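To make the two quality measures concrete, here is a minimal sketch using the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function names and the counts are hypothetical, chosen only for illustration; they are not results from any of the databases.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true family members that the signature detects."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members that the signature correctly rejects."""
    return tn / (tn + fp)


# Hypothetical counts for one search (made up, not real database results).
tp, fp, fn, tn = 45, 5, 15, 935

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```

A signature that misses family members has low sensitivity; one that picks up unrelated sequences has low specificity.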
PROSITE http://www.expasy.org/prosite/
PROSITE consists of biologically significant protein sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to.
Based on SWISS-PROT/TrEMBL
Details about the pattern/profile
PROSITE ID
PROSITE pattern
Result: PROSITE documentation page
[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
Numerical results
PROSITE pattern
Detailed view - page 1
Detailed view - page 2
True positives
False positives
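A PROSITE pattern such as the alkaline phosphatase one above can be read as a regular expression. The sketch below translates a common subset of the pattern syntax (x, x(n), x(n,m), [..], {..}, < and >) into a Python regex and scans a sequence with it. The helper name `prosite_to_regex` and the toy sequence are assumptions made for illustration; this is not the scanning tool used by PROSITE itself.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern (common subset of the syntax) into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any residue except P
        element = element.replace("x", ".")                      # x: any residue
        element = element.replace("(", "{").replace(")", "}")    # x(2) or x(2,4): repetition
        element = element.replace("<", "^").replace(">", "$")    # sequence-end anchors
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T")
toy_sequence = "MKAIVTDSGASATAWL"  # made-up sequence, not a real database entry

match = re.search(regex, toy_sequence)
print(regex)
if match:
    # Report the 1-based start position and the matched residues.
    print(match.start() + 1, match.group())
```

Every sequence in a database that matches the regex is a hit; comparing the hits with the known family members gives the true and false positives listed for the entry.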
View entry in raw text format (no links)
ID  Identification
AC  Accession number
DT  Date
DE  Short description
PA  Pattern
MA  Matrix/profile
RU  Rule
NR  Numerical results
CC  Comments
DR  Cross-references to SWISS-PROT
3D  Cross-references to PDB
DO  Pointer to the documentation file
//  Termination line
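Because every line of a raw entry starts with a two-character line code, the entry is easy to read programmatically. The sketch below groups lines by code and stops at the termination line. It assumes the code sits in the first two columns and the data starts at column six. The sample entry is invented for illustration: its ID, AC and DO values are placeholders, and only the PA line reuses the pattern shown earlier.

```python
def parse_entry(text: str) -> dict:
    """Group the lines of one raw entry by their two-character line code."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("//"):   # termination line ends the entry
            break
        if not line.strip():        # ignore blank lines
            continue
        code, value = line[:2], line[5:].strip()
        fields.setdefault(code, []).append(value)
    return fields


# Invented sample entry: ID, AC and DO are placeholders, not real identifiers.
sample = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Hypothetical example active site.
PA   [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T.
DO   PDOC99999;
//
"""

entry = parse_entry(sample)
print(entry["PA"][0])
```

Values are kept as lists because codes such as CC and DR can occur on several lines of one entry.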
Highly degenerate protein structural and functional domains – immunoglobulin domains, SH2 and SH3 domains.
Consensus sequences of repetitive DNA elements – SINEs, LINEs.
Basic gene expression signals – promoter elements, RNA processing signals, translational initiation sites.
DNA-binding protein motifs.
Protein and nucleic acid compositional domains – glutamine-rich activation domains, CpG islands.
PROSITE - features
Completeness
High specificity
Documentation
Periodic reviewing
Parallel update with SWISS-PROT (primary database)
Multiple Sequence Alignment
Find 4-5 functionally conserved residues
cydeggis
cyedggis
cyeeggit
cyhgdggs
cyrgdgnt
C-Y-x(2)-[DG]-G-x-[ST] (core pattern)
Scan SWISS-PROT with the core pattern (motif)
Too many false positives?
YES – increase the sequence length of the pattern and scan again
NO – add the pattern to the PROSITE DB
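The step from aligned segments to a core pattern can be sketched as a column-by-column summary: a column with one residue is written as that residue, a column with a few alternatives as a bracketed set, and anything more variable as x. The sketch below applies this to the five segments of the example and writes the result in PROSITE's x(n) notation. The cutoff `max_alternatives=2` and the function names are assumptions for illustration. Real PROSITE patterns are curated using biological knowledge, not derived this mechanically.

```python
segments = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]


def column_residues(aligned):
    """Return the sorted set of residues seen in each alignment column."""
    return ["".join(sorted(set(column))).upper() for column in zip(*aligned)]


def to_pattern(columns, max_alternatives=2):
    """Write columns as PROSITE elements, collapsing runs of x into x(n)."""
    elements = []
    for residues in columns:
        if len(residues) == 1:
            elements.append(residues)              # fully conserved column
        elif len(residues) <= max_alternatives:
            elements.append(f"[{residues}]")       # small set of alternatives
        else:
            elements.append("x")                   # too variable: any residue
    collapsed, run = [], 0
    for element in elements + [None]:              # None flushes a trailing run
        if element == "x":
            run += 1
            continue
        if run:
            collapsed.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            collapsed.append(element)
    return "-".join(collapsed)


print(to_pattern(column_residues(segments)))
```

A short pattern like this one is likely to match unrelated sequences by chance, which is why the procedure lengthens the pattern when a database scan returns too many false positives.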
PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/
Protein fingerprint database.
Fingerprint – a set of motifs that represents the most conserved regions of a multiple sequence alignment.
Better diagnostic reliability than single-motif methods.
Source – SWISS-PROT/TrEMBL
Multiple Sequence Alignment
Identification of ALL the conserved regions
cydeggis
cyedggis
cyeeggit
cyhgdggs
Creation of frequency matrices (one per motif; the set of motifs forms the fingerprint)
Iterative scanning of the frequency matrices against protein databases (SWISS-PROT/TrEMBL) until convergence
PRINTS DB
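As a rough illustration of the frequency-matrix step, the sketch below counts residues per column for the four aligned segments of the example and scores every window of a sequence against those counts. It shows a single scanning pass with unweighted counts. The function names and the toy sequence are assumptions, and the actual PRINTS scoring and convergence criteria are not reproduced here.

```python
from collections import Counter

motif = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs"]


def frequency_matrix(aligned):
    """One Counter per column: residue -> number of sequences that have it."""
    return [Counter(column) for column in zip(*aligned)]


def score(matrix, segment):
    """Sum over columns of how often the segment's residue was observed there."""
    return sum(column[residue] for column, residue in zip(matrix, segment))


def best_hit(matrix, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(matrix)
    windows = [(score(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1)]
    return max(windows)


matrix = frequency_matrix(motif)
toy_sequence = "aaacyeeggisaaa"  # made-up sequence for illustration
print(best_hit(matrix, toy_sequence))
```

In an iterative scheme, sequences scoring above a chosen threshold would be added to the alignment, the matrices rebuilt, and the scan repeated until no new sequences are found.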
Search by database ID, number of motifs, or text
Motif scanner (for searching a sequence or pattern against the PRINTS database)
Page 1 of the ‘alkaline phosphatase’ entry in PRINTS
Documentation, links & references
Page 3
Motif no. 1
Motif no. 2
“Raw” motif
SWISS-PROT IDs
Start and interval between motifs in the fingerprint
BLOCKS http://blocks.fhcrc.org/blocks/
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.
Making blocks
Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.
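To illustrate what “ungapped segment” means, the sketch below finds runs of gap-free columns in an already aligned set of sequences. This is not the PROTOMAT method, which applies a motif-finder to related sequences. It is only a simplified illustration, it does not measure how conserved the columns are, and the toy alignment, function name and `min_width` cutoff are assumptions.

```python
def ungapped_blocks(alignment, min_width=3):
    """Return (start, end) column ranges in which no sequence has a gap."""
    blocks, start = [], None
    n_columns = len(alignment[0])
    for col in range(n_columns + 1):          # one extra step closes a final run
        gap_free = col < n_columns and all(seq[col] != "-" for seq in alignment)
        if gap_free and start is None:
            start = col                        # a new gap-free run begins
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))    # end index is exclusive
            start = None
    return blocks


# Made-up alignment; "-" marks a gap.
toy_alignment = [
    "MK-CYDEGGIS--LV",
    "MKACYEDGGISQ-LV",
    "M--CYEEGGIT-RLV",
]

for start, end in ungapped_blocks(toy_alignment):
    print(start, end, [seq[start:end] for seq in toy_alignment])
```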
Page 2
BLOCK - 1
Start position of the block
SWISS-PROT ID
Weak blocks – strength < 1100; strong blocks – strength >= 1100
SMART http://smart.embl-heidelberg.de/
Contains more than 500 domain families found in signaling, extracellular and chromatin-associated proteins.
Each domain is extensively annotated with respect to phyletic distribution, functional class, tertiary structure and functionally important residues.
Results – Alkaline phosphatase “signatures”
PROSITE – represented as a single motif.
PRINTS – represented as 5 motif regions.
BLOCKS – represented as 6 block regions.
SMART – represented as a single profile.
MetaFam & PANAL
MetaFam - http://metafam.ahc.umn.edu/
PANAL – Protein ANALysis tool page of MetaFam http://mgd.ahc.umn.edu/panal/
Protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, SBASE, SYSTERS.
InterPro http://www.ebi.ac.uk/interpro
Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, SWISS-PROT and TrEMBL.
Text- and sequence-based searches.
Pattern databases
Definition
Applications
Classifications
Common databases
– PROSITE, PRINTS & BLOCKS (motif based)
– MetaFam, InterPro (Integrated databases)
Conclusions