EBI is an Outstation of the European Molecular Biology Laboratory.
Proteins to Proteomes The InterPro Database
http://www.ebi.ac.uk/interpro
Origins of InterPro
UniProt
Swiss-Prot: 290K, annotated
TrEMBL: 5M, raw data
Automated annotation ???
InterPro
Curated Annotation in InterPro
TrEMBL
uncharacterised sequence
Swiss-Prot
annotated sequence
TrEMBL
feed back common annotation
groups of related proteins
(same family or share domains)
multiple signatures
InterPro
Finding Conserved Signatures
From simplest (limited) to more information:
• Pattern
• Fingerprint
• Sequence clustering
• HMM
Patterns
Pattern/motif in sequence -> regular expression
Can define important sites:
• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding
EXAMPLE: Insulin
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
EXAMPLE: PS00262 Insulin family signature
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N
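The PS00262 pattern can be translated mechanically into an ordinary regular expression and run against the insulin sequence. The following is a minimal sketch, not PROSITE's own scanner: the helper name `prosite_to_regex` is hypothetical, and it handles only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<`/`>` anchors).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax (illustrative helper)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):    # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):      # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        # Split "core(n)" or "core(n,m)" into the residue spec and its repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":                # x means any residue
            core = "."
        elif core.startswith("{"):     # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        # [STDNEKPI] and single letters are already valid regex syntax.
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

INSULIN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
    "GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
match = re.search(regex, INSULIN)
print(regex)
print(match.group(0) if match else None)
```

The matched substring is the region of the A chain set off by spaces in the sequence above.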
Patterns
Sequence alignment
Extract pattern sequences: xxxxxxxxxxxxxxxxxxxxxxxx (Insulin family motif)
Define pattern
Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Pattern signature: PS00000
Fingerprints
Several motifs characterise family
Different combinations of motifs describe subfamilies
Identify small conserved regions in divergent proteins
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLK
SIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
3-motif fingerprint:
1) GIHARPATLLVQTASKF (His phosphorylation site)
2) KGKSVNLKSIMGVMSL (Ser phosphorylation site)
3) LGVGQGSDVTITVDGADE (Conserved site)
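A fingerprint hit requires all motifs to be present and in the correct order. The sketch below checks this for the three PR00107 motifs against the PTHP_ENTFA sequence. It uses exact string matching purely for illustration, which is an assumption on our part: a real fingerprint search scores each motif against a set of aligned motif sequences. The function name `fingerprint_hits` is hypothetical.

```python
# PTHP_ENTFA sequence and the three motifs, as given in the example.
HPR = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLK"
    "SIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE"
)
MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

def fingerprint_hits(sequence, motifs):
    """Return the 0-based start of each motif, or None if one is absent or out of order."""
    starts = []
    search_from = 0
    for motif in motifs:
        pos = sequence.find(motif, search_from)
        if pos == -1:
            return None
        starts.append(pos)
        search_from = pos + 1  # the next motif must start further along the sequence
    return starts

starts = fingerprint_hits(HPR, MOTIFS)
print(starts)
# The spacing between consecutive motif starts could also be compared with the family's.
print([b - a for a, b in zip(starts, starts[1:])])
```

Supplying the motifs in the wrong order makes `fingerprint_hits` return `None`, which mirrors the "correct order" requirement.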
Fingerprints
Sequence alignment
Extract motif sequences:
xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx
Define motifs: His phosphorylation site, Ser phosphorylation site, Conserved site
Fingerprint signature: motifs 1 2 3 in correct order with correct spacing
PR00000
Sequence clustering
Automatic clustering of homologous domains
**Rarely covers entire domain (conserved core)
**Signature size can change with release
Known domain families
Recruit homologous domains (PSI-BLAST)
Automatic clustering (MKDOM2)
Align domain families (ProDomAlign)
Hidden Markov Models (HMM)
Can characterise protein over entire length
Models conserved and divergent regions (position-specific scoring)
Models insertions and deletions
Outperform in sensitivity and specificity
More flexible (can use partial alignments)
Profile
Sequence alignment (Sequences 1 to 7)
Scoring matrix (residue frequency at each position in alignment)
Phe, Tyr and Leu found at position 1 of alignment
Phe most conserved: highest match value
Tyr and Leu found at equal frequency at position 1
Tyr closer to Phe than Leu
Scores: F > Y > L
Probability method gauges scoring parameters
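The F > Y > L ordering can be reproduced with a small position-specific score. This is only a sketch of one common profile-scoring scheme (a frequency-weighted average of substitution scores), not necessarily the method used by any member database. The column counts are hypothetical, chosen so that Tyr and Leu occur at equal frequency; the pairwise scores are the BLOSUM62 values for these three residues.

```python
# Hypothetical residue counts at alignment position 1 (7 sequences).
COLUMN_COUNTS = {"F": 5, "Y": 1, "L": 1}

# Pairwise substitution scores for F, Y and L (BLOSUM62 values).
SUBSTITUTION = {
    ("F", "F"): 6, ("F", "Y"): 3, ("F", "L"): 0,
    ("Y", "Y"): 7, ("Y", "L"): -1, ("L", "L"): 4,
}

def pair_score(a, b):
    """Look up a symmetric substitution score."""
    return SUBSTITUTION.get((a, b), SUBSTITUTION.get((b, a)))

def profile_score(residue, column_counts):
    """Average substitution score of `residue` against every residue seen in the column."""
    total = sum(column_counts.values())
    weighted = sum(count * pair_score(residue, other)
                   for other, count in column_counts.items())
    return weighted / total

scores = {r: profile_score(r, COLUMN_COUNTS) for r in "FYL"}
for residue, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(residue, round(score, 2))
```

Although Tyr and Leu are equally frequent in the column, Tyr scores higher because it is more similar to the dominant Phe.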
Hidden Markov Models (HMM)
Sequence alignment
Begin -> M1 -> M2 -> M3 -> M4 -> End
M = match state
Hidden Markov Models (HMM)
Begin -> M1 -> M2 -> M3 -> M4 -> End
Insert states: I1 I2 I3
Delete states: D1 D2 D3 D4
M = match state, I = insert state, D = delete state
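To show how match, insert and delete states let one model score sequences with insertions and deletions, here is a toy Viterbi scorer for a profile HMM. Everything numeric in it is an assumption made for illustration: the transition probabilities, the 0.8 consensus emission probability, the uniform background, and the short consensus string. It does not reproduce the parameters or architecture of any real HMM package.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)
NEG_INF = float("-inf")

# Hypothetical transition probabilities, stored as logs (M=match, I=insert, D=delete).
LOG_T = {name: math.log(p) for name, p in {
    "MM": 0.90, "MI": 0.05, "MD": 0.05,
    "IM": 0.50, "II": 0.50,
    "DM": 0.50, "DD": 0.50,
}.items()}

def match_emissions(consensus, p_consensus=0.8):
    """One log-odds emission table per match state, peaked on the consensus residue."""
    other = (1.0 - p_consensus) / (len(ALPHABET) - 1)
    return [
        {a: math.log((p_consensus if a == c else other) / BACKGROUND) for a in ALPHABET}
        for c in consensus
    ]

def viterbi_score(seq, emissions):
    """Log-odds score of the best path through the model (insert emissions score 0)."""
    n, k_max = len(seq), len(emissions)
    vm = [[NEG_INF] * (k_max + 1) for _ in range(n + 1)]
    vi = [[NEG_INF] * (k_max + 1) for _ in range(n + 1)]
    vd = [[NEG_INF] * (k_max + 1) for _ in range(n + 1)]
    vm[0][0] = 0.0  # Begin state
    for i in range(n + 1):
        for k in range(k_max + 1):
            if i > 0 and k > 0:  # match state k emits residue i
                vm[i][k] = emissions[k - 1][seq[i - 1]] + max(
                    vm[i - 1][k - 1] + LOG_T["MM"],
                    vi[i - 1][k - 1] + LOG_T["IM"],
                    vd[i - 1][k - 1] + LOG_T["DM"],
                )
            if i > 0:            # insert state k emits residue i
                vi[i][k] = max(vm[i - 1][k] + LOG_T["MI"], vi[i - 1][k] + LOG_T["II"])
            if k > 0:            # delete state k skips a model position
                vd[i][k] = max(vm[i][k - 1] + LOG_T["MD"], vd[i][k - 1] + LOG_T["DD"])
    return max(vm[n][k_max] + LOG_T["MM"],
               vi[n][k_max] + LOG_T["IM"],
               vd[n][k_max] + LOG_T["DM"])

model = match_emissions("GIHARPA")              # hypothetical 7-position consensus
exact = viterbi_score("GIHARPA", model)
insertion = viterbi_score("GIHAWRPA", model)    # one extra residue
deletion = viterbi_score("GIHRPA", model)       # one missing residue
mismatch = viterbi_score("GIHAKPA", model)      # one substituted residue
print(exact, insertion, deletion, mismatch)
```

The sequences with an insertion or deletion still receive a finite score, because the insert and delete states absorb the difference in length. A fixed-length pattern cannot do this.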
SAM Profile HMMs
Homologous structural superfamilies
Start with single seed sequence
Proteins in superfamily may have low sequence identity
Few proteins in family have PDB structures
Create 1 model for every protein in superfamily; combine results
Specialisation of Databases
PRINTS: Describe sibling families
PROSITE: Identify binding and active sites
PRODOM: Describe conserved core of domains
PFAM: Wide coverage of domains & families
SMART: Signalling, extracellular & nuclear domains
TIGRFAM: Functional classification of families
PIRSF: Families conserved in domain composition
PANTHER: Functional classification of families
Superfam: Structural-based domain classification
GENE3D: Structural-based domain classification
Foundations of InterPro
Manual curation
Integration of signatures
InterPro
InterPro Entry
Groups similar signatures together
Adds extensive annotation
Linked to other databases
Structural information and viewers
Links related signatures
Assigning Type
Domain: Biological units with defined boundaries
Family: Full-length signatures grouping related proteins
Repeat: Signature repeated as a series of short motifs
Site: Protein feature described by a Prosite pattern
Region: Any signature that doesn’t fit the above
Grouping Signatures Together
1) Same positions, same protein hits: PFAM (100), PROSITE (100). Entry: IPR000001
2) Same positions, different protein hits: PFAM and PROSITE, (100) and (50) hits. Entries: IPR000001, IPR000002
3) Different positions, same protein hits: PROSITE (100), PFAM (100). Entries: IPR000001, IPR000002
4) Different positions: PFAM (100), PROSITE (100). Entries: IPR000001, IPR000002
Link related signatures - relationships
1) Parent - Child (subgroup of more closely related proteins)
PFAM (100): Protein kinase
SMART (75): Serine kinase
PROSITE (25): Tyrosine kinase
* No proteins in common
Parent: PFAM Protein kinase
Children: SMART Serine kinase, PROSITE Tyrosine kinase
Applies to domains and families
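The parent-child idea, where a child signature matches a subgroup of the proteins matched by its parent, can be expressed as a subset test on protein hit sets. This is a simplified sketch under our own assumptions: the protein IDs are invented, the set sizes merely echo the 100/75/25 counts in the kinase example, and real curation considers more than set membership.

```python
def is_child_of(child_hits: set, parent_hits: set) -> bool:
    """A child matches a proper subset of the proteins matched by its parent."""
    return child_hits < parent_hits

# Hypothetical protein identifiers, sized to echo the kinase example.
pfam_protein_kinase = {f"P{n:03d}" for n in range(100)}
smart_serine_kinase = {f"P{n:03d}" for n in range(75)}
prosite_tyrosine_kinase = {f"P{n:03d}" for n in range(75, 100)}

print(is_child_of(smart_serine_kinase, pfam_protein_kinase))
print(is_child_of(prosite_tyrosine_kinase, pfam_protein_kinase))
# The two children are siblings: they share no proteins with each other.
print(smart_serine_kinase.isdisjoint(prosite_tyrosine_kinase))
```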
Link related signatures - relationships
2) Contains – Found in (Describes domain composition)
PFAM: Receptor family
SMART: N-terminal domain
PROSITE: C-terminal domain
Receptor family (Pfam): Contains the Smart and Prosite domains
N-terminal and C-terminal domains (Smart and Prosite): Found in the Pfam family
Both families and domains can contain domains
Link related signatures - relationships
2) Contains – Found in
Coverage: Signature must cover the entire (>90%) sequence of contained signature
Diagram: two PFAM/SMART signature pairs labelled Contains / Found in, the second also labelled Overlapping
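The coverage rule can be checked with simple interval arithmetic. In this sketch, the function names and the example coordinates are hypothetical, and matches are assumed to be 1-based inclusive (start, end) positions on the same protein.

```python
def covered_fraction(container, contained):
    """Fraction of `contained` (start, end; 1-based inclusive) lying inside `container`."""
    overlap = min(container[1], contained[1]) - max(container[0], contained[0]) + 1
    length = contained[1] - contained[0] + 1
    return max(overlap, 0) / length

def contains(container, contained, threshold=0.9):
    """True if `container` covers more than `threshold` of `contained`."""
    return covered_fraction(container, contained) > threshold

family_match = (1, 300)       # hypothetical PFAM family match
inner_domain = (50, 150)      # lies fully inside the family match
partial_domain = (250, 350)   # only overlaps the end of the family match

print(contains(family_match, inner_domain))
print(contains(family_match, partial_domain))
```

The second pair merely overlaps, so under this rule it would not qualify for a Contains / Found in relationship.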
Relationships – evolutionary context
InterPro Relationship: signatures (Criteria for Signature)
Grandparent: GENE3D (Structural family)
Parents: PFAM, PFAM (Sequence families)
Children: TIGRFAM, TIGRFAM, TIGRFAM, TIGRFAM (Functional families)
Unique to InterPro
Extensive Annotation
Annotation Fields in InterPro
• Name and short name
• Entry type (family, domain, site)
• Relationships (links related signatures)
• GO mapping (large-scale classification)
• Abstract
• Taxonomy (search/download using taxonomy)
• Examples
• Publications
Select species-specific protein sets
Links to Other Databases
Annotation Fields in InterPro
• Blocks (family alignments)
• IntEnz (enzymes)
• Prosite documents
• COME (bioinorganic motifs)
• CAZy (carbohydrate-active enzymes)
• IUPHAR (GPCR receptors)
• CluSTr (protein clusters)
• Pandit (phylogenetic trees of PFAMs)
• Merops (peptidases & inhibitors)
Structural information
Structures: PDB
Classification: CATH, SCOP
Homology Models: Swiss-Model, ModBase
Sequence-Structure Display
Signatures predictive of protein annotation
Structural data for specific proteins
AstexViewer® for structure
Structure Viewer
Navigate between structure and sequence
Manipulate structures
Other Features – splice variants
Splice variants
Other Features – domain architecture
Select data set of these proteins
Each ‘balloon’ represents a linked InterPro domain
Other Features – protein-protein interactions
Lists proteins in entry known to be involved in protein-protein interactions
IntAct database of interactions
Protein Sequence Coverage
InterPro signatures cover:
95% of UniProt/Swiss-Prot proteins
79% of UniProt/TrEMBL proteins
>4 million matches in InterPro
>16,000 InterPro entries
>50,000 signature methods
Searching InterPro
http://www.ebi.ac.uk/interpro/
Search tools include:
• Text Search
• InterProScan (sequence search)
InterPro Text Search
Text search box
Search using:
• text
• protein ID
• InterPro ID
• GO term
Search results
Direct links to entry
InterProScan Search
Use ftp site to run multiple sequences simultaneously
Member database search engines
Paste in sequence (protein/nucleotide)
InterProScan Search Results
single InterPro entry
Direct links to entry
Direct links to signature databases