Page 1: A Proteomics Toolkit:

A Proteomics Toolkit:

UniProt, InterPro and IntAct Databases at the EBI

Page 2: A Proteomics Toolkit:

Hinxton, U.K.

Page 3: A Proteomics Toolkit:

European Bioinformatics Institute (http://www.ebi.ac.uk/)

Created as part of the EMBL in 1992

• To house the EMBL Nucleotide Sequence Data Library, established in 1980

Today, 3 databases accept primary nucleotide data:

• EMBL, at the EBI (EMBL)
• DDBJ, at the CIB (NIG)
• GenBank, at the NCBI (NIH)

Page 4: A Proteomics Toolkit:

EMBL-EBI maintains the world’s most comprehensive range of molecular databases

European Bioinformatics Institute (http://www.ebi.ac.uk/)

Page 5: A Proteomics Toolkit:

Nucleotide Sequence Database

Database of Protein Families and Domains

ArrayExpress

Alternative Splicing Database

Protein Sequence Database

Molecular Structure Database

Alternative Transcript Diversity

Automatic Annotation of Genomes

Protein Interaction Database

Chemical Entities of Biological Interest

Gene Ontology

Enzyme Database

Database of Biological Processes

Page 6: A Proteomics Toolkit:

http://www.ebi.ac.uk/services/

Page 7: A Proteomics Toolkit:

Roles of Public Domain Databases

To provide stable, long-term sources of basic information

To respond over the long term to the needs of the community

To act as repositories for published information

To bridge the gap between multiple data sources

Page 8: A Proteomics Toolkit:

Protein Databases

UniProt: Database of Protein Sequences

InterPro: Database of Protein Families and Domains

IntAct: Database of Protein Interactions

Page 9: A Proteomics Toolkit:

UniProt

World's most comprehensive catalogue of information on proteins

Funded mainly by NIH

A central repository of protein sequence and function

Based on the original work of PIR, Swiss-Prot and TrEMBL

Page 10: A Proteomics Toolkit:

Protein sequencing:

Met-Gln-Pro-Glu-Glu-Gly-Thr-Gly-Trp-Leu-Leu-Glu-Val-Gln-Gln-
Met-Gly-Arg-Gly-Arg-Cys-Val-Gly-Pro-Ser-Leu-Gln-Glu-Trp-Arg-

→ annotation → Swiss-Prot

Nucleotide sequencing:

CGCGCCTGTACGCTGAACGCTCGTGACGTGTAGTGCGCG
CGCTGTGATAGCGCTGATCGTGATGCGTATGCAGGTCGT

→ EMBL

Page 11: A Proteomics Toolkit:

Nucleotide sequencing:

CGCGCCTGTACGCTGAACGCTCGTGACGTGTAGTGCGCG
CGCTGTGATAGCGCTGATCGTGATGCGTATGCAGGTCGT

→ EMBL → translated EMBL (TrEMBL) → annotation → Swiss-Prot

EBI: Swiss-Prot + TrEMBL

PIR: PSD (annotation)

Swiss-Prot + TrEMBL + PIR-PSD → UniProt

Page 12: A Proteomics Toolkit:

UniProt Consortium

Page 13: A Proteomics Toolkit:

UniProt

3 Components:

UniProt Knowledgebase (UniProt)

• Central repository for annotated protein sequences
• Swiss-Prot: non-redundant, manually annotated
• TrEMBL: redundant, automatically annotated

UniProt Reference Clusters (UniRef)

• Combines related sequences to speed up searching
• UniRef100, UniRef90, UniRef50

UniProt Archive (UniParc)

• Comprehensive repository for history of sequences
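The idea behind UniRef100, UniRef90 and UniRef50 (grouping sequences at decreasing identity thresholds) can be illustrated with a minimal sketch. It assumes made-up sequences of equal length, a naive position-by-position identity measure, and a greedy strategy. The names `identity` and `greedy_cluster` are hypothetical, and this is not presented as the procedure UniProt itself uses to build UniRef.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy measure needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Put each sequence into the first cluster whose representative
    (its first member) is at least `threshold` identical to it."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append([seq])
    return clusters


toy = ["AAAAAAAAAA", "AAAAAAAAAB", "BBBBBBBBBB"]  # made-up sequences
clusters_100 = greedy_cluster(toy, 1.0)  # analogous to UniRef100
clusters_90 = greedy_cluster(toy, 0.9)   # analogous to UniRef90
```

Lowering the threshold merges near-identical sequences, so a search only has to visit one representative per cluster.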

Page 19: A Proteomics Toolkit:
Page 20: A Proteomics Toolkit:

UniProt Explicit Links: databases cross-referenced in UniProt

Sequence: EMBL/GenBank/DDBJ, PIR

PTM: GlycoSuiteDB, PhosSite

Structure: HSSP, PDB, MSD

Domains, Sites, Families: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAM

2D-gel Electrophoresis: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DPAGE, HSC-2DPAGE, MAIZE-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE

Molecular Interaction: IntAct, TRANSFAC

Miscellaneous: Ensembl, GermOnline, Gene Ontology, MEROPS

Organism-Specific: AGD, dbSNP, DictyBase, EcoGene, EchoBASE, FlyBase, GeneDB_Spombe, GeneFarm, Genew, Gramene, HIV, H-InvDB, LegioList, Leproma, ListiList, MaizeDB, MGD, MypuList, OMIM, PhotoList, Reactome, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TIGR, TubercuList, WormBase, WormPep, ZFIN

Page 21: A Proteomics Toolkit:

http://www.ebi.ac.uk/services/

Page 22: A Proteomics Toolkit:

Searching UniProt

http://www.ebi.uniprot.org/index.shtml

Search tools include:

• Text Search
• Power Search
• Blast, Fasta and MPsrch
• Links to extra search services (including SRS)

Page 23: A Proteomics Toolkit:

http://www.ebi.uniprot.org/index.shtml

• Text-based searching
• Logical operators ‘&’ (and), ‘|’ (or)
• (Wildcards and numerical operators not allowed)

• Text Search: keyword queries
• Power Search: can search for specific entry lines
• Warehouse Search: link query to other databases

Page 24: A Proteomics Toolkit:

Text Search Results

Each linked to the UniProt entry

Page 25: A Proteomics Toolkit:

• Sequence-based searching
• BLAST, Fasta, MPsrch

Page 26: A Proteomics Toolkit:

Sequence Search Results

UniProt entry

Identity score

View alignments

Page 27: A Proteomics Toolkit:

Manipulate multiple data sets

Page 28: A Proteomics Toolkit:

Use Venn diagrams to combine, intersect, or subtract multiple data sets

Build complex data sets
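The combine, intersect and subtract operations correspond to ordinary set operations. The following is a small sketch using Python sets; the accession-like strings and the variable names are placeholders, not real search results.

```python
# Hypothetical result sets: placeholder identifiers standing in for search hits.
text_hits = {"P00001", "P00002", "P00003"}
blast_hits = {"P00002", "P00003", "P00004"}

combined = text_hits | blast_hits   # combine (union)
shared = text_hits & blast_hits     # intersect
only_text = text_hits - blast_hits  # subtract
```

Chaining these operators is one way to build up a more complex data set from several simple queries.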

Page 29: A Proteomics Toolkit:

UniProt/Swiss-Prot entry for human ubiquitin-protein ligase E3 mdm2

Page 30: A Proteomics Toolkit:

Some literature search engines pull synonyms from UniProt for more complete searching

Merged entries:

• Remove redundancy
• Can still be searched

Page 31: A Proteomics Toolkit:
Page 32: A Proteomics Toolkit:
Page 33: A Proteomics Toolkit:
Page 34: A Proteomics Toolkit:
Page 35: A Proteomics Toolkit:
Page 36: A Proteomics Toolkit:
Page 37: A Proteomics Toolkit:

IntAct Database

Page 38: A Proteomics Toolkit:
Page 39: A Proteomics Toolkit:

Summary of nucleotide data upon which entry is originally based

Structural data associated with entry protein

Page 40: A Proteomics Toolkit:

IntAct Database

Page 41: A Proteomics Toolkit:

IntAct Database

All the interactions with entry protein

Page 42: A Proteomics Toolkit:

IntAct Database

Page 43: A Proteomics Toolkit:
Page 44: A Proteomics Toolkit:

IntAct Database

Page 45: A Proteomics Toolkit:

IntAct Database

Page 46: A Proteomics Toolkit:

Experimental information

• Experimental name
• Experimental technique: co-immunoprecipitation
• Literature citation used for curation
• Taxonomic reference

Interaction information

• Links to interacting protein

Page 47: A Proteomics Toolkit:

IntAct Database

Displays interactions graphically

Page 48: A Proteomics Toolkit:

View all 7 interactions involving MDM2

View all GO interactions involving MDM2

Page 49: A Proteomics Toolkit:

View all InterPro entries associated with MDM2

Expand graph to see network surrounding one protein

Expand graph to see entire network

Page 50: A Proteomics Toolkit:

View interactions associated with both MDM2 and p53

View all proteins in a network associated with a specific GO term
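The graph views described on these slides (interactions shared by two proteins, the network surrounding one protein, the entire network) can be sketched with a plain adjacency structure. Only the names MDM2 and p53 come from the slides; the interaction list, the `partner_*` names and the functions `build_graph` and `expand` are made up for illustration and do not reflect IntAct data or its software.

```python
from collections import defaultdict

# Made-up binary interactions for illustration only.
interactions = [
    ("MDM2", "p53"),
    ("MDM2", "partner_A"),
    ("p53", "partner_A"),
    ("p53", "partner_B"),
    ("partner_B", "partner_C"),
]


def build_graph(pairs):
    """Undirected interaction graph as a dict of neighbour sets."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def expand(graph, start, depth):
    """Proteins reachable from `start` within `depth` interaction steps."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        frontier = {n for p in frontier for n in graph[p]} - seen
        seen |= frontier
    return seen


graph = build_graph(interactions)
shared_partners = graph["MDM2"] & graph["p53"]   # interact with both proteins
neighbourhood = expand(graph, "MDM2", 1)         # network surrounding one protein
whole_network = expand(graph, "MDM2", len(graph))  # everything connected to MDM2
```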

Page 51: A Proteomics Toolkit:

All proteins in red are associated with “negative regulation of cell proliferation”

Page 52: A Proteomics Toolkit:
Page 53: A Proteomics Toolkit:

Genomic location

Complete nucleotide sequence

SNP information

Transcript and protein information

Transcript structure

Page 54: A Proteomics Toolkit:
Page 55: A Proteomics Toolkit:

Interactive map. Can zoom in/out, and move around

Summary and links to information about processes involving this molecule (here cell-cycle checkpoints)

General

Specific

Page 56: A Proteomics Toolkit:

Mendelian Inheritance in Man

Page 57: A Proteomics Toolkit:

Cellular component

Molecular function

Biological process

Page 58: A Proteomics Toolkit:

InterPro Database

• Allow searching for terms
• Linked to GO

Page 59: A Proteomics Toolkit:

Domain organisation

Position of motifs and sites

Positions of variable splicing

Experimental mutation information

Sequencing conflicts

Page 60: A Proteomics Toolkit:

Secondary structure

Page 61: A Proteomics Toolkit:

Easy navigation between UniProt/UniParc/UniRef

Useful for cut/paste into search engines

Page 62: A Proteomics Toolkit:
Page 63: A Proteomics Toolkit:
Page 64: A Proteomics Toolkit:
Page 65: A Proteomics Toolkit:

UniProt/TrEMBL

>2.5 M entries in TrEMBL

Doubled since mid-2004

>200 K entries in Swiss-Prot

Doubled since mid-2001

Page 66: A Proteomics Toolkit:

UniProt

raw data

Curated automated annotation

TrEMBL ?

Swiss-Prot annotation

Page 67: A Proteomics Toolkit:

UniProt/TrEMBL

Redundancy

Automatically maintained

• Automatic clean-up of nucleotide data

• Automatic annotation

• InterPro run and cross-references updated every 2 weeks

Recognises common annotation in related Swiss-Prot entries

Identifies all members of family using InterPro

Page 68: A Proteomics Toolkit:

Curated Annotation in InterPro

Swiss-Prot annotated sequences

uncharacterised

Multiple signatures

InterPro provides annotation on multiple levels

Feeds back to TrEMBL

Page 69: A Proteomics Toolkit:

Entry name uses accession number

Automatic annotation through machine learning

Page 70: A Proteomics Toolkit:

Foundations of InterPro

Manual curation

Integration of signatures

InterPro

Page 71: A Proteomics Toolkit:

Advantages of integrated signatures

• Signature databases specialised: greater coverage of annotation features
• Greater coverage of proteins
• Relationships between signatures: evolutionary context (unique to InterPro)

Page 72: A Proteomics Toolkit:

Characterisation of Protein Sequences

Build up consensus sequences of families, domains, motifs or sites Conserved signatures

more sequences

BLAST

Basic information

Page 73: A Proteomics Toolkit:

Finding Conserved Signatures

From simplest (limited) to more information:

• Pattern
• Profile
• Fingerprint
• Sequence clustering
• HMM

Page 74: A Proteomics Toolkit:

Patterns

Patterns in sequence are written as regular expressions

Often used to define important sites within proteins

PROSITE: best-known pattern database

Page 75: A Proteomics Toolkit:

Patterns

Example: PS00262 Insulin family signature

B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx

A chain xxxxxCCxxxCxxxxxxxxCx

[Diagram: disulphide bonds link the conserved cysteines of the A and B chains]

INS_HUMAN:

MALWMRLLPL LALLALWGPD PAAAFVNQHL CGSHLVEALY LVCGERGFFY TPKTRREAED LQVGQVELGG GPGAGSLQPL ALEGSLQKRG IVEQCCTSIC SLYQLENYCN

Matched region: CCTSICSLYQLENYC

Regular expression:

C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Page 81: A Proteomics Toolkit:

Building a pattern signature:

• Sequence alignment
• Define pattern: insulin family motif
• Extract pattern sequences: xxxxxxxxxxxxxxxxxxxxxxxx
• Build regular expression

Pattern signature:

C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
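To make the PS00262 pattern concrete, here is a minimal sketch that translates the PROSITE-style notation into a Python regular expression and searches the INS_HUMAN sequence shown on the earlier slides. The function `prosite_to_regex` is a hypothetical helper written for this illustration. It covers only the syntax used here (residues, `x`, `[..]`, `{..}` and repeat counts) and is not a full PROSITE parser.

```python
import re

# One pattern element: a residue, x, [alternatives] or {exclusions},
# optionally followed by a repeat such as (2) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if repeat:
            core += "{" + repeat + "}"
        parts.append(core)
    return "".join(parts)


PS00262 = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
INS_HUMAN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFY"
    "TPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSIC"
    "SLYQLENYCN"
)

regex = prosite_to_regex(PS00262)
hit = re.search(regex, INS_HUMAN)
print(regex)
print(hit.group() if hit else "no match")
```

The match falls in the A chain region near the C-terminus of the precursor, which is the stretch highlighted on the slide.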

Page 82: A Proteomics Toolkit:

Fingerprints

Several discrete motifs characterise family

Highly specific matches to small regions of proteins

PRINTS: best-known fingerprint database

Page 83: A Proteomics Toolkit:

Fingerprints

Example: PR00107 Phosphocarrier HPr signature

PTHP_ENTFA:

MEKKEFHIVA ETGIHARPAT LLVQTASKFN SDINLEYKGK SVNLKSIMGV MSLGVGQGSD VTITVDGADE AEGMAAIVET LQKEGLAE

Annotated features: His phosphorylation site, Ser phosphorylation site, conserved site

PR00107: a fingerprint with three motifs

1) GIHARPATLLVQTASKF

2) KGKSVNLKSIMGVMSL

3) LGVGQGSDVTITVDGADE
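The requirement that a fingerprint's motifs all occur, and occur in the right order, can be sketched as follows. This simplification uses exact string matching on the three PR00107 motifs listed above, whereas a real fingerprint search would score approximate motif matches and also consider spacing. `match_fingerprint` is a hypothetical helper, not part of PRINTS.

```python
PTHP_ENTFA = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV"
    "MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE"
)

MOTIFS = [
    "GIHARPATLLVQTASKF",
    "KGKSVNLKSIMGVMSL",
    "LGVGQGSDVTITVDGADE",
]


def match_fingerprint(sequence, motifs):
    """Return the motif start positions if every motif is found in the
    given order, otherwise None (exact matching only)."""
    starts = []
    search_from = 0
    for motif in motifs:
        pos = sequence.find(motif, search_from)
        if pos == -1:
            return None
        starts.append(pos)
        # The next motif must start after this one starts.
        search_from = pos + 1
    return starts


starts = match_fingerprint(PTHP_ENTFA, MOTIFS)
wrong_order = match_fingerprint(PTHP_ENTFA, MOTIFS[::-1])
```

Because the order is enforced, presenting the same motifs in reverse yields no match, which is what makes a multi-motif fingerprint more specific than any single motif.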

Page 88: A Proteomics Toolkit:

Building a fingerprint signature:

• Sequence alignment
• Define motifs: His phosphorylation site, Ser phosphorylation site, conserved site
• Extract motif sequences

Fingerprint signature: motifs 1, 2, 3

• Correct order
• Correct spacing

Page 89: A Proteomics Toolkit:

Sequence Clustering

Automatic clustering of homologous domains

Used by ProDom database

Page 90: A Proteomics Toolkit:

Sequence Clustering

Well-characterised domain families

• Recruit homologous domains: PSI-BLAST
• Automatically cluster homologous domains: MKDOM2
• Align resulting protein domain families: ProDomAlign

Page 91: A Proteomics Toolkit:

Profiles

Sequence alignment → scoring matrix (profile) → sequence search

Page 92: A Proteomics Toolkit:

Sequence alignment (Sequences 1 to 7)

Matrix (frequency of each residue at each position in alignment)

Page 93: A Proteomics Toolkit:

Match values are higher for conserved residues

e.g. Position 1: F > Y > L (phenylalanine and tyrosine are more similar to each other than either is to leucine)

Page 97: A Proteomics Toolkit:

Profiles

Problem: insertions and deletions are not well accounted for

Can characterise proteins over entire length (need trusted sequence alignment)

Position-specific scoring is good for modelling divergent as well as conserved regions
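A position-specific scoring matrix of the kind described on the profile slides can be sketched in a few lines. The seven-sequence alignment below is invented, the background is assumed uniform, and a pseudocount is added so that unseen residues get a finite score. Real profile methods also use substitution information (the reason tyrosine scores close to phenylalanine on the slide), which this sketch leaves out. The names `build_profile` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(sequence, profile):
    """Sum of position-specific scores; gaps are not modelled."""
    return sum(column[aa] for aa, column in zip(sequence, profile))


# Invented ungapped alignment of seven sequences.
alignment = ["FKL", "FRL", "YKL", "FKI", "LKL", "FKL", "YRL"]
profile = build_profile(alignment)
```

Since `score` walks the sequence and the profile in lockstep, an inserted or deleted residue shifts every later position, which is the weakness of plain profiles noted above.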

Page 98: A Proteomics Toolkit:

Hidden Markov Models (HMM)

Large-scale profiles

Improvements:

• Probability method gauges scoring parameters
• Allows insertions and deletions

Outperform in sensitivity and specificity

More flexible (can use partial alignments)

Page 99: A Proteomics Toolkit:

Hidden Markov Models (HMM)

Sequence alignment

Begin → M1 → M2 → M3 → M4 → End

M = match state

Page 100: A Proteomics Toolkit:

Hidden Markov Models (HMM)

Begin → M1 → M2 → M3 → M4 → End

Insert states: I0, I1, I2, I3, I4

Delete states: D1, D2, D3, D4

M = match state, I = insert state, D = delete state
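The state diagram above (match states M1 to M4, insert states I0 to I4, delete states D1 to D4) can be written down as a transition table. This sketch assumes the common textbook topology in which every state at position k may move to the insert state Ik or on to position k+1. Some implementations restrict this further, for example by omitting transitions between insert and delete states, and no probabilities are modelled here. The function name is hypothetical.

```python
def profile_hmm_transitions(length):
    """Allowed transitions for a profile HMM with `length` match states."""

    def successors(k):
        # From position k a path may loop or enter insert state Ik,
        # or advance to the match or delete state of position k + 1.
        nxt = [f"I{k}"]
        if k < length:
            nxt += [f"M{k + 1}", f"D{k + 1}"]
        else:
            nxt.append("End")
        return nxt

    transitions = {"Begin": successors(0)}
    for k in range(length + 1):
        transitions[f"I{k}"] = successors(k)
    for k in range(1, length + 1):
        transitions[f"M{k}"] = successors(k)
        transitions[f"D{k}"] = successors(k)
    transitions["End"] = []
    return transitions


hmm = profile_hmm_transitions(4)
for state, targets in hmm.items():
    print(state, "->", ", ".join(targets))
```

The self-loop on each insert state is what lets the model absorb insertions of any length, and the delete states let a path skip a column.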

Page 101: A Proteomics Toolkit:

Hidden Markov Models (HMM)

HMMER2 package: http://hmmer.wustl.edu/

HMMbuild → HMMcalibrate → Database search

Page 102: A Proteomics Toolkit:

Hidden Markov Models (HMM)

HMM databases:

• Families conserved in sequence: PIR SUPERFAMILY, PANTHER, TIGRFAM
• Domains conserved in sequence: PFAM, SMART
• Domains conserved in structure (special case): SUPERFAMILY, GENE3D

Page 104: A Proteomics Toolkit:

SAM Profile HMMs

(http://www.cse.ucsc.edu/research/compbio/sam.html)

SUPERFAMILY + GENE3D

• Homologous Structural Superfamilies
• Proteins related by structure
• May have low sequence identity
• Often only 1 protein in a family with structural information

SAM:

• Start with single seed sequence
• Uses Target99 (T99) script
• Multiple models per superfamily
• Combine results

Page 105: A Proteomics Toolkit:

SAM T99 Profile HMMs

T99 script:

Single seed sequence (GIHARPATLLVQTASKF) → WU-BLASTP search (close homologues, low identity matches) → initial HMM → new larger alignment → final HMM

Page 106: A Proteomics Toolkit:

Summary of signature methods

Sequence alignment

• Single motif method: extract motif pattern (PROSITE)
• Multiple motif methods: extract multiple motifs (PRINTS)
• Full alignment methods (full sequence):
  1) profile (PROSITE)
  2) HMM (PFAM, SMART, SUPERFAMILY, TIGRFAM, PIRSF, GENE3D, PANTHER)

Page 107: A Proteomics Toolkit:

Protein Signature Databases

Patterns: Prosite

Fingerprints: Prints

Sequence clustering: ProDom

Profiles: Prosite

HMM: PIR Superfamily, Panther, Tigrfam, Pfam, Smart

T99-SAM HMM: Gene3D, Superfamily

Page 108: A Proteomics Toolkit:

Prints

Fingerprint is a set of motifs

Full length of protein

PR00000

Can identify small conserved regions in divergent proteins

Use different combinations of motifs to describe families and sibling subfamilies

http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/

Page 109: A Proteomics Toolkit:

Prosite Patterns

Pattern is a regular expression

PS00000

Identify various important sites within proteins:

• Enzyme catalytic site
• Prosthetic group attachment
• Metal ion binding site
• Cysteines for disulphide bonds
• Protein or molecule binding

Several models characterise enzymes

Used by UniProt to define catalytic sites

http://us.expasy.org/prosite/

Page 110: A Proteomics Toolkit:

Prosite Profiles

Profile: PS00000

Pattern: PS00000

Describe protein families or domains conserved in sequence

Use curated sequence alignments

Accurate

Profile is a multiple alignment with matrix frequencies

http://us.expasy.org/prosite/

Page 111: A Proteomics Toolkit:

ProDom

Sequence clustering method: automatic process (mkdom2)

PD000000

Groups UniProt sequences into (core) domains conserved in sequence

http://protein.toulouse.inra.fr/prodom/current/html/home.php

Page 112: A Proteomics Toolkit:

Pfam

HMM models built from HMMER2

PF00000

Pfam A: manually curated

Pfam B: automatic clustering

Use trusted cut-offs: accurate

Wide coverage of protein families and domains conserved in sequence

http://www.sanger.ac.uk/Software/Pfam/

Only Pfam A used to build signatures in InterPro

Page 113: A Proteomics Toolkit:

Smart

HMM domains using curated sequence alignments of families from psi-blast

SM00000

Primarily describe domains conserved in sequence

Concentrate on signalling proteins, and extracellular and nuclear domains

http://smart.embl-heidelberg.de/

Page 114: A Proteomics Toolkit:

Tigrfams

HMM families built with curated alignments

TIGR00000

Describe protein families (and domains) conserved in sequence and function

Functional classifications using equivalogs (functionally conserved homologues)

Curated trusted cut-off: very accurate

Use phylogenetic trees: accurate family membership

http://www.tigr.org/TIGRFAMs/

Page 115: A Proteomics Toolkit:

PIRSF

http://pir.georgetown.edu/pirsf/

HMM families using computationally defined non-overlapping clusters of sequences

PIRSF000000

Comprehensive protein family database of full-length models

Describe protein families conserved in sequence and domain composition:

Homeomorphic

Page 116: A Proteomics Toolkit:

Panther

https://panther.appliedbiosystems.com/

HMM families based on phylogenetic trees

PTHR00000

Comprehensive protein family database of full-length models

Provides family classification by functions, processes, pathways and taxonomy

Use phylogenetic trees: define functionally distinct families

Page 117: A Proteomics Toolkit:

Superfamily

HMMs based on SCOP structural superfamilies

Describe protein domains conserved in structure with evidence of common evolutionary origin

Provides information on structural classification

http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/

Good at describing non-contiguous structural domains

SSF00000

Often define structural domain boundaries

Page 118: A Proteomics Toolkit:

Gene3D

HMM domains based on CATH structural superfamily

G3D.0.0.0.0

Provides information on structural classification

http://cathwww.biochem.ucl.ac.uk/latest/index.html

Describe protein domains conserved in structure with evidence of common evolutionary origin

Always define structural domain boundaries


Good at describing non-contiguous structural domains

Page 119: A Proteomics Toolkit:

Specialisation of databases

Prints: Describe sibling families

Prosite: Identify binding and active sites (enzymes)

ProDom: Describe conserved core of domains

Pfam: Wide coverage of domains and families

Smart: Signalling, extracellular & nuclear domains

Tigrfam: Functional classification of equivalogs

PIRSF: Homeomorphs, conserved in domain composition

Panther: Functional families; best at detecting fragments

Superfamily: Structural-based domain classification

Gene3D: Describe structural domain boundaries

Page 120: A Proteomics Toolkit:

Structural Representation in InterPro

MSD: residue-by-residue mapping of PDB sequence to UniProt amino acid position

InterPro sequence-structure comparison

Page 121: A Proteomics Toolkit:

Structural Representation

PDB structures displayed as striped patterns

Structural classification in CATH and SCOP

Homology models from Swiss-Model and ModBase

Page 122: A Proteomics Toolkit:

Structural Representation

CATH and SCOP divide PDB structures into domains

Swiss-Model and ModBase predict structure for regions not covered by PDB

Note that one domain is non-contiguous

Page 123: A Proteomics Toolkit:

Sequence-Structure Display

Structural data for specific proteins

Signatures predictive of protein annotation

Page 124: A Proteomics Toolkit:

Searching InterPro

http://www.ebi.ac.uk/interpro/

Search tools include:

• Text Search
• InterProScan (sequence search)
• SRS (multiple database search)

Page 125: A Proteomics Toolkit:

Text Search Results

Direct links to entry

Page 126: A Proteomics Toolkit:

InterProScan search results

Single InterPro entry

Link to InterPro entry

Link to SRS view of InterPro entry: enables direct searching of other databases in SRS using InterProScan results

Link to signature database

Mouse-over provides signature data: residue position, E-value, accession ID, and name

Page 127: A Proteomics Toolkit:

InterPro Entry

• Groups similar signatures together and provides relationships between signatures

• Provides extensive manual annotation

• Provides links to other databases

• Provides structural information and viewers

Page 128: A Proteomics Toolkit:

Annotation Fields in InterPro

• Name and short name
• Entry type
• Relationships
• GO mapping
• Abstract
• Structural links
• Database links
• Taxonomy
• Examples
• Publications

Page 129: A Proteomics Toolkit:

InterPro entry for the ligand-binding domain of the nuclear hormone receptor

Page 130: A Proteomics Toolkit:

Protein matches

Page 131: A Proteomics Toolkit:

Shows the InterPro entries that match a protein

Page 132: A Proteomics Toolkit:

Protein matches

Shows each individual signature that matches a protein

Shows structural information for protein with links to PDB, CATH, SCOP

Page 133: A Proteomics Toolkit:

Protein matches

Page 134: A Proteomics Toolkit:

Protein matches

Splice variants

Page 135: A Proteomics Toolkit:

Select data set of these proteins

Page 136: A Proteomics Toolkit:

Detailed information

Family, domain, site, repeat

Links to signature databases

Relationships linking different signatures:

• Contains/Found in: describe composition of protein sequences
• Parent/Child: family or domain evolutionary hierarchies

Mapping to GO terms

Abstract with references

Page 137: A Proteomics Toolkit:

Structural links

Page 138: A Proteomics Toolkit:

Database links

Page 139: A Proteomics Toolkit:

Taxonomy

Page 140: A Proteomics Toolkit:

Overlap with other InterPro entries

Examples

References

Page 141: A Proteomics Toolkit:

Powerful Annotation Tool

Integration of signatures

• Greater coverage of annotation features
• Relationships provide evolutionary context (unique to InterPro)
• Increased coverage of proteins
• Enhances functional annotation of TrEMBL

Page 142: A Proteomics Toolkit:

Powerful Annotation Tool

Database links: To several databases to increase annotation

Taxonomy: Search/download using taxonomy

GO mapping: Large-scale classification using GO terms

Structural information: Structural classification, 3-D viewers

Signature databases: Direct links to their annotation

Page 143: A Proteomics Toolkit:

Coverage

InterPro signatures cover:

• 90% of UniProt/Swiss-Prot proteins
• 69% of UniProt/TrEMBL proteins

>2 million matches in InterPro

>13,000 InterPro entries

>22,000 signature methods

Page 144: A Proteomics Toolkit:

Coverage

Structural coverage in InterPro:

• 0.6% of proteins have PDB structures
• 20% of proteins have Swiss-Model structures
• 63% of proteins have ModBase structures

>9500 PDB structures in InterPro

>300,000 Swiss-Model links in InterPro

>950,000 ModBase links in InterPro

Page 145: A Proteomics Toolkit:

Availability and downloads

Web access

Tool/Databases: http://www.ebi.ac.uk/services/

Downloads

ftp site: ftp://ftp.ebi.ac.uk/pub/databases/

Page 146: A Proteomics Toolkit:

2Can Training and Education

Bioinformatics Educational Resource

• Information on EBI Databases
• On-line tutorials on EBI Databases and tools
• Glossary
• Guide to bioinformatics resources on the internet

• EBI web services
• Protein structure
• Nucleotide analysis
• Proteomics analysis
• Protein function
• Genome browsing
• Database browsing

Page 147: A Proteomics Toolkit:

http://www.ebi.ac.uk/

Page 148: A Proteomics Toolkit:

http://www.ebi.ac.uk/2can/

Page 149: A Proteomics Toolkit:

http://www.ebi.ac.uk/interpro/

Page 150: A Proteomics Toolkit:

Acknowledgements

Rolf Apweiler
Amos Bairoch
Cathy Wu
+100 annotators

InterPro Team
Nicky Mulder
InterPro Consortium

IntAct Team
Henning Hermjakob