Protein Sequence
Databases for Proteomics
The good, the bad & the ugly
US HUPO: Bioinformatics for Proteomics
Nathan Edwards – March 12, 2005
Protein Sequence Databases
• Link between mass spectra and proteins
• A protein’s amino-acid sequence provides a basis for interpreting
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
• We must interpret database information as carefully as mass spectra.
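To illustrate how a sequence serves as the basis for interpreting enzymatic digestion, here is a minimal sketch of an in-silico tryptic digest. It assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P) and ignores missed cleavages. The function name `tryptic_peptides` and the example sequence are illustrative only.

```python
import re


def tryptic_peptides(sequence: str) -> list:
    """Split a protein sequence at simplified trypsin cleavage sites."""
    # Zero-width split after K or R, except when P follows (needs Python 3.7+).
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # A sequence ending in K or R leaves an empty trailing piece; drop it.
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    # Made-up toy sequence, not a real protein.
    print(tryptic_peptides("MKRPAKG"))
```

Any error in the database sequence changes the predicted peptides, which is one reason database information deserves the same scrutiny as the spectra.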
More than sequence…
Protein sequence databases provide much more than sequence:
• Names
• Descriptions
• Facts
• Predictions
• Links to other information sources
Protein databases provide a link to the current state of our understanding about a protein.
Much more than sequence
Names
• Accession, Name, Description
Biological Source
• Organism, Source, Taxonomy
Literature
Function
• Biological process, molecular function, cellular component
• Known and predicted
Features
• Polymorphism, Isoforms, PTMs, Domains
Derived Data
• Molecular weight, pI
Database types
Curated
• Swiss-Prot
• PIR
• RefSeq NP
Translated
• TrEMBL
• RefSeq XP, ZP
Omnibus
• NCBI’s nr
• MSDB
• IPI
Other
• PDB
• HPRD
Accessions
• Permanent labels
• Short, machine readable
• Enable precise communication
• Typos render them unusable!
• Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367; S60367
  • GO: GO:0003700
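Because each database uses its own accession format, a simple format check can catch some typos before they reach a search. The sketch below is illustrative only: the patterns cover the six-character Swiss-Prot style, Ensembl human gene identifiers, and GO terms as they appear in the examples above, and the name `classify_accession` is hypothetical. PIR is left out because its format is not described in the slides.

```python
import re
from typing import Optional

# Illustrative patterns only; each database documents its own format.
ACCESSION_PATTERNS = {
    # Six-character Swiss-Prot style accession, e.g. P17947.
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    # Ensembl human gene identifier, e.g. ENSG00000066336.
    "Ensembl": re.compile(r"ENSG[0-9]{11}"),
    # Gene Ontology term, e.g. GO:0003700.
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text: str) -> Optional[str]:
    """Return the database whose format matches text exactly, else None."""
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(text):
            return database
    return None


if __name__ == "__main__":
    for candidate in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
        print(candidate, classify_accession(candidate))
```

A format check only rejects malformed strings. A typo that yields another well-formed accession (for example, two transposed digits) still passes, so it does not replace a lookup against the database itself.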
Names / IDs
• Compact mnemonic labels
• Not guaranteed permanent
• Require careful curation
• Conceptual objects
• Swiss-Prot names changed recently!
• ALBU_HUMAN
  • Serum Albumin
• RT30_HUMAN
  • Mitochondrial 28S ribosomal protein S30
• CP3A7_HUMAN
  • Cytochrome P450 3A7
Description / Name
• Free text description
• Human readable
• Space limited
• Hard for computers to interpret!
• No standard nomenclature or format
• Often abused…
• COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]
Organism / Species / Taxonomy
• The protein’s organism…
• …or the source of the biological sample
• The most reliable sequence annotation available
• Useful only to the extent that it is correct
• NCBI’s taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don’t necessarily keep up
• Organism-specific sequence databases are starting to become available.
Organism / Species / Taxonomy
• Buffalo rat
• Gunn rats
• Norway rat
• Rattus PC12 clone IS
• Rattus norvegicus
• Rattus norvegicus8
• Rattus norwegicus
• Rattus rattiscus
• Rattus sp.
• Rattus sp. strain Wistar
• Sprague-Dawley rat
• Wistar rats
• brown rat
• laboratory rat
• rat
• rats
• zitter rats
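One way to cope with a list like this is to map free-text organism labels onto a single taxonomy identifier and flag everything else for manual review. The sketch below assumes a hand-built synonym table, which is hypothetical and reflects a curation decision (for instance, treating a bare "rat" as Rattus norvegicus). The only external fact used is NCBI taxonomy ID 10116 for Rattus norvegicus.

```python
from typing import Optional

# NCBI taxonomy identifier for Rattus norvegicus.
RATTUS_NORVEGICUS_TAXID = 10116

# Hypothetical, hand-curated synonyms; deciding what belongs here is
# exactly the curation work a controlled vocabulary requires.
RAT_SYNONYMS = {
    "rattus norvegicus",
    "norway rat",
    "brown rat",
    "laboratory rat",
    "sprague-dawley rat",
    "wistar rats",
    "rat",
    "rats",
}


def normalize_organism(label: str) -> Optional[int]:
    """Map a free-text organism label to a taxonomy ID, or None if unknown."""
    key = " ".join(label.lower().split())  # fold case and collapse whitespace
    if key in RAT_SYNONYMS:
        return RATTUS_NORVEGICUS_TAXID
    return None  # misspellings and ambiguous labels need a human decision


if __name__ == "__main__":
    for label in ["Norway rat", "Rattus norwegicus", "Rattus sp."]:
        print(label, normalize_organism(label))
```

Misspelled labels such as "Rattus norwegicus" fall through to `None` on purpose: guessing silently would hide the annotation problem rather than fix it.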
Controlled Vocabulary
• Middle ground between computers and people
• Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
• Vocabulary / Ontology must be established
  • Human curation
• Link between concept and object:
  • Manually curated
  • Automatic / Predicted
Ontology Structure
• NCBI Taxonomy
  • Tree
• Gene Ontology (GO)
  • Molecular function
  • Biological process
  • Cellular component
  • Directed acyclic graph (DAG)
• Unstructured labels
  • Overlapping?
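The practical difference between a tree and a DAG is that a DAG term can have more than one parent, so "all broader concepts" must be collected by walking every parent path. The sketch below uses a made-up toy ontology; the term names and the `PARENTS` table are hypothetical and are not taken from GO.

```python
# Toy ontology: each term maps to its direct parents. Names are made up.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "regulator activity": ["root"],
    "DNA binding": ["binding"],
    # Two parents: this is what makes the structure a DAG, not a tree.
    "transcription factor": ["DNA binding", "regulator activity"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("transcription factor", PARENTS)))
```

In a tree the same walk would follow a single chain to the root; in a DAG a protein annotated with one term is implicitly annotated with all ancestors along every path.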
Protein Families
• Similar sequence implies similar function
• Similar structure implies similar function
• Common domains imply similar function
• Bootstrap up from small sets of proteins with well understood characteristics
• Usually a hybrid manual / automatic approach
Protein Families
• PROSITE, Pfam, InterPro, PRINTS
• Gene Ontology
• Swiss-Prot keywords
• Differences:
  • Motif style, ontology structure, degree of manual curation
• Similarities:
  • Primarily sequence based, cross species
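As an example of one "motif style", PROSITE patterns are short sequence signatures that translate naturally into regular expressions. The sketch below handles only the basic syntax elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors); the function name `prosite_to_regex` is hypothetical, and profile-based resources such as Pfam work quite differently.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P".
        element = element.replace("{", "[^").replace("}", "]")
        # x means "any residue".
        element = element.replace("x", ".")
        # (n) or (n,m) are repeat counts.
        element = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", element)
        # < and > anchor the pattern to the N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    # N-glycosylation site pattern, searched in a made-up toy sequence.
    regex = prosite_to_regex("N-{P}-[ST]-{P}")
    print(regex, re.search(regex, "ANGSA"))
```

Short patterns like this match by chance in many proteins, which is one reason the degree of manual curation differs so much between family resources.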
Sequence Variants
• Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification
• Sequence databases typically do not capture all versions of a protein’s sequence
Sequence Variants
Swiss-Prot; a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases
- Swiss-Prot web site front page
Sequence Variants
b) Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
- Swiss-Prot User Manual, Section 1.1
Sequence Variants
IPI provides a top level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI:
1. effectively maintains a database of cross references between the primary data sources
2. provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
3. maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases.
- IPI web site front page
Sequence Variants
• Swiss-Prot variants, isoforms and conflicts are retained as features
• Script varsplic.pl can enumerate all sequence variants
• Command-line options for full enumeration:
  -which full -varsplic -variant -conflict
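To make the idea of enumerating variants concrete, the sketch below expands a base sequence into one record per single-residue substitution. This is not how varsplic.pl works internally; the labels, the function names, and the toy input are all hypothetical, and splice isoforms and conflicts are not handled.

```python
def apply_substitution(sequence, position, original, replacement):
    """Return sequence with a 1-based single-residue substitution applied."""
    index = position - 1
    if sequence[index] != original:
        raise ValueError(
            f"expected {original} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


def enumerate_variants(accession, sequence, substitutions):
    """Yield (label, sequence) for the base sequence and each single variant."""
    yield f"{accession}-00", sequence
    for number, (position, original, replacement) in enumerate(substitutions, start=1):
        variant = apply_substitution(sequence, position, original, replacement)
        yield f"{accession}-{number:02d}", variant


if __name__ == "__main__":
    # Toy accession and sequence; the features are invented for illustration.
    for label, seq in enumerate_variants("TOY", "MAVMA", [(2, "A", "G"), (5, "A", "T")]):
        print(label, seq)
```

Checking the expected residue before substituting guards against features whose coordinates refer to a different version of the sequence.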
Swiss-Prot Variant Annotations
[Screenshots of Swiss-Prot variant annotations: feature viewer, variants]
Swiss-Prot VarSplic Output
P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-00-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-03-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-03-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-04-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-01-04-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-00-05-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-05-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-00-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-02-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-02-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
******************************************:*****************
Swiss-Prot VarSplic Output
P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-00-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-01-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ
P13746-01-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ
************************************* *******:*********
Omnibus Database Redundancy Elimination
• Source databases often contain the same sequences with different descriptions
• Omnibus databases keep one copy of the sequence, and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source preference
• Good definitions can be lost, including taxonomy
Omnibus Database Redundancy Elimination
NCBI’s nr: Keeps all descriptions, separated by ^A
MSDB: Pecking order: PIR1-4, TrEMBL, GenBank, Swiss-Prot, NRL3D
IPI: All accessions, one description
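When a database keeps all descriptions, as nr does, the reader still has to pick one. The sketch below splits an nr-style header on the ^A (Ctrl-A, `\x01`) separator and chooses a description by source preference. The ranking in `SOURCE_RANK` is an assumption made for this example, not the policy of any database, and the function names are hypothetical.

```python
# Assumed source preference for this example only (lower is preferred).
SOURCE_RANK = {"sp": 0, "ref": 1, "pir": 2, "gb": 3, "emb": 4, "dbj": 5, "pdb": 6, "prf": 7}


def split_defline(defline):
    """Split an nr-style header into its individual descriptions."""
    return [part for part in defline.lstrip(">").split("\x01") if part]


def source_tag(description):
    """Return the database tag from a 'gi|number|tag|...' description."""
    fields = description.split("|")
    return fields[2] if len(fields) > 2 else ""


def preferred_description(defline):
    """Pick the description whose source ranks best in SOURCE_RANK."""
    parts = split_defline(defline)
    return min(parts, key=lambda d: SOURCE_RANK.get(source_tag(d), len(SOURCE_RANK)))


if __name__ == "__main__":
    header = ">" + "\x01".join([
        "gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]",
        "gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]",
        "gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4",
    ])
    print(preferred_description(header))
```

Keeping every description and choosing at read time avoids the loss that occurs when the first or an arbitrary description is stored, such as "hypothetical protein" for a sequence that is well characterized elsewhere.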
Description Elimination
• gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
• gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
• gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
• gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
• gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
• gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
Description Elimination
• gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
• gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
• gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
• gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
• gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
• gi|1585500|prf||2201313A UDP galactose 4'-epimerase
Description Elimination
• gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
• gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I [validated] – human
• gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog [human, liver, Peptide, 323 aa]
• gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
• gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
• gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
Translated sequences
• Gene models describe introns and exons
  • Start site?
  • Splice sites?
  • Alternative splicing?
• ESTs provide limited evidence of transcription only
• There is a lot we don’t know about what protein sequences result from a gene
• Recent revision of the number of human genes suggests a bigger role for alternative splicing.
Translated sequences
Lewis et al. PNAS 2003
Molecular Weight / pI
• Documentation is often lacking
  • Monoisotopic or average weight?
  • N-terminal H; C-terminal OH?
  • Protonated or not?
  • Sequence variants or not?
• Swiss-Prot: Rounded average uncharged molecular weight, including N & C-term.
• TIGR (bioperl): Rounded average uncharged molecular weight, no N & C-term.
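The questions above can each shift a reported weight, which the following sketch makes explicit by exposing them as parameters. The residue masses are commonly tabulated values and should be checked against an authoritative table before serious use. The function name `peptide_mass` and its parameters are hypothetical, and no rounding is applied.

```python
# Commonly tabulated residue masses in daltons (verify before serious use).
AVERAGE = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_AVERAGE = 18.01528       # N-terminal H plus C-terminal OH
WATER_MONOISOTOPIC = 18.01056
PROTON = 1.00728


def peptide_mass(sequence, monoisotopic=False, include_termini=True, protonated=False):
    """Sum residue masses under an explicitly chosen convention."""
    table = MONOISOTOPIC if monoisotopic else AVERAGE
    mass = sum(table[residue] for residue in sequence)
    if include_termini:
        mass += WATER_MONOISOTOPIC if monoisotopic else WATER_AVERAGE
    if protonated:
        mass += PROTON
    return mass


if __name__ == "__main__":
    toy = "MAVMAPR"  # first residues of the P13746 sequence shown earlier
    print(peptide_mass(toy))                         # average, with termini
    print(peptide_mass(toy, include_termini=False))  # average, no termini
    print(peptide_mass(toy, monoisotopic=True, protonated=True))
```

With and without the terminal groups, the same sequence differs by about 18 Da, which is the gap between the two conventions listed above.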
Summary
• Protein sequence databases should be interpreted with as much care as mass spectra
• Use controlled vocabularies
• Understand the structure of ontologies
• Take advantage of computational predictions
• Look for sequence variants
• Be careful with omnibus databases