Protein Sequence Databases for Proteomics: The good, the bad & the ugly
US HUPO: Bioinformatics for Proteomics
Nathan Edwards – March 12, 2006


Protein Sequence Databases

• Link between mass spectra and proteins
• A protein’s amino-acid sequence provides a basis for interpreting
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
• We must interpret database information as carefully as mass spectra.
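As one illustration of how a sequence supports the interpretation of enzymatic digestion, here is a minimal sketch of an in silico tryptic digest. It assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P; the function name and the `missed_cleavages` parameter are hypothetical and not taken from the slides.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from cleaving after K/R, except before P (assumed rule)."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal peptide

    # Join neighbouring pieces to model up to `missed_cleavages` missed sites.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


print(tryptic_digest("AKRPLKM"))
print(tryptic_digest("AKRPLKM", missed_cleavages=1))
```

A wrong residue in the database entry changes the predicted peptides, which is why the database deserves the same scrutiny as the spectra.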


More than sequence…

Protein sequence databases provide much more than sequence:

• Names
• Descriptions
• Facts
• Predictions
• Links to other information sources

Protein databases provide a link to the current state of our understanding about a protein.


Much more than sequence

Names
• Accession, Name, Description

Biological Source
• Organism, Source, Taxonomy

Literature

Function
• Biological process, molecular function, cellular component
• Known and predicted

Features
• Polymorphism, Isoforms, PTMs, Domains


Database types

Curated
• Swiss-Prot
• PIR
• RefSeq NP

Translated
• TrEMBL
• RefSeq XP, ZP

Omnibus
• NCBI’s nr
• MSDB
• IPI

Other
• PDB
• HPRD
• EST
• Genomic


Human Sequences

• The number of human genes is believed to be between 20,000 and 25,000

PIR ~ 10,500

Swiss-Prot ~ 12,000

RefSeq ~ 28,000

IPI-HUMAN ~ 48,000

TrEMBL ~ 52,000

MSDB ~ 105,000


Accessions

• Permanent labels
• Short, machine readable
• Enable precise communication
• Typos render them unusable!
• Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367; S60367
  • GO: GO:0003700;
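Because a single typo makes an accession unusable, it can help to check accessions against each database's format before using them. The sketch below is illustrative only: the Swiss-Prot pattern follows the accession format published by UniProt, the Ensembl and GO patterns are assumptions based on the examples above, and PIR is left out because its format is not described here. The function name is hypothetical.

```python
import re

# Patterns are assumptions for illustration; consult each database for its rules.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the first database whose pattern matches the whole token, else None."""
    token = text.strip().rstrip(";")
    for database, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(token):
            return database
    return None


for candidate in ["P17947", "ENSG00000066336", "GO:0003700;", "P1794"]:
    print(candidate, "->", classify_accession(candidate))
```

A format check only catches malformed strings; a well-formed but wrong accession still points at the wrong entry.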


Names / IDs

• Compact mnemonic labels
• Not guaranteed permanent
• Require careful curation
• Conceptual objects
• Swiss-Prot names changed last year!
• ALBU_HUMAN
  • Serum Albumin
• RT30_HUMAN
  • Mitochondrial 28S ribosomal protein S30
• CP3A7_HUMAN
  • Cytochrome P450 3A7
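The names above share a visible shape: a protein mnemonic, an underscore, and a species code. A small sketch of splitting them apart follows; the function name is hypothetical, and since names are not guaranteed permanent, the result is better treated as a display label than as a database key.

```python
def split_entry_name(entry_name):
    """Split a name such as ALBU_HUMAN into (protein mnemonic, species code)."""
    mnemonic, separator, species = entry_name.partition("_")
    if not separator or not mnemonic or not species:
        raise ValueError(f"not of the form MNEMONIC_SPECIES: {entry_name!r}")
    return mnemonic, species


for name in ["ALBU_HUMAN", "RT30_HUMAN", "CP3A7_HUMAN"]:
    print(name, "->", split_entry_name(name))
```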


Description / Name

• Free text description
• Human readable
• Space limited
• Hard for computers to interpret!
• No standard nomenclature or format
• Often abused…
• COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]


FASTA Format

• >
• Accession number
  • No uniform format
  • Multiple accessions separated by |
• One line of description
  • Usually pretty cryptic
• Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
• Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
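The layout described above can be read with a few lines of code. This is a minimal sketch rather than a complete parser: the function names are hypothetical, the residues in `EXAMPLE` are placeholders rather than the real protein sequence, and the header handling assumes only that the identifier ends at the first space and that accessions are separated by `|`.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; sequence lines are joined together."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into its |-separated identifier fields and the description."""
    identifier, _, description = header.partition(" ")
    fields = [field for field in identifier.split("|") if field]
    return fields, description


# Header taken from the slides; the residues below are placeholders.
EXAMPLE = """>gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
MKTAY
IAKQR
"""

for header, sequence in read_fasta(EXAMPLE.splitlines()):
    print(split_header(header), sequence)
```

Anything beyond this split (which field is the accession, where the organism sits) depends on the source database, which is the point of the slide.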


Organism / Species / Taxonomy

• The protein’s organism…
  • …or the source of the biological sample
• The most reliable sequence annotation available
  • Useful only to the extent that it is correct
• NCBI’s taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don’t necessarily keep up
• Organism specific sequence databases are also available.


Organism / Species / Taxonomy

• Buffalo rat
• Gunn rats
• Norway rat
• Rattus PC12 clone IS
• Rattus norvegicus
• Rattus norvegicus8
• Rattus norwegicus
• Rattus rattiscus
• Rattus sp.
• Rattus sp. strain Wistar
• Sprague-Dawley rat
• Wistar rats
• brown rat
• laboratory rat
• rat
• rats
• zitter rats
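Free-text organism labels like these have to be mapped onto a controlled identifier before entries can be filtered by species. The sketch below assumes a small, hand-curated synonym table (hypothetical, and deliberately incomplete) that maps cleaned-up labels to NCBI Taxonomy ID 10116 for Rattus norvegicus; anything not in the table is returned as `None` so that a person can decide.

```python
import re

RATTUS_NORVEGICUS = 10116  # NCBI Taxonomy ID for Rattus norvegicus

# Hypothetical curated table; real pipelines need a much longer, reviewed list.
SYNONYMS = {
    "rattus norvegicus": RATTUS_NORVEGICUS,
    "norway rat": RATTUS_NORVEGICUS,
    "brown rat": RATTUS_NORVEGICUS,
}


def normalise_organism(label):
    """Map a free-text label to a taxonomy ID, or None when it needs curation."""
    key = re.sub(r"[^a-z ]", "", label.lower())   # drop digits and punctuation
    key = re.sub(r"\s+", " ", key).strip()        # collapse whitespace
    return SYNONYMS.get(key)


for label in ["Rattus norvegicus8", "Norway rat", "Rattus norwegicus", "Rattus sp."]:
    print(repr(label), "->", normalise_organism(label))
```

The misspelled "Rattus norwegicus" is left unmapped on purpose: guessing at typos automatically is how taxonomy annotation goes wrong.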


Controlled Vocabulary

• Middle ground between computers and people
• Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
• Vocabulary / Ontology must be established
  • Human curation
• Link between concept and object:
  • Manually curated
  • Automatic / Predicted


Ontology Structure

• NCBI Taxonomy
  • Tree
• Gene Ontology (GO)
  • Molecular function
  • Biological process
  • Cellular component
  • Directed, Acyclic Graph (DAG)
• Unstructured labels
  • InterPro, Pfam, Swiss-Prot keywords
  • Overlapping?
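The practical difference between a tree and a DAG is that a term in a DAG can have several parents, so "everything above this term" has to be collected by walking all parent links. A minimal sketch, using made-up term identifiers rather than real GO terms:

```python
# Hypothetical toy ontology: each term lists its direct parents.
PARENTS = {
    "term:D": ["term:B", "term:C"],  # two parents, so this is not a tree
    "term:B": ["term:A"],
    "term:C": ["term:A"],
    "term:A": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("term:D", PARENTS)))
```

In a tree such as the NCBI Taxonomy the same walk visits a single path to the root; unstructured labels have no parent links to walk at all.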


Protein Families

• Similar sequence implies similar function
• Similar structure implies similar function
• Common domains imply similar function

• Bootstrap up from small sets of proteins with well understood characteristics

• Usually a hybrid manual / automatic approach


Protein Families

• PROSITE, Pfam, InterPro, PRINTS
• Swiss-Prot keywords
• Differences:
  • Motif style, ontology structure, degree of manual curation
• Similarities:
  • Primarily sequence based, cross species
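To make "motif style" concrete, the sketch below converts a PROSITE-style pattern into a regular expression and scans a sequence with it. It handles only the basic syntax elements (`-` separators, `x` wildcards, `{..}` exclusions, `(n)` and `(n,m)` repeats, `<` and `>` anchors). The pattern used is the widely cited N-glycosylation motif, the test sequences are made up, and the function name is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into Python regex syntax."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                        # element separator
    regex = regex.replace("x", ".")                       # any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # excluded residues
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)  # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")     # terminus anchors
    return regex


motif = prosite_to_regex("N-{P}-[ST]-{P}")
print(motif)
print(bool(re.search(motif, "AANGSAA")), bool(re.search(motif, "AANPSAA")))
```

Profile- and HMM-based resources such as Pfam describe families statistically instead, which is one reason their hits do not line up one-to-one with pattern matches.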


Gene Ontology

• Hierarchical
  • Molecular function
  • Biological process
  • Cellular component
• Describes the vocabulary only!
  • Protein families provide GO association
• Not necessarily any appropriate GO category.
• Not necessarily in all three hierarchies.
• Sometimes general categories are used because none of the specific categories are correct.


Protein Family / Gene Ontology


Sequence Variants

• Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification

• Sequence databases typically do not capture all versions of a protein’s sequence


Sequence Variants

Swiss-Prot; a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases

- Swiss-Prot web site front page


Sequence Variants

b) Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

- Swiss-Prot User Manual, Section 1.1


Sequence Variants

IPI provides a top level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI:

1. effectively maintains a database of cross references between the primary data sources

2. provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)

3. maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases.

- IPI web site front page


Sequence Variants

• Swiss-Prot variants, isoforms and conflicts are retained as features

• Script varsplic.pl can enumerate all sequence variants

• Command-line options for full enumeration
  • -which full -varsplic -variant -conflict
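To show what "enumerate all sequence variants" means, here is a small sketch that expands a base sequence with every combination of annotated single-residue substitutions. It is not how varsplic.pl is implemented and it ignores splice isoforms. The function name, the tuple layout and the example fragment are assumptions chosen for illustration.

```python
from itertools import product


def enumerate_variants(sequence, substitutions):
    """Return {label: sequence} for every on/off combination of substitutions.

    substitutions: list of (1-based position, expected residue, replacement).
    """
    for position, expected, _ in substitutions:
        if sequence[position - 1] != expected:
            raise ValueError(f"position {position} is not {expected}")

    results = {}
    for choice in product([False, True], repeat=len(substitutions)):
        residues = list(sequence)
        for applied, (position, _, replacement) in zip(choice, substitutions):
            if applied:
                residues[position - 1] = replacement
        label = "".join("1" if applied else "0" for applied in choice)
        results[label] = "".join(residues)
    return results


# Illustrative fragment with one E -> K substitution at position 5.
print(enumerate_variants("PGRGEPRF", [(5, "E", "K")]))
```

The number of sequences doubles with each independent substitution, so full enumeration can grow a search database quickly.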


Swiss-Prot Variant Annotations

Feature viewer

Variants


Swiss-Prot VarSplic Output

P13746-00-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-01-01-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-00-00-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-00-03-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-01-03-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-00-04-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF

P13746-01-04-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF

P13746-00-05-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-01-05-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-01-00-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-00-02-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

P13746-01-02-00 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF

******************************************:*****************


Swiss-Prot VarSplic Output

P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-00-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-01-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ

P13746-01-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ

*************************************      *******:*********


Omnibus Database Redundancy Elimination

• Source databases often contain the same sequences with different descriptions

• Omnibus databases keep one copy of the sequence, and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source preference

• Good definitions can be lost, including taxonomy


Omnibus Database Redundancy Elimination

NCBI’s nr: Keeps all descriptions, separated by ^A

MSDB: Pecking order: PIR1-4, TrEMBL, GenBank, Swiss-Prot, NRL3D

IPI: All accessions, one description
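The effect of a source pecking order can be sketched in a few lines: identical sequences are collapsed and the description from the most preferred source is kept. The function name, the source labels and the placeholder sequences are assumptions for illustration, and this is not a description of how any of these databases is actually built.

```python
def merge_redundant(entries, preference):
    """Keep one (source, description) per distinct sequence, by source rank.

    entries: iterable of (source, description, sequence).
    preference: sources from most to least preferred.
    """
    rank = {source: index for index, source in enumerate(preference)}
    lowest = len(rank)  # unknown sources rank last
    best = {}
    for source, description, sequence in entries:
        current = best.get(sequence)
        if current is None or rank.get(source, lowest) < rank.get(current[0], lowest):
            best[sequence] = (source, description)
    return best


# Placeholder sequences; descriptions echo the kind of labels seen in practice.
ENTRIES = [
    ("GenBank", "hypothetical protein", "MKT"),
    ("Swiss-Prot", "COMM domain containing protein 4", "MKT"),
]

print(merge_redundant(ENTRIES, ["PIR", "TrEMBL", "GenBank", "Swiss-Prot", "NRL3D"]))
print(merge_redundant(ENTRIES, ["Swiss-Prot", "GenBank"]))
```

With the first ordering the curated description is discarded in favour of "hypothetical protein", which is the loss the slides warn about.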


Description Elimination

• gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]

• gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]

• gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]

• gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]

• gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4

• gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
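When all descriptions are kept, as in nr, a merged header can be split back into its source entries so that the most informative description can be chosen. The sketch assumes the entries are joined by the Ctrl-A character (`\x01`) and that each starts with `gi|number|database|accession|`; the function names are hypothetical. Taking a trailing `[...]` as the organism is a heuristic that will misfire on headers ending in other bracketed text.

```python
import re


def parse_defline(defline):
    """Split one gi-style defline into identifier fields, description, organism."""
    identifier, _, rest = defline.partition(" ")
    fields = identifier.split("|")
    match = re.search(r"\s*\[([^\[\]]+)\]\s*$", rest)  # trailing [organism], if any
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "description": (rest[:match.start()] if match else rest).strip(),
        "organism": match.group(1) if match else None,
    }


def split_nr_header(header):
    """Split a merged header on Ctrl-A and parse each part."""
    return [parse_defline(part) for part in header.split("\x01")]


HEADER = (
    "gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]"
    "\x01"
    "gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4"
)

for entry in split_nr_header(HEADER):
    print(entry)
```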


Description Elimination

• gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]

• gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]

• gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site

• gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site

• gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)

• gi|1585500|prf||2201313A UDP galactose 4'-epimerase


Description Elimination

• gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]

• gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I [validated] – human

• gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog [human, liver, Peptide, 323 aa]

• gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)

• gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]

• gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]


DNA to Protein Sequence

Derived from http://online.itp.ucsb.edu/online/infobio01/burge


Translated sequences

• Gene models describe introns and exons
  • Start site?
  • Splice sites?
  • Alternative splicing?
• ESTs provide limited evidence of transcription only
• There is a lot we don’t know about what protein sequences result from a gene
• Recent revisions of the number of human genes suggest a bigger role for alternative splicing.
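The dependence of a translated sequence on its gene model can be shown with a toy example: the same genomic DNA gives different proteins under different exon choices. The sketch assumes the standard genetic code, forward-strand exons given as 0-based half-open coordinates, and a made-up genomic string; the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_gene_model(genomic, exons):
    """Splice the exons together and translate up to the first stop codon."""
    coding = "".join(genomic[start:end] for start, end in exons).upper()
    protein = []
    for i in range(0, len(coding) - 2, 3):
        amino_acid = CODON_TABLE[coding[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


GENOMIC = "ATGGCCgtaagtTTTAAGTGA"  # made-up sequence; lowercase marks an intron

print(translate_gene_model(GENOMIC, [(0, 6), (12, 21)]))  # one gene model
print(translate_gene_model(GENOMIC, [(0, 3), (12, 21)]))  # a shorter first exon
```

Every uncertain start site or splice site in a predicted gene model translates directly into uncertainty about the peptides a search can match.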


Genome Browsers

• Link genomic, transcript, and protein sequence in a graphical manner
  • Genes, ESTs, SNPs, cross-species, etc.
• UC Santa Cruz
  • http://genome.ucsc.edu
• Ensembl
  • http://www.ensembl.org
• NCBI Map View
  • http://www.ncbi.nlm.nih.gov/mapview


UCSC Genome Browser

• Shows many sources of protein sequence evidence in a unified display

• Can use EST accession as a location!


Summary

• Protein sequence databases should be interpreted with as much care as mass spectra

• Use controlled vocabularies
• Understand the structure of ontologies
• Take advantage of computational predictions
• Look for sequence variants
• Be careful with omnibus databases