Journal of Biotechnology 124 (2006) 629–639
Review
Bioinformatics database infrastructure for biotechnology research
Eleanor J. Whitfield ∗, Manuela Pruess, Rolf Apweiler
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambs CB10 1SD, UK
Received 14 October 2005; received in revised form 6 March 2006; accepted 3 April 2006
Abstract
Many databases are available that provide valuable data resources for the biotechnological researcher. According to their core data, they can be divided into different types. Some databases provide primary data, like all published nucleotide sequences, others deal with protein sequences. In addition to these two basic types of databases, a huge number of more specialized resources are available, like databases about protein structures, protein identification, special features of genes and/or proteins, or certain organisms. Furthermore, some resources offer integrated views on different types of data, allowing the user to do easy customised queries over large datasets and to compare different types of data.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Bioinformatics; Nucleic acid database; Protein database; Genomics database; Proteome database
Contents
1. Introduction
2. Nucleotide sequence databases
    2.1. EMBL/DDBJ/GenBank
    2.2. RefSeq
    2.3. Ensembl
    2.4. Genome reviews
3. Protein sequence databases
    3.1. GenPept
    3.2. Entrez protein
    3.3. UniProt
4. Specialized databases
    4.1. Model organism databases
5. Protein identification databases
    5.1. GO/GOA
    5.2. IntAct
    5.3. SWISS-2DPAGE
    5.4. PRIDE
    5.5. ChEBI
6. Structure databases
    6.1. Protein data bank
    6.2. Cambridge structural database
    6.3. RESID
7. Special features databases
    7.1. IntEnz
    7.2. TRANSFAC
    7.3. EPD
    7.4. IMGT
8. Integrated and comparative databases
    8.1. InterPro
    8.2. International protein index
    8.3. Integr8
9. Discussion
References
∗ Corresponding author. Tel.: +44 1223494680; fax: +44 1223494468.
Email address: [email protected] (E.J. Whitfield).
0168-1656/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi: 10.1016/j.jbiotec.2006.04.006
1. Introduction
If you knew that thujone, a terpenoid found in wormwood oil, gave absinthe, an emerald green liquor, its particular flavour and was the active component of its claimed toxicity (Höld et al., 2000), you might want to investigate its ability to bind to the gamma-aminobutyric acid A (GABA-A) receptors in our brain (Kash et al., 2004), which can bring on a number of brain disorders.
To do this, you can query the many databases that are available in the public domain, provided by academic, bioinformatics and non-academic institutes. They range from simple sequence repositories with broad domains of interest, storing data with little or no manual intervention and therefore minimal detail, to expertly curated databases that cover all sequenced species and in which the original sequence data is enhanced by the manual annotation of further information. While all databases strive for complete coverage within their chosen scope, the domain of interest for some users can transcend those of individual resources. This may reflect the user's wish to combine different types of information, or the inability of a single resource to contain the full details of every query. It is important to provide the users of biomolecular databases with a degree of integration between these resources, as by nature they are all connected in a scientific sense and each one of them provides data important to understanding biological complexity.
2. Nucleotide sequence databases
Primary nucleotide sequence databases are essential to provide sequences to the user as quickly as possible. These databases add little or no additional information to the sequence records they contain. The redundancy of data and the fact that EMBL/DDBJ/GenBank records cannot be updated, corrected or amended without the permission of the original submitter have led to the creation of several secondary nucleotide sequence databases. These databases augment the annotation of completely sequenced genomes and are continually being developed and improved, providing the user with accurate and maintained datasets.
2.1. EMBL/DDBJ/GenBank
The International Nucleotide Sequence Database (INSD) collaboration provides a primary nucleotide sequence repository for the public domain. It is an archive database that accepts submissions from a number of sources, including individual researchers, genome sequencing projects and patent applications, and updates to entries from the original submitters. INSD is a joint effort of three partner databases: the DNA Data Bank of Japan (DDBJ) (Tateno et al., 2005) at the National Institute of Genetics (NIG), the EMBL Nucleotide Sequence Database (EMBL) (Kanz et al., 2005) at the European Bioinformatics Institute (EBI) (Brooksbank et al., 2005) and GenBank (Benson et al., 2005) at the National Center for Biotechnology Information (NCBI) (Wheeler et al., 2005). The three organisations synchronize their data on a daily basis to ensure worldwide coverage of all nucleotide sequence entries. EMBL/GenBank/DDBJ records include individual genes, whole genomes, RNA, third-party annotation, expressed sequence tags, high-throughput cDNAs and synthetic sequences. Large-scale genomic sequencing has led to the exponential growth of these repositories, with 59,828,564 records and 109,825,661,925 nucleotides in EMBL release 84 of September 2005. Due to its completeness and standing as a primary data provider, EMBL/DDBJ/GenBank is the initial source for many molecular biology databases. Protein sequence entries can be derived from the translation of the coding sequence annotation in a nucleotide entry.
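The derivation of a protein sequence from a coding sequence annotation can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a single uninterrupted coding region on the forward strand; the function name, the example sequence and its coordinates are hypothetical and are not taken from any of the databases described here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence, start, end):
    """Translate the region start..end (1-based, inclusive) of a nucleotide sequence."""
    cds = sequence[start - 1:end].upper()
    protein = []
    # Walk the region codon by codon, ignoring any incomplete trailing codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":  # a stop codon ends the protein product
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical nucleotide entry with one coding region annotated at positions 4..15.
entry_sequence = "ccgatggcctggtgataa"
print(translate_cds(entry_sequence, 4, 15))  # prints MAW
```

Real entries may describe coding regions on the reverse strand, split across several exons or using a different translation table, none of which this sketch handles.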
2.2. RefSeq
The Reference Sequence (RefSeq) collection (Pruitt et al., 2005) aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. RefSeq is based on data derived from EMBL/DDBJ/GenBank supplemented by additional sets of curated or predicted data in organisms of particular scientific interest. The aims of the RefSeq collection include explicitly linked nucleotide and protein sequences, updates to reflect current knowledge of sequence data and biology, and ongoing curation by NCBI staff and collaborators, with review status indicated on each record. However, most of the entries are automatically generated without any manual intervention or annotation, so this database should still be viewed mainly as a sequence repository. Release 13 of September 2005 includes 1,899,454 proteins covering 3060 organisms.
2.3. Ensembl
An example of a secondary nucleotide sequence genomic database is Ensembl (Hubbard et al., 2005). This is a joint project of the EBI and the Wellcome Trust Sanger Institute (WTSI) to develop a software system that produces and maintains fast automatic annotation of raw genomic sequence on selected metazoan genomes. Ensembl is a comprehensive source of stable annotation. Genes are annotated on evidence derived from known protein, cDNA and EST sequences. Novel genes are determined by the gene build system, which incorporates a wide range of methods including ab initio gene predictions, homology and gene prediction HMMs. All genes can be visualised in the context of the genome, mapping genes to transcripts to proteins. Data is augmented with alternative transcript and protein splice patterns, dbSNP data and DAS tracks of external databases. The gene build pipeline is constantly being developed to improve predictions and generates regular new versions of the genomes. Release 34 in September 2005 includes 8 mammalian species, 6 chordates and 5 other eukaryotes.
The Distributed Annotation System (DAS) (Dowell et al., 2001) specification was originally designed to allow the feature data for biological molecules to be served in relation to a genomic sequence. The DAS server system conceptually consists of a reference server, providing sequence data and its annotations, and an annotation server, providing coordinates for each feature and indicating a suitable DAS reference server from which the corresponding sequence can be obtained. A DAS client is an application that is able to connect to at least one reference server and any number of annotation servers and merge the information from these servers in a unified display.
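The division of labour between reference server, annotation servers and client can be sketched in Python as follows. This is only an illustration of the merging idea: the servers are hypothetical in-memory dictionaries, no network requests are made, and none of the names below belong to the DAS specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    source: str  # name of the annotation server that supplied the feature
    label: str
    start: int   # 1-based, inclusive coordinates on the reference sequence
    end: int


# Hypothetical stand-ins for one reference server and two annotation servers.
REFERENCE_SERVER = {"chr_demo": "ATGGCCTGGTGATAAACGT"}
ANNOTATION_SERVERS = {
    "genes": {"chr_demo": [("gene_a", 1, 12)]},
    "variants": {"chr_demo": [("snp_1", 5, 5), ("snp_2", 16, 16)]},
}


def merged_view(segment, start, end):
    """Fetch a sequence window and overlay the features from every annotation server."""
    sequence = REFERENCE_SERVER[segment][start - 1:end]
    features = []
    for name, server in ANNOTATION_SERVERS.items():
        for label, f_start, f_end in server.get(segment, []):
            # Keep only the features that overlap the requested window.
            if f_start <= end and f_end >= start:
                features.append(Feature(name, label, f_start, f_end))
    features.sort(key=lambda f: (f.start, f.end))
    return sequence, features


sequence, features = merged_view("chr_demo", 1, 12)
print(sequence)
for f in features:
    print(f.start, f.end, f.source, f.label)
```

Because every feature carries coordinates on the same reference sequence, the client can combine servers that know nothing about each other.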
The Vertebrate Genome Annotation (Vega) database (Ashurst et al., 2005) is another genomic database: a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes, including human, mouse, dog and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. The University of California Santa Cruz (UCSC) Genome Browser (Karolchik et al., 2003) provides access to the reference sequence and working draft assemblies for a large collection of genomes. The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide. The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways. The NCBI’s Entrez Map Viewer (Wheeler et al., 2005) provides special browsing capabilities for a subset of organisms in Entrez genomes. Map Viewer allows you to view and search an organism’s complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest.
2.4. Genome reviews
The goal of the Genome Reviews (Kersey et al., 2005) project is to provide an up-to-date, standardised and comprehensively annotated view of the genomic sequence of organisms with completely deciphered genomes. Each Genome Review represents an enhanced version of the original sequence of a complete chromosome or plasmid, with additional annotation imported from data sources that include the UniProt knowledgebase, GO, the GOA (GO Annotation) project, InterPro, and HoGenom. Cross-references to 18 databases are also provided. Annotations used inconsistently among the original submissions have been standardised (for example rRNA and tRNA annotations) and deleted in cases where the coverage is low, making it easier to compare data across several genomes. The data is completely synchronised with the fortnightly UniProtKB releases, and evidence tags are attached to most feature qualifiers, indicating the primary source of the information. Release 36 of September 2005 has 239 complete prokaryote genomes, represented in 420 entries. Further releases will see the addition of eukaryote genomes, the first being Saccharomyces cerevisiae, better representation of pseudogenes, and computational analysis to identify ncRNAs.
3. Protein sequence databases
Several protein sequence databases act as repositories of protein sequences and, like the primary nucleotide sequence databases, these are essential to provide the sequences to the user as quickly as possible. These databases add little or no additional information to the sequence records they contain and generally make no effort to provide a non-redundant collection of sequences to users. When additional information is annotated to a sequence, this greatly increases the value of the resource for users. Expert biologists validate such curated data before it is added to the databases, to ensure that the data in these collections is highly reliable. There is also a large effort invested in maintaining non-redundant datasets by compiling all reports for a given protein sequence into a single record.
3.1. GenPept
The GenBank Gene Products databank (GenPept) (Wheeler et al., 2005) is produced by the NCBI. Entries in the database are derived from translations of the coding sequences contained in the collaborative nucleotide database and contain minimal annotation. The annotation in a GenPept entry has been extracted from the corresponding nucleotide entry, and the database does not contain proteins derived from amino acid sequencing. The database is redundant, as multiple records may represent each protein; no attempt is made to group these records into a single database entry.
3.2. Entrez protein
Entrez protein, a sequence repository also produced by NCBI, is compiled from a variety of sources. It contains sequence data from translations of the coding sequences contained in the collaborative nucleotide database as well as protein sequences submitted to the Protein Information Resource (PIR), UniProtKB/Swiss-Prot, the Protein Research Foundation (PRF) and the Protein Data Bank (PDB). Additional information is present where it has been extracted from manually curated databases such as UniProtKB/Swiss-Prot. As with GenPept, the sequence collection is redundant.
3.3. UniProt
The Universal Protein Resource (UniProt) (Bairoch et al., 2005) is a comprehensive catalogue of data on protein sequence and function, maintained by the UniProt consortium. The consortium is a collaboration of the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and the Protein Information Resource (PIR). UniProt comprises three components.
Firstly, the expertly curated UniProt Knowledgebase (UniProtKB), which continues the work of UniProtKB/Swiss-Prot, UniProtKB/TrEMBL (Boeckmann et al., 2003) and PIR (Wu et al., 2003). UniProtKB/Swiss-Prot is a manually annotated database with information extracted from literature and curator-evaluated computational analysis. It contains a minimal level of redundancy and a high level of integration with other databases. UniProtKB/TrEMBL contains the translations of all coding sequences present in the collaborative nucleotide database and also protein sequences extracted from the literature or submitted to UniProtKB. Entries are enriched with automated classification and annotation while they await full manual annotation. PIR produced the Protein Sequence Database (PSD) of functionally annotated protein sequences, which grew out of the Atlas of Protein Sequence and Structure (1965–1978) edited by Margaret Dayhoff. PIR-PSD is now an archive database, as all sequences and annotations have been integrated into UniProtKB.
Secondly, the UniProt archive (UniParc), into which new and updated sequences are loaded on a daily basis. UniParc (Leinonen et al., 2004) is a comprehensive repository of protein sequences, providing a mechanism by which the historical association of database records and protein sequences can be tracked. It is non-redundant at the level of sequence identity, but may contain semantic redundancies.
Thirdly, the non-redundant UniProt Reference clusters (UniRef), which provide non-redundant reference data collections based on the UniProt knowledgebase in order to obtain complete coverage of sequence space at several resolutions: 100, 90 and 50% sequence similarity.
Updates of UniProt are publicly available on a biweekly schedule. UniProt Release 6.1 consists of UniProtKB/Swiss-Prot Protein Knowledgebase Release 48.1 of 27 September 2005 (195,058 sequence entries, comprising 70,674,903 amino acids abstracted from 134,132 references) and UniProtKB/TrEMBL Protein Database Release 31.1 of 27 September 2005 (2,105,517 sequence entries comprising 680,464,593 amino acids).
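Non-redundancy at the level of sequence identity, combined with tracking of the source records, can be pictured with a small Python sketch. It is a toy model and not the UniParc implementation: the class, the database names and the accessions are hypothetical, and the use of SHA-256 as the sequence key is an assumption made for this example.

```python
import hashlib
from collections import defaultdict


class SequenceArchive:
    """Toy archive: one record per distinct sequence, many source cross-references."""

    def __init__(self):
        self._sequences = {}            # key -> sequence
        self._xrefs = defaultdict(set)  # key -> {(database, accession)}

    def load(self, database, accession, sequence):
        normalised = sequence.strip().upper()
        # SHA-256 is an assumption of this sketch, chosen only to key identical sequences.
        key = hashlib.sha256(normalised.encode("ascii")).hexdigest()
        self._sequences[key] = normalised
        self._xrefs[key].add((database, accession))
        return key

    def record_count(self):
        return len(self._sequences)

    def sources(self, key):
        return sorted(self._xrefs[key])


archive = SequenceArchive()
k1 = archive.load("db_one", "ACC0001", "MAWKL")
k2 = archive.load("db_two", "XYZ_42", "mawkl")   # same sequence from another source
k3 = archive.load("db_one", "ACC0002", "MAWKV")  # differs by one residue
print(archive.record_count())  # prints 2
print(archive.sources(k1))
```

Two records with the same sequence but conflicting descriptions would still share one entry here, which is the kind of semantic redundancy that sequence identity alone cannot resolve.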
4. Specialized databases
4.1. Model organism databases
The Human Genome Organisation (HUGO) is the international organisation of scientists involved in human genetics, established in 1989 to promote and sustain international collaboration in the field. As part of HUGO, the Human Gene Nomenclature Committee (HGNC) maintains Genew, a database of approved human gene names and symbols (Wain et al., 2002). Their current priority is assigning nomenclature to genes submitted from the Human Genome Project; symbols for over 20,000 genes are approved. Scientists, journals and databases also request individual new symbols. HGNC-approved symbols are used by many databases including UniProt, ensuring common nomenclature across all human data.
Similar gene-centric databases for model organisms are available. The Mouse Genome Informatics (MGI) resource (Eppig et al., 2005) provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse. FlyBase (Drysdale and Crosby, 2005) and WormBase (Chen et al., 2005) are comprehensive databases for information on the genetics and molecular biology of Drosophila and Caenorhabditis, respectively. The Rat Genome Database (RGD) (Twigger et al., 2002) curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.
5. Protein identification databases
5.1. GO/GOA
GO (Gene Ontology Consortium, 2004) provides three structured controlled vocabularies, describing the molecular function, biological roles and cellular locations of gene products. The dynamic controlled vocabulary can be applied to all organisms, even while knowledge of gene and protein roles in cells is still accumulating and changing. Many resources have adopted GO, facilitating the integration of annotation and encouraging the development of many similar projects in other domains. A number of these projects can be accessed through the Open Biological Ontologies website.
The GOA project (GO Annotation) (Camon et al., 2004) is a combination of electronic mappings and manual curation assigning GO terms to all complete and incomplete proteomes that exist in UniProtKB. Widespread annotation of GO terms to protein products by many resources helps to promote the integration of annotation across databases, supplementing the use of standard names with the use of standard annotation vocabularies.
Monthly GOA releases provide the GO assignments to UniProtKB, and individual files of GO assignments to 272 non-redundant proteome sets for complete genomes are available. The September release provides automatic and manual annotation of GO to 93,192 species, with 205,928 PubMed references cross-referenced.
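One practical consequence of a structured vocabulary is that a protein annotated with a specific term can also be retrieved through any of the broader terms above it. The Python sketch below shows this with a miniature hierarchy; the term identifiers, the parent links and the protein accessions are all hypothetical and are not real GO or UniProtKB identifiers.

```python
# Hypothetical miniature vocabulary: child term -> parent terms.
# A term may have several parents, so the structure is a directed acyclic graph.
PARENTS = {
    "TERM:root": [],
    "TERM:binding": ["TERM:root"],
    "TERM:catalysis": ["TERM:root"],
    "TERM:ion_binding": ["TERM:binding"],
    "TERM:metalloenzyme_activity": ["TERM:catalysis", "TERM:ion_binding"],
}

# Hypothetical direct annotations: protein accession -> assigned terms.
ANNOTATIONS = {
    "PROT_1": {"TERM:metalloenzyme_activity"},
    "PROT_2": {"TERM:binding"},
}


def ancestors(term):
    """Return every term reachable by following parent links from the given term."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def proteins_for(term):
    """Proteins annotated with the term directly or through a more specific term."""
    return sorted(
        protein
        for protein, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(proteins_for("TERM:binding"))    # prints ['PROT_1', 'PROT_2']
print(proteins_for("TERM:catalysis"))  # prints ['PROT_1']
```

A query for a broad term therefore gathers annotations made at different levels of detail by different resources.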
5.2. IntAct
IntAct (Hermjakob et al., 2004) is an open source protein interaction database, repository and analysis system. The repository is populated with data from project partners and curated literature data. It provides both textual and graphical representations of protein interactions, maintains annotation standards by intensive use of controlled vocabularies to ensure data consistency, and allows exploration of interaction networks using GO annotations of the interacting proteins. The September release has nearly 36,000 proteins and nearly 53,000 interactions imported from the literature and manually curated. These are searchable and viewable using an interactive graphical web application of protein networks.
IntAct is a member of the IMEx consortium, with BIND, DIP, MINT and MIPS. User-submitted data will be exchanged between partners to provide a network of stable, comprehensive resources of interaction data.
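Exploring an interaction network amounts to walking a graph whose nodes are proteins and whose edges are reported interactions. The Python sketch below builds such a graph from binary interactions and lists the proteins reachable within a given number of steps. The identifiers are placeholders and the approach is a generic breadth-first search, not a description of how IntAct itself is implemented.

```python
from collections import defaultdict, deque

# Hypothetical binary interactions; identifiers are placeholders, not real accessions.
INTERACTIONS = [("P_A", "P_B"), ("P_B", "P_C"), ("P_C", "P_D"), ("P_E", "P_F")]


def build_network(pairs):
    """Build an undirected adjacency map from a list of interacting pairs."""
    network = defaultdict(set)
    for left, right in pairs:
        network[left].add(right)
        network[right].add(left)
    return network


def within_steps(network, start, max_steps):
    """Breadth-first search for proteins at most max_steps interactions from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        protein = queue.popleft()
        if distance[protein] == max_steps:
            continue  # do not expand beyond the requested radius
        for partner in network[protein]:
            if partner not in distance:
                distance[partner] = distance[protein] + 1
                queue.append(partner)
    del distance[start]
    return distance


network = build_network(INTERACTIONS)
print(sorted(within_steps(network, "P_A", 2).items()))  # prints [('P_B', 1), ('P_C', 2)]
```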
5.3. SWISS-2DPAGE
Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) and Sodium Dodecyl Sulfate PAGE (SDS-PAGE) experiments distribute proteins in a gel-based system on the basis of molecular weight and charge, providing protein expression data. SWISS-2DPAGE (Hoogland et al., 2004) stores the results of such experiments and adds a variety of cross-references to other 2D PAGE databases and to UniProtKB/Swiss-Prot. A SWISS-2DPAGE entry also contains images of the gels and textual information such as physiology, mapping procedures, experimental data and references. Release 17.3 of March 2004, with updates up to 08 April 2005, contains 1265 entries in 36 reference maps from human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, Escherichia coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315). The human and mouse 2D PAGE databases at the Danish Centre for Human Genome Research are intended to aid functional genome analysis in health and disease. The information from each gel is stored as its own database, accessible through an interactive image of the gel itself.
5.4. PRIDE
The PRoteomics IDEntifications (PRIDE) database (Martens et al., in press) is a centralized, standards-compliant, public data repository for proteomics data. It has been developed to provide the proteomics community with a public repository for protein and peptide identifications together with the evidence supporting these identifications. PRIDE has been developed through a collaboration of the EBI and Ghent University in Belgium. The original motivation behind its development was to provide a common data exchange format and repository to support proteomics literature publications. This remit has grown with PRIDE, with the hope that it will provide a reference set of tissue-based identifications for use by the community. The future development of PRIDE has become closely linked to HUPO PSI. Release 2.0 in July 2005 includes a new and richer XML schema.
5.5. ChEBI
Chemical Entities of Biological Interest (ChEBI), available at the European Bioinformatics Institute (EBI), is a freely available dictionary of ‘small molecular entities’. ChEBI uses nomenclature, symbolism and terminology endorsed by the International Union of Pure and Applied Chemistry (IUPAC) and the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). The term ‘molecular entity’ encompasses any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms. ChEBI is organised in an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. Release 13 in July 2005 contains 5549 curated compounds.
6. Structure databases
6.1. Protein data bank
The worldwide Protein Data Bank (wwPDB) (Berman et al., 2000) began in 1972 and is the single worldwide repository for the processing and distribution of three-dimensional structures for proteins, nucleic acids and carbohydrates, holding over 32,500 structures as of September 2005. It is a collaboration of the Research Collaboratory for Structural Bioinformatics (RCSB), the Macromolecular Structural Database (MSD-EBI), and the Protein Data Bank of Japan (PDBj). The protein structures in the database are from X-ray crystallography and solution nuclear magnetic resonance (NMR) experiments.
The archive’s growth has been accompanied by increases in both data content and the structural complexity of individual entries. A further acceleration is expected due to developments in high-throughput structural determination methodologies and worldwide structural genomics efforts, with an estimated tripling or quadrupling in size over the next 5 years. This has led to PDB completely overhauling its submission and browsing facilities in order to be able to respond appropriately.
6.2. Cambridge structural database
The Cambridge Structural Database (CSD) (Allen, 2002), principal product of the Cambridge Crystallographic Data Centre, is a repository of small-molecule crystal structures. The release of January 2005 contained over 335,200 records of organic molecules and metal-organic compounds, with no polypeptide or polysaccharide larger than 24 units. Most three-dimensional structures were determined using either X-ray or neutron diffraction. CSD records results of single-crystal studies and powder diffraction studies which yield 3D atomic coordinate data for at least all non-H atoms. Crystal structure data is captured from publications in the open literature and private communications to the CSD (via direct data deposition).
6.3. RESID
The RESID Database of Protein Modifications (Garavelli, 2004) is a comprehensive collection of annotations and structures for protein modifications including amino-terminal, carboxyl-terminal and peptide chain cross-link post-translational modifications. Release 42.00 in June 2005 contains 384 entries for predicted or observed co- or post-translational modifications of the 23 encoded alpha-amino acids. Of these modifications, 317 are annotated in UniProtKB. In addition to structural information, each record includes systematic and alternate names, atomic formulae and masses, enzyme activities generating the modifications, 3D models and structures, cross-references (including GO and ChEBI) and UniProt feature table annotations.
7. Special features databases
7.1. IntEnz
IntEnz (Fleischmann et al., 2004), the Integrated relational Enzyme database, is the most up-to-date version of the Enzyme Nomenclature created under the auspices of the Nomenclature Committee of the IUBMB. The goal of IntEnz is to incorporate data from the NC-IUBMB Enzyme Classification list, the Enzyme Nomenclature database (ENZYME) (Bairoch, 2000), and the Braunschweig Enzyme Database (BRENDA) of enzyme function (Schomburg et al., 2004). Release 13 of IntEnz, in August 2005, contains records for every enzyme with an EC number. Each record stores recommended and alternative names, catalytic activity, cofactors, disease information, and cross-references with UniProtKB. ENZYME is a repository of information relative to the nomenclature of enzymes; Release 38 of September 2005, with updates up to 26 September 2005, contains 4563 entries. BRENDA provides similar records, with a breakdown by species for reactions, activities, cofactors, inhibitors, and substrates.
7.2. TRANSFAC
TRANSFAC (Matys et al., 2003) is a database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors, covering the whole range from yeast to human. Data is extracted from the original literature, but in the long term it is hoped that a direct submission system will be available. Regulatory sites in the individual genes are mapped so they can be positioned on the genome as a whole. A tool has been developed for the identification of regulatory elements in newly sequenced genomes. Release 6.0 has 6627 transcription factor binding sites and 1755 genes (1725 have sites annotated).
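To give a feel for what identifying regulatory elements in a new sequence involves, the Python sketch below scans a sequence for matches to a consensus written with IUPAC ambiguity codes. This is a deliberately simple stand-in and not the TRANSFAC tool, whose method is not described here; the consensus and the test sequence are hypothetical.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence, consensus):
    """Return 1-based start positions where the consensus matches, overlaps included."""
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    # The lookahead keeps overlapping matches that a plain search would skip.
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", sequence.upper())]


# Hypothetical consensus and sequence, chosen only to exercise the function.
print(find_sites("ggTATAAAgcTATATAtt", "TATAWA"))  # prints [3, 11]
```

A consensus string treats every position as equally important; scoring schemes that weight positions differently would need a different representation than the one assumed here.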
7.3. EPD
The Eukaryotic Promoter Database (EPD) (Schmid et al., 2004) was designed and developed at the Weizmann Institute of Science in Rehovot (Israel) and is currently maintained at ISREC in Epalinges/Lausanne (Switzerland). EPD is a specialized annotation database based on the EMBL Data Library, providing information about eukaryotic promoters extracted from scientific literature or, starting from release 73, compiled by a new in silico primer extension method. Release 83 was made available in July 2005.
7.4. IMGT
The international ImMunoGeneTics (IMGT) Project (Lefranc et al., 2005) maintains a high-quality integrated knowledge resource specialized in immunoglobulins, T cell receptors, major histocompatibility complex (MHC), immunoglobulin superfamily and related proteins of the immune system of human and other vertebrate species. The collaborative nucleotide database entries fitting these categories are retrieved and annotated to a high standard. IMGT consists of sequence databases (IMGT/LIGM-DB, a comprehensive database of immunoglobulins and T cell receptors from human and other vertebrates, with translation for fully annotated sequences; IMGT/MHC-DB; and IMGT/PRIMER-DB), a genome database (IMGT/GENE-DB), a structure database (IMGT/3Dstructure-DB), Web resources (IMGT Marie-Paule page) and interactive tools.
8. Integrated and comparative databases
8.1. InterPro
Possible DNA coding regions can be identified by similarity to previously characterised genes. Assigning a biological function to a coding region can be a complicated process, which cannot always be achieved by sequence similarity searches. Protein sequence comparisons often provide the first clues to the structure and function of novel proteins, as functional constraints are known to persist in evolution. Protein domain signature databases are available for identifying distant relationships between novel sequences and known protein families. InterPro (Mulder et al., 2005) is an integrated resource of protein families, domains and functional sites which amalgamates the efforts of its member databases, currently PROSITE (Hulo et al., 2004), PRINTS (Attwood et al., 2004), Pfam (Bateman et al., 2004), ProDom (Bru et al., 2005), SMART (Letunic et al., 2004), TIGRFAMs (Haft et al., 2003), PIR SuperFamilies (Huang et al., 2003), SUPERFAMILY (Gough and Chothia, 2002), PANTHER (Mi et al., 2005) and Gene3D (Pearl et al., 2005). InterProScan (Quevillon et al., 2005) combines the protein recognition methods and scanning tools of each member database into one searching resource, unifying the strengths of the individual signature methods to give the best prediction of protein domains for a query translation. In the absence of biochemical characterisation of a protein, domain predictions can be a good guide to protein function.
Release 11.0 of July 2005 contains 12,294 entries, representing 3240 domains, 8753 families, 230 repeats, 29 active sites, 21 binding sites and 21 post-translational modification sites.
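As a quick consistency check, the per-type counts quoted for Release 11.0 can be summed and compared with the stated total. The snippet below is only a tally of the figures given in the text; the dictionary keys are labels chosen here for illustration.

```python
# Entry counts by type, as quoted in the text for InterPro Release 11.0.
release_11_counts = {
    "domains": 3240,
    "families": 8753,
    "repeats": 230,
    "active sites": 29,
    "binding sites": 21,
    "post-translational modification sites": 21,
}

stated_total = 12294
computed_total = sum(release_11_counts.values())

# Share of each entry type, rounded to one decimal place.
shares = {
    name: round(100 * count / computed_total, 1)
    for name, count in release_11_counts.items()
}

print(computed_total == stated_total)
```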
8.2. International protein index
Despite the complete determination of the genome sequence of several higher eukaryotes, their proteomes remain relatively poorly defined. Information about proteins identified by different experimental and computational methods is stored in different databases, meaning that no single resource offers full coverage of known and predicted proteins. The International Protein Index, IPI (Kersey et al., 2004), has been developed to address these issues and offers complete non-redundant data sets representing the human, mouse, rat, zebrafish, Arabidopsis and chicken proteomes, built from the UniProtKB, Ensembl and RefSeq databases. Each IPI entry represents a cluster of entries from the source databases believed to represent the same protein. One difficulty in creating IPI is that there is no absolute way of telling whether two entries in molecular biology databases represent different biological entities or the same entity rendered differently owing to in silico or experimental artefacts. To assemble IPI data sets, an automatic and pragmatic approach is used: clusters are built by combining knowledge already present in the primary data sources (and in the cross-references between them) with the results of protein sequence similarity comparisons. After a cluster is assembled, a master entry is chosen from among the cluster members, which supplies the IPI entry with its sequence and annotation. Finally, an identifier is chosen for each cluster.
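The three steps described above (build clusters, choose a master entry, assign an identifier) can be sketched with a union-find structure. This is an illustration under stated assumptions, not the IPI pipeline: the function names, the toy entry names and sequences, the position-wise identity measure, the 0.95 threshold, the rule that a cross-reference is accepted only when the similarity check agrees, and the master-entry rule (longest sequence, ties broken by entry name) are all hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Return the cluster representative of x, shortening paths on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the clusters containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def identity(s1, s2):
    """Toy similarity: fraction of identical positions; 0.0 if lengths differ."""
    if len(s1) != len(s2) or not s1:
        return 0.0
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


def build_clusters(sequences, xrefs, threshold=0.95):
    """Cluster entries linked by a cross-reference that the sequences support."""
    parent = {entry: entry for entry in sequences}
    for a, b in xrefs:
        if a in parent and b in parent:
            if identity(sequences[a], sequences[b]) >= threshold:
                union(parent, a, b)
    groups = defaultdict(list)
    for entry in sequences:
        groups[find(parent, entry)].append(entry)
    clusters = []
    for i, members in enumerate(sorted(groups.values(), key=min), start=1):
        members = sorted(members)
        # Hypothetical master rule: longest sequence, first name wins ties.
        master = max(members, key=lambda e: len(sequences[e]))
        clusters.append({"id": f"CLUSTER{i:05d}", "master": master, "members": members})
    return clusters


# Toy input: three entries from three made-up sources.
sequences = {
    "SRC_A:1": "MKTAYIAKQR",
    "SRC_B:7": "MKTAYIAKQR",
    "SRC_C:3": "MSTNPKPQRK",
}
xrefs = [("SRC_A:1", "SRC_B:7"), ("SRC_B:7", "SRC_C:3")]
clusters = build_clusters(sequences, xrefs)
for cluster in clusters:
    print(cluster["id"], cluster["master"], cluster["members"])
```

In this toy run the first cross-reference links two identical sequences and is accepted, while the second links dissimilar sequences and is ignored, so the third entry forms its own cluster. A production system would replace `identity` with a proper alignment-based comparison.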
8.3. Integr8
Integr8 (Kersey et al., 2005) is a browser for information relating to completed genomes and proteomes, based on data contained in Genome Reviews, UniProtKB proteome sets and IPI. It provides access to species descriptions, recent literature, a detailed statistical overview of the genome and proteome, and summary information about each complete proteome. Data from a variety of sources, including InterPro, CluSTr and GO, are integrated. Integr8 can be used to identify putative paralogs and orthologs, as well as potential regions of synteny between organisms. The relationships of genes in the context of their genomic neighbours can be viewed, as well as the transcripts and proteins they encode. The Inquisitor tool allows users to determine whether their protein sequence is available in Integr8 and, if it is not, provides its protein domain architecture and identifies a known sequence of high similarity. The information is available for complete download or for user-configured download using the BioMart query interface. Release 24 (August 2005) is built from UniProt release 5.8 and InterPro release 11.0 and contains 217 bacterial species, 25 eukaryotes and 21 archaea.
9. Discussion
Complete and up-to-date databases of biological knowledge are vital for information-dependent biological and biotechnological research. Much of the value of these resources is as part of an interconnected network of related databases, and many maintain cross-references to other databases. These cross-references provide the basic platform for more advanced data integration strategies.
The rapid accumulation of genome sequences for many organisms has turned attention to the identification and function of proteins encoded by these genomes. With the increasing volume and variety of protein sequences and functional information available, querying a manually annotated database such as UniProtKB gives a user more criteria on which to search, for example: give me all proteins in mouse that are phosphorylated on serine; give me the protein domain architecture of alternative splice isoforms; how many proteins have been identified in the Drosophila melanogaster genome? This value-added protein information is not available within a sequence repository database. Cross-references available within a UniProtKB entry allow a user to link out to many databases, including those for more protein-specific data, a nucleotide database or an organism database.
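Questions of this kind amount to filtering entries on their annotation fields. A minimal sketch follows, assuming a hypothetical in-memory list of entries: the field names (`organism`, `modified_residues`), the accession strings and the data are invented for illustration and do not reflect the UniProtKB entry format or any UniProt query interface.

```python
# Invented toy entries; each modified residue is a (kind, position) pair.
entries = [
    {"accession": "TOY0001", "organism": "Mus musculus",
     "modified_residues": [("Phosphoserine", 15)]},
    {"accession": "TOY0002", "organism": "Mus musculus",
     "modified_residues": [("Phosphothreonine", 88)]},
    {"accession": "TOY0003", "organism": "Drosophila melanogaster",
     "modified_residues": [("Phosphoserine", 42)]},
    {"accession": "TOY0004", "organism": "Drosophila melanogaster",
     "modified_residues": []},
]


def phosphoserine_proteins(records, organism):
    """Accessions from one organism with at least one phosphoserine annotation."""
    return [
        r["accession"]
        for r in records
        if r["organism"] == organism
        and any(kind == "Phosphoserine" for kind, _pos in r["modified_residues"])
    ]


def count_by_organism(records, organism):
    """Number of entries annotated with the given organism."""
    return sum(1 for r in records if r["organism"] == organism)


print(phosphoserine_proteins(entries, "Mus musculus"))
print(count_by_organism(entries, "Drosophila melanogaster"))
```

A sequence-only repository could answer neither question, because both depend on curated annotation attached to each entry.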
Comparative databases allow users to identify gene or protein orthologs, so a query for one protein could, potentially, have a species-wide result. The use of standard identifiers, naming conventions and controlled vocabularies, the adoption of standards for data representation and exchange, and the use of data warehousing technologies enable such far-reaching results.
References
Allen, F.H., 2002. The Cambridge structural database: a quarter of a million crystal structures and rising. Acta Crystallogr. B 58, 380–388.
Ashurst, J.L., Chen, C.K., Gilbert, J.G.R., Jekosch, K., Keenan, S., Meidl, P., Searle, S.M., Stalker, J., Storey, R., Trevanion, S., Wilming, L., Hubbard, T., 2005. The vertebrate genome annotation (Vega) database. Nucl. Acids Res. 33, D459–D465.
Attwood, T.K., Bradley, P., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G., 2004. The PRINTS protein fingerprint database: functional and evolutionary applications. In: Dunn, M., Jorde, L., Little, P., Subramaniam, A. (Eds.), Encyclopaedia of Genomics. Proteomics and Bioinformatics.
Bairoch, A., 2000. The ENZYME database in 2000. Nucl. Acids Res. 28, 304–305.
Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O’Donovan, C., Redaschi, N., Yeh, L.S.L., 2005. The universal protein resource (UniProt). Nucl. Acids Res. 33, D154–D159.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L.L., et al., 2004. The Pfam protein families database. Nucl. Acids Res. 32, D138–D141.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The protein data bank. Nucl. Acids Res. 28, 235–242.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L., 2005. GenBank. Nucl. Acids Res. 33, D34–D38.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., et al., 2003. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res. 31, 365–370.
Brooksbank, C., Cameron, G., Thornton, J., 2005. The European Bioinformatics Institute’s data resources: towards systems biology. Nucl. Acids Res. 33, D46–D53.
Bru, C., Courcelle, E., Carrère, S., Beausse, Y., Dalmar, S., Kahn, D., 2005. The ProDom database of protein domain families: more emphasis on 3D. Nucl. Acids Res. 33, D212–D215.
Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., Apweiler, R., 2004. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucl. Acids Res. 32, D262–D266.
Chen, N., Harris, T.W., Antoshechkin, I., Bastiani, C., Bieri, T., Blasiar, D., Bradnam, K., Canaran, P., Chan, J., Chen, C.K., et al., 2005. WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucl. Acids Res. 33, D383–D389.
Dowell, R.D., Jokerst, R.M., Day, A., Eddy, S.R., Stein, L., 2001. The distributed annotation system. BMC Bioinformatics 2, 7.
Drysdale, R.A., Crosby, M.A., The FlyBase Consortium, 2005. FlyBase: genes and gene models. Nucl. Acids Res. 33, D390–D395.
Eppig, J.T., Bult, C.J., Kadin, J.A., Richardson, J.E., Blake, J.A., The Mouse Genome Database Group, 2005. The Mouse Genome Database (MGD): from genes to mice - a community resource for mouse biology. Nucl. Acids Res. 33, D471–D475.
Fleischmann, A., Darsow, M., Degtyarenko, K., Fleischmann, W., Boyce, S., Axelsen, K.B., Bairoch, A., Schomburg, D., Tipton, K.F., Apweiler, R., 2004. IntEnz, the integrated relational enzyme database. Nucl. Acids Res. 32, D434–D437.
Garavelli, J.S., 2004. The RESID database of protein modifications as a resource and annotation tool. Proteomics 4, 1527–1533.
Gene Ontology Consortium, 2004. The Gene Ontology (GO) database and informatics resource. Nucl. Acids Res. 32, D258–D261.
Gough, J., Chothia, C., 2002. SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucl. Acids Res. 30, 268–272.
Haft, D.H., Selengut, J.D., White, O., 2003. The TIGRFAMs database of protein families. Nucl. Acids Res. 31, 371–373.
Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., et al., 2004. IntAct: an open source molecular interaction database. Nucl. Acids Res. 32, D452–D455.
Höld, K.M., Sirisoma, N.S., Ikeda, T., Narahashi, T., Casida, J.E., 2000. Alpha-thujone (the active component of absinthe): gamma-aminobutyric acid type A receptor modulation and metabolic detoxification. Proc. Natl. Acad. Sci. U.S.A. 97, 3826–3831.
Hoogland, C., Mostaguir, K., Sanchez, J.C., Hochstrasser, D.F., Appel, R.D., 2004. SWISS-2DPAGE, ten years later. Proteomics, 4.
Huang, H., Barker, W.C., Chen, Y., Wu, C.H., 2003. iProClass: an integrated database of protein family, function and structure information. Nucl. Acids Res. 31, 390–392.
Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., et al., 2005. Ensembl 2005. Nucl. Acids Res. 33, D447–D453.
Hulo, N., Sigrist, C.J.A., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., Bairoch, A., 2004. Recent improvements to the PROSITE database. Nucl. Acids Res. 32, D134–D137.
Kanz, C., Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., Browne, P., van den Broek, A., Castro, M., Cochrane, G., et al., 2005. The EMBL nucleotide sequence database. Nucl. Acids Res. 33, D29–D33.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., Weber, R.J., Haussler, D., Kent, W.J., 2003. The UCSC genome browser database. Nucl. Acids Res. 31, 51–54.
Kash, T.L., Trudell, J.R., Harrison, N.L., 2004. Structural elements involved in activation of the gamma-aminobutyric acid type A (GABAA) receptor. Biochem. Soc. Trans. 32, 540–548.
Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., Apweiler, R., 2004. The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985–1988.
Kersey, P.J., Bower, L., Morris, L., Horne, A., Petryszak, R., Kanz, C., Kanapin, A., Das, U., Michoud, K., Phan, I., et al., 2005. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucl. Acids Res. 33, D297–D302.
Lefranc, M.P., Giudicelli, V., Kaas, Q., Duprat, E., Jabado-Michaloud, J., Scaviner, D., Ginestoux, C., Clément, O., Chaume, D., Lefranc, G., 2005. IMGT, the international ImMunoGeneTics information system. Nucl. Acids Res. 33, D593–D597.
Leinonen, R., Diez, F.G., Binns, D., Fleischmann, W., Lopez, R., Apweiler, R., 2004. UniProt archive. Bioinformatics 20, 3236–3237.
Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P., Bork, P., 2004. SMART 4.0: towards genomic data integration. Nucl. Acids Res. 32, D142–D144.
Martens, L., Hermjakob, H., Jones, P., Taylor, C., Gevaert, J., Vandekerckhove, J., Apweiler, R., in press. PRIDE: The PRoteomics IDEntifications database. Proteomics, PPP Special Issue.
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., Kloos, D.U., Land, S., Lewicki-Potapov, B., Michael, H., Munch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S., Wingender, E., 2003. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucl. Acids Res. 31, 374–378.
Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., Guo, N., Mruganujan, A., Doremieux, O., Campbell, M.J., Kitano, H., Thomas, P.D., 2005. The PANTHER database of protein families, subfamilies, functions and pathways. Nucl. Acids Res. 33, D284–D288.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., et al., 2005. InterPro, progress and status in 2005. Nucl. Acids Res. 33, D201–D205.
Pearl, F., Todd, A., Sillitoe, I., Dibley, M., Redfern, O., Lewis, T., Bennett, C., Marsden, R., Grant, A., Lee, D., et al., 2005. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucl. Acids Res. 33, D247–D251.
Pruitt, K.D., Tatusova, T., Maglott, D.R., 2005. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucl. Acids Res. 33, D501–D504.
Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., Lopez, R., 2005. InterProScan: protein domains identifier. Nucl. Acids Res. 33, W116–W120.
Schmid, C.D., Praz, V., Delorenzi, M., Périer, R., Bucher, P., 2004. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucl. Acids Res. 32, D82–D85.
Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., Schomburg, D., 2004. BRENDA, the enzyme database: updates and major new developments. Nucl. Acids Res. 32, D431–D433.
Tateno, Y., Saitou, N., Okubo, K., Sugawara, H., Gojobori, T., 2005. DDBJ in collaboration with mass-sequencing teams on annotation. Nucl. Acids Res. 33, D25–D28.
Twigger, S., Lu, J., Shimoyama, M., Chen, D., Pasko, D., Long, H., Ginster, J., Chen, C.F., Nigam, R., Kwitek, A., et al., 2002. Nucl. Acids Res. 30, 125–128.
Wain, H.M., Lush, M., Ducluzeau, F., Povey, S., 2002. Nucl. Acids Res. 30, 169–171.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S., Helmberg, W., et al., 2005. Database resources of the National Centre for Biotechnology Information. Nucl. Acids Res. 33, D39–D45.
Wu, C.H., Yeh, L.S., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R.S., Suzek, B.E., et al., 2003. The protein information resource. Nucl. Acids Res. 31, 345–347.