Conference Paper
Many Genbank entries for complete microbial genomes violate the Genbank standard
Peter D. Karp*
Bioinformatics Research Group, SRI International, EK223, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
*Correspondence to: P. D. Karp, Bioinformatics Research Group, SRI International, EK223, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA. E-mail: [email protected]
Abstract
A survey of Genbank entries for complete microbial genomes reveals that the majority do
not conform to the Genbank standard. Typical deviations from the Genbank standard
include records with information in incorrect fields, addition of extraneous and confusing
information within a field, and omission of useful fields. This situation results from two
principal causes: genome centres do not submit Genbank records in the proper form and
the Genbank, EMBL and DDBJ staffs do not enforce the database standards that they
have defined. Copyright © 2001 John Wiley & Sons, Ltd.
Keywords: genome annotation; Genbank; bioinformatics; database standards
Introduction
Genome annotation is a complex process with a number of phases, including gene finding, prediction of gene function, prediction of pathways and submission of the genome to the Genbank/EMBL/DDBJ databases (henceforth referred to simply as Genbank). If a submitted genome is not prepared according to the Genbank standard, the scientific community will face significant barriers in accessing and manipulating the genome annotation that was so painstakingly produced. This article presents evidence that many complete genomes within Genbank were not prepared according to the Genbank standard.
Genbank now contains 30 complete bacterial genomes. As the number of complete genomes increases, it becomes more and more important that data within Genbank are encoded in a consistent and regular form that allows computer programs to reliably extract information, since manual interpretation of those records becomes less and less feasible. For example, a computer program that attempts to search across many different Genbank entries to find a given coding region by gene name, or by gene-product name, or by the unique identifier assigned by a sequencing project, must know what Genbank feature-table qualifiers to search for each of these types of information. In isolation, none of the examples presented is that dramatic but, taken together, the scale and diversity of these malformed data create a significant barrier to computational analysis of Genbank.
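The kind of program described above can be sketched as follows. This is a minimal illustration, not a full Genbank parser: it assumes every qualifier fits on one line, and the coding region shown (gene name, product and identifier) is invented for the example. The function names parse_qualifiers and find_features are likewise hypothetical.

```python
import re

# Matches one single-line qualifier such as /gene="abcD" or /label=XY0001.
# Assumption: real records may wrap quoted values over several lines;
# this sketch does not handle that case.
QUALIFIER = re.compile(r'^\s*/(\w+)=(?:"([^"]*)"|(\S+))\s*$')

def parse_qualifiers(feature_lines):
    """Return a dict mapping qualifier name to value for one feature."""
    qualifiers = {}
    for line in feature_lines:
        match = QUALIFIER.match(line)
        if match:
            name, quoted, bare = match.groups()
            qualifiers[name] = quoted if quoted is not None else bare
    return qualifiers

def find_features(features, text):
    """Return features whose gene, product or label qualifier equals text."""
    searched = ("gene", "product", "label")
    return [f for f in features if any(f.get(q) == text for q in searched)]

# Hypothetical coding region, invented for illustration only.
example = [
    '     CDS             100..400',
    '                     /gene="abcD"',
    '                     /product="hypothetical transport protein"',
    '                     /label=XY0001',
]
features = [parse_qualifiers(example)]
```

The search succeeds only if each type of information sits in the qualifier the program expects. If a submitter places the unique identifier in the gene qualifier instead, a lookup by gene name silently returns nothing.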
The Genbank standard is neither followed nor enforced
The genome centres that have submitted Genbank entries for complete genomes are not following the Genbank standard (which is available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html) and the NCBI, EMBL and DDBJ groups that accept new Genbank entries are not enforcing that standard. Figure 1 shows excerpts from three Genbank entries for complete microbial genomes or chromosomes, each of which was prepared by a different sequencing group. The left side of the figure lists the original entry; the right side of the figure shows a corrected version of the entry.
All of the entries in Figure 1 use different syntax and semantics, and all violate the Genbank standard in some way. In 1a, the product name is
Comparative and Functional Genomics, Comp Funct Genom 2001; 2: 25–27.
Figure 1. (1a–3a) Excerpts from three Genbank entries that do not conform to the Genbank standard. (1b–3b) Corrected versions of each entry that do conform to the standard.
prefixed with a variant of the gene name. In example 2a, the product qualifier simply repeats the gene name. The real product name, along with much other useful information, is buried in a text field in a form that cannot be automatically parsed by a computer program. In the case of 3a, the unique ID is in the gene qualifier and the gene name is appended to the product qualifier.
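Deviations of this kind are systematic enough that simple heuristics could flag some of them. The sketch below is one possible approach, not a procedure taken from the Genbank standard. It compares the gene and product qualifiers of a single coding region, represented as a plain dictionary, and catches only exact, case-insensitive matches, so a product prefixed with a variant spelling of the gene name would need a looser comparison. The function name product_issues and the sample values in the usage note are hypothetical.

```python
def product_issues(qualifiers):
    """Flag product values that merely repeat or embed the gene name.

    Heuristic sketch: only exact, case-insensitive matches are detected.
    """
    gene = qualifiers.get("gene", "").strip().lower()
    product = qualifiers.get("product", "").strip().lower()
    issues = []
    if not gene or not product:
        return issues
    if product == gene:
        issues.append("product repeats the gene name")
    elif product.startswith(gene + " "):
        issues.append("product begins with the gene name")
    elif product.endswith(" " + gene):
        issues.append("product ends with the gene name")
    return issues
```

For an invented coding region with gene "abcD" and product "abcD", the function reports that the product repeats the gene name; a product such as "transport protein" yields no issues.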
In addition, none of the entries has a label qualifier containing the unique identifier associated with each coding region. Although the specification does not require that the label qualifier be present, this unique identifier is useful for database linking.
A list of 11 non-conformant Genbank entries and a conversion of those entries to a form that does meet the standard is provided at http://www.ai.sri.com/pkarp/misc/gbkexample.html
Discussion
Although it is troubling that the sequencing projects are not following the Genbank standard, it is even more troubling that the database staffs are not enforcing their own standards. An important role of the Genbank staff is ensuring that only high-quality data enter Genbank, which is the principal archive of nucleotide-sequence information for the scientific community. The Genbank staff should refuse to accept entries that do not conform to the Genbank standard. Although the staff might argue that their resources are inadequate for policing every submission to Genbank, we would argue that at least a minimal level of manual checking should be performed for entries for complete genomes. Literally 15 minutes of inspection would suffice to identify many of the problems we have listed. Inspection of every coding sequence in a file is generally not necessary, because these files are typically generated by programs that create the same non-conformant fields in a systematic fashion for every coding region.
Furthermore, some automated checks should be performed on every incoming entry, such as verifying that the contents of the EC qualifier are a valid EC number, verifying that the contents of the label qualifier are unique across the entry, and verifying that a label qualifier is provided for every coding region.
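The three checks just listed might look like the following sketch. It makes several assumptions: each coding region has already been parsed into a dictionary of qualifiers, the EC qualifier is stored under the key EC_number, and validity of an EC number is tested only syntactically (four dot-separated fields, each a number or a dash), not against the enzyme nomenclature itself. The function name check_entry and the sample data are hypothetical.

```python
import re
from collections import Counter

# Syntactic check only: four dot-separated fields, where the last three
# may be "-" for an unspecified level. It does not confirm that the
# enzyme class actually exists.
EC_PATTERN = re.compile(r'^\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)$')

def check_entry(coding_regions):
    """Return a list of human-readable problems found in one entry."""
    problems = []
    labels = Counter()
    for index, qualifiers in enumerate(coding_regions):
        ec = qualifiers.get("EC_number")
        if ec is not None and not EC_PATTERN.match(ec):
            problems.append(f"CDS {index}: malformed EC number {ec!r}")
        label = qualifiers.get("label")
        if label is None:
            problems.append(f"CDS {index}: missing label qualifier")
        else:
            labels[label] += 1
    # Report every label that occurs more than once in the entry.
    for label, count in labels.items():
        if count > 1:
            problems.append(f"label {label!r} used {count} times")
    return problems

# Hypothetical entry, invented for illustration only.
entry = [
    {"label": "XY0001", "EC_number": "1.1.1.1"},
    {"label": "XY0001", "EC_number": "1.1.1"},
    {"product": "hypothetical protein"},
]
problems = check_entry(entry)
```

For the invented entry above, the sketch reports one malformed EC number, one missing label and one duplicated label. Because non-conformant fields tend to be produced systematically, a check of this kind would typically flag either nearly every coding region in a file or none.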
Some simple rules to remember when formulating Genbank entries are:
• Put each piece of information in the appropriate qualifier.
• Supply as many qualifiers for each coding sequence as can reasonably be provided.
• Do not attempt to be creative by adding additional information into a given qualifier. For example, adding multiple synonyms for the gene name inside a given gene qualifier violates the specification and could produce erroneous results in software that processes that qualifier.
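The last rule lends itself to a simple automated test. The sketch below assumes that separator characters inside a gene qualifier value (commas, semicolons, slashes, parentheses or whitespace) indicate that several names were packed into it. This is a heuristic chosen for illustration and could misfire on unusual but legitimate gene names; the function name and the sample values are hypothetical.

```python
import re

# Heuristic assumption: any of these characters inside a gene qualifier
# value suggests that more than one name was packed into it.
SEPARATORS = re.compile(r'[,;/()\s]')

def has_packed_synonyms(gene_value):
    """Return True if the gene qualifier value appears to hold several names."""
    return bool(SEPARATORS.search(gene_value.strip()))
```

Under this heuristic a value such as "abcD" passes, whereas "abcD, xyzA" or "abcD (xyzA)" is flagged for manual review.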
See http://www.ai.sri.com/pkarp/misc/gbkexample.html for more examples of conformant Genbank entries.
Acknowledgements
This work was sponsored by Grant 1-R01-RR07861-01 from
the National Institutes of Health.