Upload
lincoln
View
220
Download
2
Embed Size (px)
Citation preview
This inconvenience for the bench biolo-gist is disastrous for the bioinformaticist,who typically needs to aggregate data frommany online sources to create a data set forfurther analysis. When these data reside ondifferent servers using different data formatsand access methods, the first step is to write aset of software ‘scripts’ to fetch them, re-format them and place the extract into a localdatabase. This is not straightforward,because most online biological databaseswere designed to be accessed by humans, notby machines. Bioinformaticists often findthemselves writing scripts to parse theHTML source to extract the data whileignoring graphics links and explanatory text,a process called screen scraping.
Screen scraping is despised for variousreasons. First and foremost, it is brittle.Database managers are always tinkeringwith the user interface, adding a graphichere, moving a button there, to improve theuser experience. Each small change in a pop-ular web page breaks dozens of screen-scraping scripts, causing anguished criesand hair-tearing among the bioinformati-cists who depended on those scripts for theirresearch and the research of the wet labs theysupport. Second, it is unreliable. There is nopublished documentation of what a datasource’s web pages are supposed to contain,so bioinformaticists must guess from a few
examples. Finally, there is massive duplica-tion of effort. Almost every bioinformaticisthas written a parser for the National Centerfor Biotechnology Information (NCBI)BLAST service at least once, sometimesmany times. Because they are one-offs, thesescripts are generally undocumented and notwidely distributed. Most of them only workfor a short time because BLAST changesevery few months.
Bio* projects reduce the painThe bioinformatics community has respon-ded to the challenges posed by this ‘city-state’situation with the Bio* projects, a series offreely available open-source projects (www.open-bio.org), in which nearly a hundredsoftware engineers have developed re-usablecode libraries in the Perl, Java, Python andRuby programming languages (known asBioperl, BioJava, Biopython and Bioruby,respectively). These libraries automate com-mon bioinformatics tasks, such as manipu-lating DNA and protein sequences, and pro-vide methods for importing and exportingdata between data sources and among fileformats. To fetch a piece of data from a data-base, the bioinformaticist uses the Bio*libraries to do the fetch, put the informationin a standard format, and return the refor-matted data to her script. This preventsduplication of effort. No one will ever again
commentary
NATURE | VOL 417 | 9 MAY 2002 | www.nature.com 119
Lincoln Stein
During the Middle Ages and early Renais-sance, Italy was fragmented into dozens ofrival city-states controlled by such legendaryfamilies as the Estes, Viscontis and Medicis.Though picturesque, this political fragmen-tation was ultimately damaging to scienceand commerce because of the lack of stan-dardization in everything from weights andmeasures to the tax code to the currency tothe very dialects people spoke. A fragmentedand technologically weak society was vulner-able to conquest, and from the seventeenthto the nineteenth centuries Italy was domi-nated by invading powers.
The old city-states of Italy are an aptmetaphor for bioinformatics today. Thefield is dominated by rival groups, each pro-moting its web sites, services and data for-mats. Unarguably, this environment of cre-ative chaos has greatly enriched the field.But it has also created a significant hin-drance to researchers wishing to exploit thewealth of genome data to its fullest.
Despite its shaky beginning, the nation ofItaly was eventually forged through a combi-nation of violent and diplomatic efforts. It isnow a strong and stable component of a largereconomic unit, the European Union, withwhich it shares a common currency, a com-mon set of weights and measures, and a com-mon set of rules for national and internation-al commerce. My hope is that bioinformaticswill one day achieve the same degree ofstrength and stability by adopting a universalcode of conduct along the lines I propose here.
Screen scraping: mediaeval torture The promise and peril of the bioinformaticslandscape is clear to any bench biologistattempting to mine the human genome forinformation on, say, a favourite geneticregion. The online sources of these data eachprovide remarkable user interfaces and deeplyinterconnected data sets of great richness. Yeteach interface is different, both in the subsetof data presented and in organization. Theresearcher may find herself devoting as muchtime adjusting to differences in presentationof the data as she does actually thinking aboutthem. The situation is worse when comparinga human gene to its orthologue in anotherspecies. This brings the model organism data-bases into play, each of which has its own type of user interface and format. (See us.expasy.org/alinks.html; www.stat.wisc.edu/biosci/genome.html; and mbcf.dfci.harvard.edu/cmsmbr/biotools/biotools10.html for anidea of the scale of the problem.)
Creating a bioinformatics nationA web-services model will allow biological data to be fully exploited.
Bio*libraries
Bio*libraries
Bio*librariesService registry
Microarray service
Microarray service
GO
serv
ice
GO
serv
ice
SeqFetc
h serv
iceBLAT service
BLAT service
SeqFetch service
Figure 1 Moving towards a bioinformatics nation. Because each data provider (such as Flybase andUCSC) publishes data in an idiosyncratic form, the Bio* software package (Bio* libraries) was createdto massage data into a standard internal format. Unfortunately, Bio* needs to be fixed each time aprovider changes its formats. A web-services world would build on the successes of the Bio* projects bydefining standard interfaces to various types of computations and data formats. The Bio* libraries canbe written to recognize these interfaces, allowing them to interoperate easily with all data providers.A service registry would let data providers enter an electronic ‘address book’, allowing theBio*libraries to locate and interact with new data sources automatically.
© 2002 Macmillan Magazines Ltd
be forced to write a BLAST parser. But the Bio* libraries can’t solve the prob-
lem of the brittleness of online data sources.As soon as one of these web pages changes,the Bio* library breaks and has to be patchedup as quickly as possible. This works onlybecause the Bio* libraries contain a series ofadapter modules for each of the online data-bases, lovingly maintained by a group ofdedicated (and very busy) programmers.The Bio* libraries also cannot immediatelysolve the problems of the data providersthemselves. Whenever two providers needto exchange data, for example to sharesequence annotation data, they must agreeon an ‘exchange of hostages’ treaty in whichthey negotiate the terms and format of theexchange. Needless to say, this type of negoti-ation is awkward and time-consuming.
Strength through unityTo achieve seamless interoperability amongonline databases, data providers mustchange their ways. If they all had identicalways of representing biological data, stan-dard user interfaces and standard methodsfor scripts to access the information, theproblem of gathering and integrating infor-mation from diverse data sources wouldlargely evaporate. But such conformitywould destroy the creative aspect of onlinedatabases — and, indeed, no data providerwould willingly surrender to it (but see Box 1for a proposed code of conduct).
A more acceptable solution relies on theemerging technology of web services. A webservice is a published interface to a type ofdata or computation. It uses commonlyaccepted data formats and access methods,and a directory service that allows scripts tofind them. The web-services system shownin Fig. 1 allows several data sources to pro-vide the same type of service so, for example,the NCBI, the University of California SantaCruz (UCSC) and the European Bioinfor-matics Institute (EBI) could all provide asequence-similarity service. A script writtento use one service would work with them all,even though the three sites could implementthe service in radically different ways: theNCBI might use BLAST, UCSC might useBLAT, and the EBI the SSAHA search engine.
For bench researchers, the web-servicesworld might not look very different fromthe current one. Online databases wouldstill exist, each with its own distinctive char-acter and user interface. But bioinformati-cists would now be able to troll the onlinedatabases to aggregate data simply and reliably. Furthermore, software engineerscould create standard user interfaces thatwould work with any number of online datasources. This opens the door to genome‘portals’ for those researchers who prefer toaccess multiple data sources from a singlefamiliar environment.
A challenge for web services is to deter-
mine what data sources implement whichservices. The solution is a service registry. Toadvertise its services, a data source registers itsservices with a designated directory server onthe Internet. Then, when a script needs accessto a particular service, it consults the registryto find out what data sources provide the ser-vice. If the same service is provided by morethan one data source, the script consults theuser for help, or uses built-in rules to choosethe data source, then returns the results.
Although this proposal may seem a far cryfrom what happens now, the technologyexists to make it a reality. The World WideWeb Consortium, with industry heavy-weights such as IBM and Microsoft, is pro-moting an alphabet soup of standards:SOAP/XML, WSDL, UDDI and XSDL. Theseare already being used by some data providers:examples include those sites that publishgenomic annotation data using the distrib-uted annotation system format (www.bio-das.org) and the EBI’s SOAP-based bib-liographic query service (industry.ebi.ac.uk/openBQS). Also being developed is a highlypromising integration platform called Omni-
gene (omnigene.sourceforge.net); Integr8,an ambitious integration project run by theEBI (www.ebi.ac.uk); and a pharmaceuticalindustry-backed project called I3C (www.i3c.org). Finally, the caBIO project (ncicb.nci.nih.gov/NCICB/core/caBIO) at the USNational Cancer Institute (NCI) is a com-prehensive open-source project that aims tomake the NCI’s cancer databases available viaa web services architecture.
The risk, of course, is that like the mediae-val Italian city-states, each of these projectswill endorse its own idea of standardization,and a chaotic world of incompatible bio-informatics data standards will be replacedby a chaotic world of incompatible web-ser-vice standards. We can look forward to a bitof a struggle before one set of standardsachieves pre-eminence, but I have no doubtthat unity will be reached eventually. ■
Lincoln Stein is at the Cold Spring HarborLaboratory, Cold Spring Harbor, New York 11724,USA. This Commentary is adapted from a keynotespeech given by the author at the 2002 O’ReillyOpen Bioinformatics Conference in Tucson,Arizona, USA.
commentary
120 NATURE | VOL 417 | 9 MAY 2002 | www.nature.com
The downside of the web-services world is that itwon’t be achieved overnight. It will be some yearsbefore web services are stable and completeenough to replace current sites. Meanwhile,I propose the following bioinformatics dataprovider’s code of conduct, to maximize theusefulness and re-usability of a data source.
1. A web page is an interfaceAlthough web pages were designed for access bypeople, they can be accessed by scripts. Guide thistrend, don’t fight it. Online databases shouldprovide the rules for linking to pages, including thehours and frequency available for scripts todownload information.
2. An interface is a contractOnce a web page is in use by screen scrapers, itbecomes a contract between the data provider andthe data consumer. Changing the interface violatesthe contract. Document each interface and warn if itis unstable. If an interface needs to change, provideusers with plenty of advance warning. When awidely used interface is changed, give it a new URLand maintain the legacy interface for a while.
3. Choice is goodMake your information available in several formatsfor bioinformaticists with different needs andabilities. HTML is the least desirable format topublish data for use with bioinformatics scripts.Tab-delimited text is easily handled by scripts andis suitable for many types of simple data. XML isharder to parse but can convey much more complexinformation. A SOAP/XML interface may not beaccessible to novice bioinformaticists, but will begreatly appreciated by the more advanced ones.
4. Allow batch downloadsMake the whole data set available for batchdownload, breaking it up into logical, bite-sizedpieces if necessary. Many developers have foundthemselves writing scripts to download entiredatabases one web page at a time as the only routeto a full data set, which is wildly inefficient.
5. Use existing file formatsAvoid reinventing wheels: use existing file formatswhen possible. For example, there are alreadyenough formats to describe features on a sequence(GenBank, EMBL, GFF, BSML, Agave, GAME, DAS), soit is doubtful that the world needs another. There arealso good formats to describe genome assemblies,microarray experiments and results, three-dimensional structures and bibliographic references.
6. Design sensible formatsIf an existing format doesn’t support the informationthat a data source wants to publish, use commonsense in designing new formats. Use tab-delimitedtext if it is sufficient. If the data are hierarchical, XMLis a natural choice. It is better to start simple than tocreate a complex format to cover all contingencies.
7. Allow ad hoc queriesSupport a true query language. Researchers oftensearch for hidden relationships among biologicaldata, but browsing through a set of hyperlinkedpages may not be the best way. Many onlinedatabases have created web-based forms thatgenerate summary reports based on simple filters.Better still would be to make copies of yourdatabases available for direct access using thedatabase’s native query language. Although this is asignificant investment, it is well worth the effort.
Box 1: Data provider’s code of conduct
© 2002 Macmillan Magazines Ltd