Linking Databases in Biotechnology

SIMON B. JONES

INTRODUCTION

This paper is probably of a more general nature than others presented at the workshop. It does not have an explicit focus on neuroinformatics; rather, it presents developments in providing information to researchers in the full range of biotechnology disciplines, which necessarily includes neurological research.

Research workers in biotechnology require access to a wide range of quality information: nucleotide sequences, protein structures, structure–effect relationships, genetic and microbial resources, current and completed projects, commercial developments, and the general state of the art in their particular field. At present they must turn to a number of widely differing sources for the different types of data. These sources may be in printed or electronic form, and they may be maintained by commercial or noncommercial organizations.

The work we are undertaking is aimed at providing harmonized access to the information sources and, in addition, at generating an environment in which they can be sustained for the future. This applies especially to the small noncommercial databases which contain vital information, but for which ongoing funding can be hard to obtain.

THE PROJECTS

This paper relates to three European Union (EU)-funded projects as follows:

The Biotechnology Information Strategic Forum (BTSF)

The BTSF was established in December 1992 in recognition of the need to examine emerging problems in supplying biotechnology information from the information producer’s viewpoint, and to ensure that the products offered meet the users’ needs. The problems addressed include access and distribution, copyright and legal issues, and intellectual and product ownership, together with the way in which the market should best use the many new technical innovations that are being introduced. With support from the EU, the BTSF brought together information providers and users in biotechnology in Europe in a unique forum to examine these issues. The BTSF has identified the need for:

• databases to be more accessible through being linked together in scientific packages;

• secure, efficient networks;

• “even playing fields” in terms of the present unbalanced situation where subsidized American databases are able to reach European users on European networks to the detriment of European producers; and

• new financial infrastructures to guarantee the production of specialized databases which themselves might not generate sufficient income for their survival.

The Common Core Database (CCDB) Pilot Project

Three of Europe’s leading literature database producers, CAB International, Elsevier Science, and the Institut de l’Information Scientifique et Technique (INIST), are examining ways in which they can improve the quality and usefulness of the material they offer. One reason for doing this is to improve their competitive position in relation to subsidized database production. Earlier research had clearly found that such improvements might be better achieved if the producers were to work together on common problems and to look at ways in which they might share certain production activities. The producers have identified the common elements in their data and have developed a mechanism for linking and exchanging their data, taking into account the producers’ different editorial policies.

There were two choices for the linking and exchange mechanism. The first would be to import the data into one pooled database with a defined record. The disadvantage of this approach would be that it would lock the producers into a single location and structure, and it might require extensive reworking to add different types of data in the future. The second approach is to put the databases separately into a “tank” from which the records can be extracted as required using software which “knows” about the structure of each. The advantage of this is that it need not anticipate all future requirements, as the parsing software can be applied to other data types. The consortium has implemented the second approach.

NEUROIMAGE 4, S59–S60 (1996), Article No. 0054. 1053-8119/96 $18.00. Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

There are differences between the CCDB approach and the cross-file searching capabilities which are already available on some database hosts. In the CCDB approach, we are anticipating that there will in future be bidirectional links between files of a variety of different types, with few fields in common between the files. We are also deliberately including the rich character sets which are used by the database producers for their internal needs, but which are not currently reflected in the corresponding on-line products. The reason for this is that they are essential for the anticipated exchanges of data, and for display of the information through graphical interfaces which are capable of showing the full range of characters.
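As an illustration only, the “tank” approach described in this section might be sketched as follows. The paper does not specify the record layouts or the parsing software, so the two native layouts, the `CoreRecord` fields, and every class and function name below are assumptions made for the sketch. Each producer's records stay in their native form, and a registered parser that “knows” that form extracts the common elements on request. The sample titles keep their accented characters, in line with the decision to carry the producers' full character sets.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CoreRecord:
    """Hypothetical common core: fields assumed to be shared by all producers."""
    source: str
    record_id: str
    title: str
    year: int


# Both layouts below are invented for illustration; they are not the
# producers' real formats.
def parse_tagged(source: str, raw: str) -> CoreRecord:
    """Parse a tagged layout, one 'TAG: value' pair per line."""
    fields = dict(line.split(": ", 1) for line in raw.splitlines())
    return CoreRecord(source, fields["ID"], fields["TI"], int(fields["PY"]))


def parse_delimited(source: str, raw: str) -> CoreRecord:
    """Parse a pipe-delimited layout: 'id|year|title'."""
    record_id, year, title = raw.split("|", 2)
    return CoreRecord(source, record_id, title, int(year))


class Tank:
    """Holds each database separately, in its native form, with its parser."""

    def __init__(self) -> None:
        self._raw: Dict[str, List[str]] = {}
        self._parsers: Dict[str, Callable[[str, str], CoreRecord]] = {}

    def register(self, source: str,
                 parser: Callable[[str, str], CoreRecord],
                 raw_records: List[str]) -> None:
        # A new data type only needs a new parser; stored data is untouched.
        self._parsers[source] = parser
        self._raw[source] = list(raw_records)

    def extract(self, source: str) -> List[CoreRecord]:
        """Extract common-core records from one database on demand."""
        parser = self._parsers[source]
        return [parser(source, raw) for raw in self._raw[source]]

    def extract_all(self) -> List[CoreRecord]:
        """Extract from every registered database, in registration order."""
        return [rec for source in self._raw for rec in self.extract(source)]


tank = Tank()
tank.register("producer_a", parse_tagged,
              ["ID: A1\nTI: Enzymes du Löwenzahn\nPY: 1995"])
tank.register("producer_b", parse_delimited,
              ["B7|1994|Protéines et structure"])
records = tank.extract_all()
```

Under these assumptions, adding a further database means registering one more parser rather than reworking a pooled record definition, which is the advantage claimed for the second approach.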

ADLIB–Advanced Database Linkages in Biotechnology

This project is to take forward the ideas developed in the Common Core Database Pilot Project and produce a working demonstration of the overall concept of linked databases, which will be tested by an extended community of users across Europe. ADLIB is the first demonstration project in the Biotechnology Programme of the EU’s Fourth Framework. If successful, ADLIB is likely to lead to a commercial product.

The partners are participating on a shared-cost basis and encompass literature database producers (CAB International, Elsevier Science, and INIST), large factual databases (EMBL), small factual databases (INSERM, CERDIC), project and commercial databases (KNAW, Bio-Commerce Data), and primary publishers (Wiley, Springer-Verlag, and Kluwer plus Elsevier Science). Scientific and strategic support is provided by ASFRA. The consortium has established links to important groups of users, specifically the European Molecular Biology Network, the Pharma Documentation Ring, and the CEFIC Science & Technology Working Party, which represents the European chemical companies.

THE LINKED DATABASE MODEL

The model which is likely to emerge from the projects is as follows:

• A “tank” of databases available both as a physically colocated collection on one or more established hosts and as a looser federation of databases accessible over networks.

• Sophisticated parsing systems which can enable links between data of similar and dissimilar types to enable researchers to follow through a research “story.”

• A charging model which enables both academic and commercial users access to the system, and which allows both subscription and pay-as-you-go payment types.

• Security guaranteed to commercial users through the intermediary of the trusted hosts.

• Sustainability of the production of the small databases through their use in this system to complete the research picture and through eased access.

• Efficiency in database production through sharing of common aspects and knowledge.
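The idea of following a research “story” through links between records of dissimilar types can be sketched as a small graph walk. This is a minimal sketch under stated assumptions, not the ADLIB design: the `LinkIndex` class, the `(database, record_id)` keys, and the sample identifiers are all hypothetical. Links are stored in both directions, and a breadth-first walk collects the records reachable within a chosen number of hops.

```python
from collections import defaultdict, deque


class LinkIndex:
    """Hypothetical index of bidirectional links between records.

    A record is identified by a (database, record_id) tuple, so records
    from databases with few fields in common can still be linked.
    """

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        """Store the link in both directions."""
        self._links[a].add(b)
        self._links[b].add(a)

    def neighbours(self, key):
        """Return directly linked records in a stable (sorted) order."""
        return sorted(self._links.get(key, ()))

    def story(self, start, max_hops=2):
        """Breadth-first walk: records reachable within max_hops links."""
        seen = {start}
        order = [start]
        queue = deque([(start, 0)])
        while queue:
            key, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for nxt in self.neighbours(key):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append((nxt, hops + 1))
        return order


# Invented sample records of four different types.
paper = ("literature", "L-001")
sequence = ("sequence", "S-042")
project = ("project", "P-007")
strain = ("strain", "M-113")

index = LinkIndex()
index.link(paper, sequence)
index.link(sequence, strain)
index.link(paper, project)

trail = index.story(paper, max_hops=2)
```

Starting from the literature record, the walk reaches the project and sequence records in one hop and the strain record in two. Because each link is stored in both directions, the same trail can be picked up from any of the four records.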

ISSUES

To reach the desired outcome, the ADLIB consortium will need to address these key issues:

• The technical feasibility of the concept, together with the mechanism for turning it into a viable system for end users.

• The ability of networks, particularly in Europe, to handle the proposed “federation” of the databases: will it be necessary to work first with colocated systems and then move to networked linkages as the bandwidth of the networks increases?

• The implementation of a practical mechanism for supporting small databases.

• The technical feasibility of a secure pay-as-you-go service in a networked environment which is accessible to all users and could provide a single access and billing point to the entire system.

• The fact that there is currently no firm legal basis to prevent the copying and redistribution of electronic products made available over networks. This is a major disincentive to commercial producers putting their databases up on these environments and so steps must be taken to ensure that products are not freely copied and redistributed.
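To make the pay-as-you-go issue above more concrete, a single billing point that supports both subscription and per-access charging might look like the sketch below. It is an assumption-laden illustration only: the paper gives no prices, account structure, or security mechanism, so `Account`, `BillingPoint`, the database names, and the amounts are all invented, and authentication is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Account:
    """Hypothetical user account at the single access and billing point."""
    user: str
    subscriptions: set = field(default_factory=set)  # databases covered by a flat fee
    balance_cents: int = 0  # prepaid credit for pay-as-you-go access


class BillingPoint:
    """One place where every access to every database is charged."""

    def __init__(self, price_cents: Dict[str, int]) -> None:
        self._price = dict(price_cents)  # per-access price for each database
        self.ledger: List[Tuple[str, str, int]] = []  # (user, database, cost)

    def charge(self, account: Account, database: str) -> bool:
        """Charge one access; return False if the access cannot be paid for."""
        if database in account.subscriptions:
            cost = 0  # already covered by the subscription
        else:
            cost = self._price[database]
            if account.balance_cents < cost:
                return False  # insufficient credit: refuse, record nothing
            account.balance_cents -= cost
        self.ledger.append((account.user, database, cost))
        return True


# Invented prices and account, for illustration only.
billing = BillingPoint({"literature": 50, "sequence": 20})
lab = Account("university_lab", subscriptions={"literature"}, balance_cents=30)

results = [
    billing.charge(lab, "literature"),  # covered by subscription
    billing.charge(lab, "sequence"),    # paid from prepaid credit
    billing.charge(lab, "sequence"),    # refused: credit exhausted
]
```

The per-database ledger is the part relevant to sustaining small databases: under this assumed design, it records which producer each access should be credited to.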

If ADLIB reaches a successful conclusion, then it is very likely that opportunities will be sought to extend the models created into other fields where high quality information is a key component of the research and where that information needs to be traced through a number of different sources in order to give a complete picture.

