2

Click here to load reader

Not who you know, but what you know

Embed Size (px)

Citation preview

Page 1: Not who you know, but what you know

Data integration and access con-jures up many different visions in

the minds of both researchers andbioinformaticians – but all agree thatit is a good idea. However, often this is not easily translated into an infor-mation technology system or infra-structure that is easy to maintain,widely used and accessible to thosewho need to use it.

At the gateway to the post-genomics era, it is becoming moreand more obvious that the integra-tion and efficient use of data withincommercial and academic arenas willbe essential for future research to becarried out in the most cost-effectiveand productive fashion.

This is especially true for the phar-maceutical and biotechnology indus-tries where strong competition andshrinking therapeutic markets areforcing companies to rethink theirstrategies for drug development,leading away from ‘serendipitous’discovery to what is being called‘knowledge-led research’. This modelrelies on the efficient management ofnew data, of existing knowledge andits efficient dissemination within the organization (Fig. 1). Implemen-tation of this model not only requiresinnovative bioinformatics solutionsbut also considerable culture changesin the way that employees both accessand feed back data or discoveries tothe organization’s knowledge-base.It was this idea that formed the basisfor a recent meeting*.

The current landscapeIt is obvious that commercial and

academic researchers currently haveunprecedented access to huge, rapidlyexpanding data resources for the studyof disease and fundamental biologi-cal systems. Following on from this,Steve Gardner (Synomics, Cambridge,UK) ‘painted’ a landscape in whichnovel research would be carried outin a radically different way in the next millennium. This radical changewould come about through some orall of the following resources:

(1) the whole human genomesequence will have been published inmoderate detail;

(2) approximately half a millionindividual human variations that con-tribute to disease susceptibility will bemapped and their effects quantified;

(3) several other model organisms’genomes, including many of thoseused in the laboratory, will be knownand greater than 150 pathogenicorganisms’ genomes and severalcrop-plant genomes will be mappedand sequenced in detail;

(4) there will be a much greater setof tissue samples with known gene-and protein-expression profiles cor-related to phenotype and clinical history;

(5) canonical libraries of chemicalstructures will be tested using micro-arrays to develop statistical models oftoxicological and expression effectsin specific tissues;

(6) significant progress will be madein the computational simulation ofdiseases and whole-organ systems,allowing the effect of variation at different intervention points to bepredicted; and

(7) new technological developments

in areas such as high-throughputsequencing (HTS), whole-genomeassociation studies, mHTS, combina-torial chemistry and automatedADME (drug absorption, distribution,metabolism and excretion) will pro-vide ever-greater quantities of complexinterrelated information to draw on.

Although most of this informationis directly relevant to many researchareas, it is the exponential growth(and its increased availability on elec-tronic systems) that might becomewhat one speaker referred to as a ‘tidal wave’, threatening to engulfor overwhelm all but the most determined researchers.

Data or knowledge?An emerging theme, particularly

in bioinformatics and data inte-gration, is the realization of the need to differentiate between ‘data’ and‘knowledge’. Data can be thought ofas large collections of primary obser-vations (e.g. the draft sequence of thehuman genome or a set of expressionprofiles). Knowledge, however, con-sists of organized networks of relatedfacts of use to a research project or commercial entity, which caninclude intellectual property rights orpatentable material.

The rapid generation of new dataobjects has long necessitated theirstorage in large ‘data warehouses’, inwhich they are mined for trends

FORUM

TIBTECH JUNE 2000 (Vol. 18) 0167-7799/00/$ – see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0167-7799(00)01451-7 227

Not who you know, but what you know

Figure 1Knowledge-led R&D requires a flexible infrastructure that is capable of the integration of awide range of data and processes. Abbreviation: HTS, high-throughput screening. (Reproducedwith permission from Synomics, Cambridge, UK).

trends in Biotechnology

Project-based interfaceand

data visualization

Genomic and

proteomic data

Protein andDNA sequenceand structure

Pharmacogenomics

HTSinformation

Chemical-structure

information

Pharmacologyand

toxicology

Patent andbibliographyinformation

Integrationand

decisionsupport

*The 2nd Cambridge Healthtech Institute’sIntegrated Bioinformatics conference was held inZürich, Switzerland, 19–21 January 2000.

Meetingreport

TTEC June 2000 4/5/00 11:07 am Page 227

Page 2: Not who you know, but what you know

Ihave just been reminded of my firsthigh-school dance. Sexes equally

divided and fidgeting on oppositesides of the dance f loor, each withexpectant looks on their faces. Eachwanting to couple and tango, but notquite knowing how to make the firstmove. On each side of the roomthere are varying levels of maturity,but no-one wants to miss out onsomething, whatever that somethingmight be. A similar feeling was pres-ent at a recent two-day conferenceon data mining in bioinformatics*.

Finally, one brazen soul, usually amale in my day but thankfully now

likely to be of either sex, makes amove forward. Slowly everybodyjoins in and a good time is had by all. The two sides of the room at theEuropean Bioinformatics Institutewere the biologists and bioinformati-cians on one side and the computerscientists on the other. The ‘brazensoul’ driving us to the dance floorwas the promise of being able toanalyse the vast amount of gene-expression data being collectedworldwide using microarray tech-nology. There will definitely be moredances in the future now that wehave got to know our opposite num-bers a little better. Part of the grow-ing up process is learning about yourpartner, and that was the major focusof the meeting; but let us start withthe brazen soul.

Microarray technology as thebrazen soul

As Paul Spellman (Stanford Uni-versity, CA, USA) so aptly showed,we are at a new frontier in biology,and this is wonderfully captured inFig. 1. Figure 1 shows the completeexpression pattern of yeast, consist-ing of over 6000 genes, covering allaspects of the cell cycle, sporulationand nutritional variation, as well asstress responses to heat shock andoxidative stress (the latter two beingremarkably similar). The dataset contains approximately 2.5 millionindependent, noisy and in some way correlated observations; a perfectpartner for someone interested indata mining.

Databases and data miningParticipants on one side of the

room were given several lessons inthe principles of data mining, whichare themselves still evolving withinthe field of computer science. Theopening address by Heikki Mannila

and associations. However, as thesewarehouses grow rapidly in size, datastorage becomes increasingly risky inthat the data will be of variable orunverified quality, leading to falseleads and associations.

This has led to the ‘curated’ data-base, where data is ‘quality assured’by curators or editors before incor-poration (e.g. the Bioknowledgelibrary presented by William Payne,Proteome, Beverly, MA, USA). Theadvantages of this type of system arerapid access to well-linked datasetsand search algorithms resulting inreduced time spent in libraryresearch. However, this too mighthave the disadvantage of maskingsubtle features or associations in theprimary data because it is dependenton the human judgement of the editor to decide what is sufficientlyimportant to be incorporated. Theremight also be many datasets whereinthere is not a strong enough linkagebetween data items (as there is inmembers of the same gene family orreceptor type) for this to be a reality.

Knowledge accessIn order to capitalize on systems

used for the integration of distributeddatabase systems and knowledge bases,these systems, regardless of their

implementation, should support several key functions:• allow data and knowledge to be

effectively combined for researchand management functions;

• be able to support decision mak-ing (e.g. in effective target identi-fication);

• ensure maximum usage of existingpersonnel skills and experience(i.e. the ‘human knowledge base’);and

• aid cross-discipline communicationand data exchange.

Several systems were presentedusing a variety of established tech-nologies, such as the use of CORBA(Tim Clark, Millennium Pharma-ceuticals, Cambridge, MA, USA),and proprietary integration systemssuch as the Sequence Retrieval System (SRS; Reinhard Schneider,LION Bioscience AG, Heidelberg,Germany). In principle, both thesesystems enable linkage between dis-parate datasets and access to these setsvia several established and novel visu-alization tools. The inclusion of accessto familiar tools (e.g. BLAST or otherdatabase-search tools) will be veryimportant in ensuring that researchersreadily take up the new system asopposed to viewing it as an additionalburden they have to master.

Changes in cultureFrom the presentations made at

this meeting, it became clear thatthere is a considerable culture changeunder way in the manner in which datais managed in large companies andresearch bodies. This change encom-passes three levels: (1) strategic plan-ning, (2) bioinformatics implementationand (3) the actual research process.

At a strategic level, to allow bio-informatics specialists to implementsystems that can easily grow andevolve with the research base. At thebioinformatics level, the implemen-tation of new visualization and analysis tools, while rememberingthat data integration is not the endpoint – the end point is the use towhich you want to put the data. Fur-ther, at the research level, where datasubmission to a central knowledgebase becomes an easy and regulartask. In addition, data submission andmining should be strongly coupledwith a culture of the exchange of ideasand feedback with bioinformaticsgroups forming a positive feedbackloop for future developments.

Mark StrivensMedical Research Council Mouse Genome Centre,

Harwell, Oxon, UK OX11 0RD.(E-mail: [email protected])

FORUM

Bioinformatics meets data mining:time to dance?

228 0167-7799/00/$ – see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0167-7799(00)01430-X TIBTECH JUNE 2000 (Vol. 18)

*The conference on Data Mining in Bioinfor-matics was held at the European BioinformaticsInstitute, Hinxton, UK, 10–12 November 1999.

Meetingreport

TTEC June 2000 4/5/00 11:07 am Page 228