Challenges and Opportunities for Biological Language Modelling in Biomedical High-Throughput Genomic and Proteomic Informatics

Appl Bioinformatics 2004; 3 (2-3): 77-80

EDITORIAL FOREWORD

1175-5636/04/0002-0077/$31.00/0

© 2004 Adis Data Information BV. All rights reserved.


Bioinformatics boils down to the interplay between three entities: concepts, methods and data. Most contributions in bioinformatics are attempts to draw or create relationships among these three entities. This includes contributions in analysis informatics, with methods and models designed to aid interpretation, and contributions in database informatics, with data structures, common data elements and ontologies. Our concepts are, of course, the representation of whatever ephemeral knowledge about the world we carry in our brains. Data from experiments are the representation of what we all agree (more or less) to share similar concepts about. Such representations are used to encode, condense, summarise or display information about ‘reality’. Methods provide a conceptual framework in which we can explore, share, communicate and critique how well those concepts are supported by ‘reality’. In database informatics, knowledge structures can sometimes be imposed on us by the way data are stored and in the ways in which we are, or are not, provided access. As part of best practices of development, such hard-wiring of paradigms should be avoided whenever possible.

Sir Karl Popper was a purist on the ideals of knowledge; he held that truth, knowledge and facts were independent of any one individual human mind. Specifically, he posited in his theory of verisimilitude (Popper, 1963)[1] that we (hopefully) evolve continuously on an asymptotic approach to truth, with our understanding obtaining increasing accuracy in the relations between language and reality, ever seeking but never obtaining verification. The failure of positive verification stems not from the metaphysical nature of truth but, instead, from the intrusion of limitations imparted by a language representation of thought (semantics). Popper’s original formulation of verisimilitude was defeated independently due to errant bookkeeping on Popper’s account of falsity by Tichy and Miller (see the entire story in Watkins, 1997[2]), but, in an age where we are, indeed, overwhelmed by a plethora of ‘facts’ and mountains of data, it is worthwhile to pause and consider for a moment the relationships that so interested Popper in the context of today’s high-throughput, high-information science wherein relativism deserves further critical analysis. The rate at which biological information is now being accumulated has outstripped the imagination of everyone in Popper’s generation and even of my own generation in our first years in graduate school, not all that many years ago. Unforeseen, incidental utility of data collected for some purpose is everywhere. High-throughput genetic, genomic and proteomic data platforms are matched with computational abilities to store and retrieve information accumulated for whatever purpose – specific or not – with or without a strong or weak relationship to any particular hypothesis and without any directed purpose such as determining whether all geese on the planet are white.

In this, there appears to be a certain amount of freedom, and tolerance of diversity of thought and method. Restrictions are rarely placed or at least, perhaps, are difficult to enforce; for example, in spite of efforts to formalise submission requirements for microarray data,[3-5] some journals still accept, as supplementary material, fold-change ratios of expression values as ‘data’ and not as ‘results’. How we decide to pull various data streams together, and into what conceptual framework we package them, can have a large impact on how researchers ‘see’ the data. The conceptual frameworks that provide the best perceived matches between models and data are also perceived to exhibit the highest utility in application. Researchers in bioinformatics and computational biology must be diligent in their process evaluations to ensure that the process of matching data to models (model fitting) is not confused with the process of matching models and methods to conceptual representations (model construction). These are both important but distinct processes, with the former informing the latter. Two classes of measures of the ‘match’ of data to models include external predictiveness and goodness-of-fit. Measures of the unbiased external generalisability of predictiveness (on validation datasets) should be preferred over goodness-of-fit because such measures are the most independent of concepts and, thus, can best report on the utility of a model when applied to unseen data, whereas goodness-of-fit measures wrap the model into the measure. One such framework is provided by biological language modelling (BLM), in which concepts from language technologies have found applications in bioinformatics. This is based on the observation that biological sequences (nucleotide and protein) can be ‘read’ and ‘understood’ in terms of biological function, in analogy to the meaning in language. The usefulness of this analogy lies in two aspects: (i) it allows the direct application to bioinformatics of methodologies developed for speech and language; and (ii) it allows researchers from disparate backgrounds to relate, because we all use language and are therefore amenable to concepts derived from language.

With the explosion of data, the rate of accumulated ‘knowledge’ may not be keeping pace, due to the expansion of relativism and an apparent dearth of application of widely adopted objective criteria for evaluating conceptual frameworks. I have mentioned two types of objective criteria, and have argued for preferring those that report on the externally valid (unbiased) generalisable predictiveness. A third criterion, which is partially emulated by the predictiveness criterion, is, of course, utility in application, but this criterion is hard to measure because it cannot, unfortunately, be divorced from irrational factors like precedent, fads, and the personal drive of researchers seeking industrial or biomedical application of their knowledge.

The question must be asked, then: is BLM a mere analogy for other inferences that could be carried out without reference to their particular packaging of concepts? Is it a useful framework within which pattern discovery (hypothesis generation), hypothesis testing and communication about this knowledge can efficiently be studied with a gain that extends beyond that which can be achieved without BLMs? Or is it merely another way of representing knowledge that might better be accomplished within some other framework? The answer to the question is, of course, ultimately dependent on and determined by the practical utility of the results generated by such endeavours. At one level, BLM can help keep us cognisant of how the processes by which, and structures with which, we represent and encode what we think we know, and what we think we are studying, influence our ability to achieve our own stated goals. BLMs are useful then from the perspective that anything that causes researchers to think about how they are thinking about their data is useful. Bioinformatics of course had some of its beginnings in language analogies, with conceptual frameworks borrowed directly from computational linguistics.[6] How tempting it must have been to imagine that the genome sequences (which can be represented by letters) and protein sequences (which again can be represented by letters) were both, perhaps, actually also constructed of words (with meanings that were context-dependent) and sentences, complete with potentially knowable grammatical rules, which, if catalogued and studied, could be used to create a comprehensive and global understanding of the workings of molecular cellular biology! More recently, the focus has moved toward syntactic methods and a more structured model-based view of sequence data, and the daily utility of algorithms generated from this work by far outstrips any other computational utilisation in the bioinformatics realms. One need only view the gains in performance provided by the application of profile-based methods to sequence searching (for example, hidden Markov models and the position-specific scoring matrices of PSI-BLAST[7]) to appreciate the amount of potential in this conceptual framework.

Some of the conceptual representations afforded by BLM, such as the sequence motif, will be familiar to most practising bioinformaticists. Others employ concepts that will be foreign to most practising bioinformaticists, such as n-grams (also known as k-tuples). A number of research areas exist where a working knowledge of BLM would appear to benefit basic and applied genomic and proteomic research. The first hope, of course, is to comprehend the language of the genes. The study of structural and functional elements in genes is greatly enhanced with BLM. Protein folding is similarly enriched. These problems fall into the category some recognise as more ‘computational biology’ than ‘bioinformatics’. Sequence-based genome research can be aided by language models. Of particular note is the very large and open question of the structural and functional significance of highly conserved messages encoded in noncoding transcripts and so-called ‘junk’ DNA sequences. With whole human, chimp and mouse genomes completed, and others anticipated soon, the playground of biological language modellers seems large and inviting indeed.

The study of gene-gene interactions, protein-protein interactions and protein-gene interactions can certainly also be facilitated by BLMs. For example, researchers who perform microarray experiments are naturally concerned with the real possibility of cross-hybridisation among targets and probes (nonspecific hybridisation). Consider that RNA interference is now being developed as a potential tool for modulating over-expressed genes in cancers. This raises the question of what information (i.e. messages), if any, cross-hybridisation among RNA species (which we know occurs) provides about the rules and dynamic workings of distinct regulatory circuits of human tissues. Could anti-sense RNAs be used to modulate or time protein messages? These are big open questions, ripe for BLM, but to be more fully representational, the hybridisation kinetics among sequences might prove useful as well to identify potential hidden signalling in the transcriptome. Another big open question of gene talk is the mystery behind the control of alternative splicing. These questions might just bend to the application of BLM synthesised with other conceptual representational frameworks. Researchers who typically cluster genes using microarray data in hopes of finding co-regulated gene sets will appreciate that expression correlations are merely one, fairly superficial, conceptual dimension of information that might be available in whole-genome expression studies; perhaps language modelling at the level of semantics and syntactic models may prove useful.

Clearly, BLMs can be enabling conceptual representations, as was shown recently at the Biological Language Conference 2003 held at Carnegie Mellon University, Pittsburgh, USA (http://flan.blm.cs.cmu.edu/meeting2003/). The organisers of the conference brought together researchers who use the language analogy and those who use other bioinformatics approaches. This issue of Applied Bioinformatics contains a selection of papers from this conference. As you will see from this issue, the papers presented at the Biological Language Conference provide ample evidence of the enabling capability of BLMs. Building on the momentum of BLMs, what alternatives might exist in the future? Where, when and how do these conceptual representations constrain us from understanding biological reality? Are these mere, incomplete, conceptual conveniences that can be used to convince our brains, for the time being, that we can capture, represent and understand ‘reality’?

My perceived, and most immediate, answer is a hopeful ‘yes’. Bioinformatics and computational biology, as disciplines and sciences, require honesty in stated limitations of knowledge, metacognition on limitations in capabilities and, above all, comparative evaluation of the utility of information provided by methods and conceptual representations. Opportunities for synthesis across conceptual frameworks have proven especially fruitful. One good example is the relatively recent combination of probabilistic methods with syntactic constructs. Given that timing of expression and messaging and RNA, DNA and protein interactions is so important for harmonic cellular functioning, BLM analogies can be seen as an approximation of the symphony of cellular molecular dynamics. The challenge for this generation of language modellers may be to model these dynamics as a musical composition of a self-organised orchestra, which in cancer and other diseases is cacophonic, the genomic and proteomic noise a mixture of errant timing of correct phrases, correct timing of errant phrases, and instruments playing out of tune.

The questions raised about BLM in this editorial are not limited to BLMs. They could be, and perhaps even should be, applied to all types of knowledge representation, analysis and modelling in bioinformatics. The first overarching challenge to all conceptual frameworks that hope to make lasting contributions – measured by their gain through applications and new knowledge discoveries – is, perhaps ironically, the challenge of communicating the promise of the new techniques that originate from within or between conceptual frameworks. The second is a warning on the limits with which complex algorithms can be made to reproduce, with high fidelity, many of the emergent properties associated with our perceived biological ‘reality’. Additionally, our senses will more likely fail to discriminate, in such cases of apparent success, between instances where we have matched model with model instead of concept with concept or data with data. The hope of matching data to ‘reality’ without the filters of concepts and models (figure 1) seems unrealistic. This speaks to the importance of staying ‘on task’, and not falling into the trap of accepting our models as reality, and the utility of maintaining vigilance in our ability to know the difference.

Fig. 1. Models, concepts and even data represent approximations of realities. [Diagram labels: ‘Reality’, Concepts, Data, Models, Methods.]

Applied Bioinformatics is one venue in which the exploration of such limits is heartily encouraged.

James Lyons-Weiler
Associate Editor, North America

Acknowledgements

I would like to acknowledge Mark Kon and Judith Klein-Seetharaman, for engaging discussions on models and their role in bioinformatics, and Allen Rodrigo for input.

References

1. Popper KR. Conjectures and refutations: the growth of scientific knowledge. London: Routledge, 2002

2. Watkins J. Obituary of Karl Popper, 1902–1994. Proc Br Acad 1997; 94: 645-84

3. Microarray Gene Expression Data (MGED). A guide to microarray experiments: an open letter to the scientific journals [letter]. Lancet 2002 Sep 28; 360: 1019

4. Ball CA, Sherlock G, Parkinson H, et al. Standards for microarray data [letter]. Science 2002 Oct 18; 298 (5593): 539

5. Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray experiment (MIAME): toward standards for microarray data. Nat Genet 2001 Dec; 29 (4): 365-71

6. Searls DB. Representing genetic information with formal grammars. Human Genome Program, US Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV; 1994 Nov 13-17; Santa Fe (NM).

7. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997; 25: 3389-402

