Brief Bioinform 2005 Scherf 287 97

Matthias Scherf, PhD

joined Genomatix Software

GmbH in 2000, where he is

Head of Discovery. He did his

postdoctoral work in the group

of Dr Werner at the GSF

where he developed the first

specific approach for genome

wide promoter prediction in

mammalian genomes. He has

over 15 years of experience in

pattern recognition, artificial

intelligence and medicine.

Anton Epple

received a Masters Degree in

Biology from LMU Munich and

completed postgraduate

studies in Computer Science at

the Technische Universität

München in 2001. His research

interests include the design of

software for natural language

processing, and in particular

information extraction

techniques for systems biology.

He is currently a scientist at

Genomatix GmbH.

Thomas Werner, PhD

is CEO and CSO of Genomatix

Software GmbH. Since 1998 he

has been a full-time

bioinformatics researcher at

the GSF-National Research

Centre for Environment and

Health in Neuherberg,

Germany, focusing on the

analysis of genomic sequences

with special emphasis on aspects

of the regulation of

transcription. He founded

Genomatix Software GmbH in

1997 and it has rapidly

developed a unique expertise

and advanced software for

genomic research.

Keywords: literature/text mining, gene regulation, promoter analysis, integrated analysis

Matthias Scherf,

Genomatix Software GmbH,

Landsberger Strasse 6,

Munich, D-80339, Germany

Tel: 49 89 5997660 Fax: 49 89 59976655 E-mail: [email protected]

The next generation of literature analysis: Integration of genomic analysis into text mining

M. Scherf, A. Epple and T. Werner

Date received (in revised form): 27th May 2005

Abstract

Text-mining systems are indispensable tools to reduce the increasing flux of information in

scientific literature to topics pertinent to a particular interest in focus. Most of the scientific

literature is published as unstructured free text, complicating the development of data

processing tools, which rely on structured information. To overcome the problems of free

text analysis, structured, hand-curated information derived from literature is integrated in

text-mining systems to improve precision and recall. In this paper several text-mining

approaches are reviewed and the next step in development of text-mining systems, which is

based on a concept of multiple lines of evidence, is described: results from literature analysis

are combined with evidence from experiments and genome analysis to improve the accuracy

of results and to generate additional knowledge beyond what is known solely from literature.

INTRODUCTION

The annual worldwide production of

information in publications is estimated to

be 8 terabytes in books, 25 terabytes in

newspapers, 20 terabytes in magazines and

2 terabytes in journals.1 It would take five

years to read the new scientific material

that is produced every day. This rapid

growth of information is observed in the

field of biomedicine as well: over 15

million entries maintained by the

National Library of Medicine2 are

available today in MedLine, which is the

primary source of free textual information

data in biomedical literature. Thousands

of new entries are added every day.

Consequently, information processing

systems must be applied to restrict the

available information to that fraction

which is pertinent to a particular topic or

more precisely even to a particular

context within a topic. A crucial

requirement for such systems is the ability

to analyse and extract information from

unstructured text. The systems available

today put their main focus on the analysis

of abstracts from scientific papers; abstracts

summarise the results of the scientific

work in a compact way and are the

predominant source available in electronic

form.

The challenge that must be addressed in

developing systems for the analysis of text

databases is founded in Zipf's law.3 It

states that few instances (words) cover

most of the text, while most instances

appear very seldom. Our own findings

confirm this statement. In more than 15

million abstracts contained in MedLine,

more than 40 per cent of the words occur

only once. This number illustrates the

lack of standards in unstructured text.

Obviously, even text on the same or

similar topics is not similar with respect to

the wording. As a consequence, general

rules are difficult to set up for the

classification of a text, eg by topic.

Information processing methods,

however, are based on finite sets of well-

defined instructions to accomplish a

certain task. Thus, the development of

algorithms to analyse unstructured text

remains challenging.
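The singleton statistic quoted above can be checked on any text collection with a few lines of code. The following is a minimal sketch, assuming that "words" means distinct lower-cased tokens and that a simple regular expression is an acceptable tokeniser; the function name `hapax_fraction` and the two toy abstracts are illustrative and are not taken from the original MedLine analysis.

```python
import re
from collections import Counter


def hapax_fraction(texts):
    """Return the fraction of distinct words that occur exactly once."""
    counts = Counter()
    for text in texts:
        # Assumption: a word is a run of letters, digits or hyphens.
        counts.update(re.findall(r"[a-z0-9-]+", text.lower()))
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy input, invented for illustration only.
abstracts = [
    "The promoter of the gene binds the factor",
    "The factor regulates the gene",
]
print(f"{hapax_fraction(abstracts):.2f}")
```

On a real corpus the result depends strongly on the tokeniser, so the figure should be read together with the tokenisation rule that produced it.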

The authors are not aware of any

© Henry Stewart Publications 1467-5463. Briefings in Bioinformatics. Vol 6. No 3. 287–297. September 2005


generally applicable approach in the field

of biomedicine and molecular biology

where the number of relevant documents

found by text mining reaches the number

of detected documents (100 per cent

precision) and the number of relevant

documents detected reaches the total

number of relevant documents (100 per

cent recall). However, many different

approaches exist, focusing on specific tasks

and knowledge domains.
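The two measures defined in this paragraph can be written down directly. The sketch below assumes that documents are represented by identifiers and that the set of truly relevant documents is known; the function name and the identifiers `d1` to `d5` are hypothetical.

```python
def precision_recall(detected, relevant):
    """Return (precision, recall) for a set of detected documents.

    precision = relevant documents found / all documents detected
    recall    = relevant documents found / all relevant documents
    """
    detected, relevant = set(detected), set(relevant)
    hits = detected & relevant
    precision = len(hits) / len(detected) if detected else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers.
detected = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5"}
precision, recall = precision_recall(detected, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A system reaches 100 per cent on both measures only when the detected set and the relevant set are identical.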

The next section starts with a brief

overview of the different tasks and

challenges in text mining of biomedical

literature. Our focus will be on the tasks

of identification, description and

classification of relations between

biological entities from free text. The

methods available for this task are

described. Next, the basic biological

concepts that are used to improve and

classify text analysis results by the

integration of content from structured

information resources derived from

literature by human experts are explained.

For further reading, the reader is referred

to Blaschke et al.,4 Shatkay and Feldman,5

Hirschman et al.,6 Dickman,7 De Bruijn

and Martin,8 Grivell,9 Andrade and

Bork10 and Schulze-Kremer11 for in-

depth reviews on methods in text mining

and natural language processing in the

domain of biomedicine and molecular

biology.

The subsequent sections will illustrate

how the principle of data integration in

text-mining systems can be extended to

genome analysis. Our focus will be on the

biological process of transcription

regulation as one example for genome

analysis resulting in gene relations.

TEXT MINING IN MOLECULAR BIOLOGY AND BIOMEDICINE

Text mining deals with the analysis of text

and the extraction of information. One

very common task of text mining in the

field of biomedicine and molecular

biology is to identify and analyse relations

between biological entities such as genes,

proteins or diseases from free text. The

process of text mining for this task implies

the following steps:

Identification of biological entities.

Identification of entity relations.

Classification of entity relations.

Every step of the text-mining process can

be addressed with several different

methods. Consequently a large variety of

methods can be combined to solve the

various aspects of literature mining. The

combinatorial possibilities are also

reflected in the number of currently

available tools for literature analysis.

However, all of these tools address more

or less the same tasks of identification and

analysis of gene relations.

Since the strengths and weaknesses of

the different tools differ according to the

user queries, an objective and fair

comparison is hard to achieve on the level

of integrated tools. Thus we will focus on

the underlying methods which can be

applied for the three steps described above

to offer the reader the basis for a method-

oriented way of evaluation.

Step 1: Identification of biological entities

Biomedical literature contains a special

category of entities that refer to gene and

protein names, chemical compounds,

diseases, tissues, cellular components or

other predefined biological concepts.

Therefore, identification of such

biological entities in text is a crucial first

step and essential for any subsequent

analysis.

The major challenge in entity

identification is the synonym/homonym

problem: biological entities such as genes

not only have different names, ie

synonyms (eg CSEN, DREAM,

KCHIP3, MGC18289 and KCNIP3)

with different typographical variants (eg

Kcnip3, KCNIP-3 and KCNIP 3), but the

names can also be ambiguous. An

example for ambiguity is the abbreviation

NRL, which is used for natural rubber latex as well as for the neural retina leucine zipper gene.

Precision and recall are crucial measurements for literature analysis

The synonym/homonym problem

Owing to the synonym/homonym

problem, the task of entity recognition

requires an identification and

disambiguation step. Identification

strategies range from methods that use ad

hoc rules about typical syntactic structures

of entity identifiers12 to algorithms that

search identifiers of a given dictionary

with exact and inexact pattern matching

methods.13 Principally, dictionaries are

used in combination with pattern

recognition approaches. The dictionaries

are based on publicly available sources of

standardised, structured data annotated by

human experts. Examples for sources are

HUGO Gene Nomenclature

Committee,14,15 sequence annotation

databases such as LocusLink,16 protein

databases such as Swiss-Prot,17 Gene

Ontology (GO)18 or medical pathological

terms UMLS.19
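A dictionary-based identification step of the kind described above can be sketched as follows. This is a minimal illustration, assuming that typographical variants differ only in case, hyphens and spaces; the synonym list is the KCNIP3 example given earlier in the text, while the function names, the token pattern and the choice of KCNIP3 as the canonical identifier are assumptions made for this sketch.

```python
import re

# Synonyms taken from the example in the text; the canonical key is an assumption.
SYNONYMS = {"KCNIP3": ["CSEN", "DREAM", "KCHIP3", "MGC18289", "KCNIP3"]}

# A word, optionally followed by a hyphen or space and a number ("KCNIP-3", "KCNIP 3").
TOKEN = re.compile(r"([A-Za-z][A-Za-z0-9]*)(?:[- ](\d+))?")


def normalise(name):
    """Collapse typographical variants: drop spaces and hyphens, upper-case."""
    return re.sub(r"[\s-]+", "", name).upper()


def build_index(synonyms):
    """Map every normalised name to the set of canonical identifiers using it."""
    index = {}
    for canonical, names in synonyms.items():
        for name in names:
            index.setdefault(normalise(name), set()).add(canonical)
    return index


def find_entities(text, index):
    """Return (matched text, candidate identifiers) for every dictionary hit."""
    hits = []
    for match in TOKEN.finditer(text):
        # Try the token with its numeric suffix first, then the bare word.
        for candidate in (match.group(0), match.group(1)):
            identifiers = index.get(normalise(candidate))
            if identifiers:
                hits.append((candidate, sorted(identifiers)))
                break
    return hits


index = build_index(SYNONYMS)
print(find_entities("Kcnip3, KCNIP-3 and KCNIP 3 all denote DREAM.", index))
```

Because the lookup ignores case, the ordinary English word "dream" would also be reported. That is the homonym problem in miniature, and it is why a separate disambiguation step has to follow identification.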

Although a dictionary-based search

allows a robust algorithm-based

identification of entities, it is naturally

restricted to the terms included in the

dictionary. The disambiguation step

implies a classification method deciding

whether the text where the entity has

been identified refers to the expected

topic. A large variety of methods, ranging

from machine learning,20 support vector

machines21 and hidden Markov models22

to Bayesian learning and decision

trees,23,24 have been applied to this task.

It is difficult to compare methods for

identification and disambiguation, since

the published methods generally focus on

different kinds of biological entities and

are often trained and tested on preselected

text sets from certain domains. For

example, the authors of the PROPER

method12 reported over 90 per cent in

precision and recall but calculated these

values based on only 30 abstracts of papers

about the SH3 domain.

Regardless of the different methods

applied, most authors agree that

interaction with experts is indispensable

to obtain good results in identification

and disambiguation. These experts define

the dictionaries for entity identifiers and

create simple, generally applicable, rules

for disambiguation. Thus, the

performance of the various methods relies

to a significant fraction on the excellence

of the biological experts involved rather

than on the algorithm employed to

encode and use the expert knowledge.

Step 2: Identification of relations between entities

The next goal after identification and

disambiguation of biological entities is the

detection of relations between the entities

as the text describes them. The most

straightforward approach for this task is to

assume relations between entities based on

co-occurrence in a text. The probability

of an established relation between entities

depends to some extent on the location of

the entities within a text. The weakest

assumption about a relation is due to a co-

occurrence of entities anywhere in a text.

If two entities co-occur within the same

sentence, a true relation becomes more

likely, while the coverage might decrease

simultaneously.
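The trade-off between co-occurrence anywhere in a text and co-occurrence within one sentence can be made concrete with a small sketch. It assumes that entities have already been identified and are matched as whole words, and that sentences end with a full stop, question mark or exclamation mark; the function name and the example text are invented for illustration.

```python
import re
from itertools import combinations


def cooccurrences(text, entities, level="sentence"):
    """Return the set of entity pairs that co-occur at the chosen level.

    level="sentence": both entities must appear in the same sentence.
    level="text":     both entities may appear anywhere in the text.
    """
    if level == "sentence":
        # Naive sentence splitter; real abstracts need something more robust.
        units = re.split(r"(?<=[.!?])\s+", text)
    else:
        units = [text]
    pairs = set()
    for unit in units:
        present = sorted(
            e for e in entities if re.search(rf"\b{re.escape(e)}\b", unit)
        )
        pairs.update(combinations(present, 2))
    return pairs


# Invented example text.
text = "STAT3 is activated by JAK2. RAC1 was also measured."
genes = ["STAT3", "JAK2", "RAC1"]
print(sorted(cooccurrences(text, genes, level="sentence")))
print(sorted(cooccurrences(text, genes, level="text")))
```

The sentence-level call returns one pair and the text-level call three: the stricter criterion loses coverage but discards the two pairs that are linked only by appearing in the same abstract.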

Sophisticated approaches try to further

improve the analysis on sentence level by

employing dictionaries and rule-based

analysis techniques.25–28 The dictionaries

contain words related to the description

of relations; the rules are designed for the

analysis of sentence or phrase structures.

These approaches lead to a better

precision of results but decrease the recall

owing to the restricted set of vocabulary

and sentence structures. While all of these

methods take the textual and sometimes

grammatical context into consideration,

none of them truly integrates the

biological context (other than what was

manually entered into the dictionaries).

An ideal system would consider biological

restrictions, eg rule out impossible

combinations of entities such as

information about bacterial genes

involved in operon structures in relation

to mammalian genes pertinent to

chromatin organisation.



Step 3: Classification of entity relations

Gene relations depend on their functional context

Relations between biological entities are

not fixed but change according to the

functional context in which an entity

applies. The biological mechanisms and

the environment in which the entity was

observed generally specify the functional

context of a biological entity.

Consequently, the description of a

functional context is usually distributed in

multiple sentences, figures and tables, and

in-depth expert knowledge is required to

decode the functional context from

publications. Text-mining systems might

support the identification of single aspects

of a functional context such as a tissue

type, but it is still impossible to

automatically elucidate the complex

dependencies between the components of

a functional context. A relation might

thus be described and correctly identified

by steps 1 and 2 from a text, but the

functional context in which the relation

was observed might not correspond to the

topic of interest. Prominent examples for

relations that change according to the

functional context are signal transduction

pathways such as the MAP kinase system,

which can trigger a number of different

transcriptional activators.29

The most common way to consider

functional context in text mining is the

introduction of structured, hand-curated

information about biological entities.

Available sources such as GO18 or

KEGG30 assign biological entities to

classes, eg of biological functions and

pathways. MESH31 assigns domains like

diseases or anatomy to publications.

Integrate independent lines of evidence

These sources can be used to establish

different lines of evidence for an entity

relation derived from text. A certain

disease can be assigned to a relation if the

paper in which the two entities were

identified has been assigned to the disease

via MESH. If both genes of an identified

relation belong to the same class in GO,

then this class is assigned to the relation.
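The two assignment rules just described translate into a few set operations. The sketch below is illustrative only: the gene names, class labels and paper annotations are placeholders rather than real GO or MESH content, and the function name `annotate_relation` is an assumption.

```python
def annotate_relation(gene_a, gene_b, gene_classes, paper_terms):
    """Attach context to a gene relation found in one paper.

    gene_classes: gene -> set of classes (eg from a GO-like source)
    paper_terms:  terms assigned to the paper (eg from a MESH-like source)
    """
    shared = gene_classes.get(gene_a, set()) & gene_classes.get(gene_b, set())
    return {
        "genes": (gene_a, gene_b),
        # Rule 1: a class is assigned only if both genes belong to it.
        "shared_classes": shared,
        # Rule 2: terms assigned to the paper carry over to the relation.
        "paper_terms": set(paper_terms),
    }


# Placeholder annotations, invented for illustration.
gene_classes = {
    "gene_1": {"class_signalling", "class_apoptosis"},
    "gene_2": {"class_signalling"},
}
relation = annotate_relation("gene_1", "gene_2", gene_classes, ["disease_X"])
print(relation)
```

Each key of the result corresponds to one line of evidence, so a relation with an empty `shared_classes` set is supported by the paper annotation alone.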

The natural internal consistency of

biological facts and findings makes such an

approach possible. True connections of

biological entities are characterised by at

least two hallmarks: they do not conflict

with each other (within the correct

context!) and they are always present on

several levels. For example, two proteins

reported to interact functionally (such as

an enzyme and its substrate) necessarily

also can be shown to interact physically

(eg in yeast two-hybrid systems). In

addition, they are necessarily co-expressed

in at least one cell type. Often such co-

expression is also evident from common

regulatory structures in the corresponding

gene promoters. In short, isolated findings

that are not supported on other levels

(genome, transcriptome, proteome,

metabolome) or are in conflict with other

findings are usually much less likely to be

true than findings consistently supported

by several independent lines of evidence.

This is a very general and a very powerful

basic biological concept that can be used

to enhance the results of any information

retrieval system.

In the following section we will discuss

the extension of the line of evidence

approach towards literature-independent

data from genomic analysis.

COMBINATION OF TEXT MINING AND GENOMIC ANALYSIS

Sources such as GO or MESH contain

specific, high-quality annotations

primarily derived from the literature by

experts. Such annotations can be

complemented by literature-independent

data, eg from laboratory or in-silico

experiments, which confirm text-mining

results and assign additional functional,

cellular or molecular context to entity

relations. It should be noted that only

independent lines of evidence provide

support. If three groups report the same

results based on very similar experiments,

this is only incremental evidence of one

line. However, if a physical interaction

deduced from a functional assay is

supported by direct demonstration of

physical interactions (protein–protein or

protein–DNA/RNA), then two lines of

evidence are established. This is much

more difficult to realise based solely on





text mining, but can readily be achieved

when text mining is combined with other

sources such as genomics or proteomics-

based sequence analyses.
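The distinction between incremental evidence and independent lines of evidence can be expressed by grouping reports by the kind of experiment behind them. This is a minimal sketch; the evidence labels and source names are hypothetical, and deciding what counts as the same kind of experiment is a curation decision that the code does not make.

```python
def independent_lines(evidence):
    """Return the distinct kinds of evidence supporting one relation.

    evidence: iterable of (kind, source) pairs. Several sources of the
    same kind count as one line, however many there are.
    """
    return {kind for kind, _source in evidence}


# Hypothetical reports for a single protein-protein relation.
evidence = [
    ("functional_assay", "group_1"),
    ("functional_assay", "group_2"),
    ("functional_assay", "group_3"),
    ("physical_interaction", "two_hybrid_screen"),
]
lines = independent_lines(evidence)
print(len(lines), sorted(lines))
```

Three similar functional reports and one physical interaction therefore count as two lines of evidence, not four.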

Below, the integration of literature-independent data is illustrated with several examples. Special emphasis is placed on the regulation of gene transcription since it defines important protein–gene relations on the molecular level.

Transcription regulation


Gene transcription is regulated in part by

nuclear factors (proteins) that recognise

short DNA sequence motifs, called

transcription factor binding sites (TFBSs).

TFBSs are in most cases located upstream

of the first exon of a transcript in so-called

promoter and enhancer regions.

Identification of TFBSs in regulatory

regions of transcripts can confirm

important relations between transcription

factors and genes and add considerably to

annotation of the biological context of

genes. As a consequence, the analysis of

transcription regulation first requires the

annotation of regulatory regions for a

gene and the identification of TFBSs in

the annotated regulatory regions.

Annotation of genes and their regulatory regions

Genes can have alternative promoters

The human genome currently is

annotated with 23,245 gene loci (NCBI

Build 34). For these loci 43,975

transcripts are known. About 45 per cent

(10,368) of the genes have alternative

transcripts ranging from 2 to 40. In

addition 6,418 of the annotated loci have

two or more promoters, ie alternative

promoters. Figure 1 summarises the

distribution of genes with alternative

transcripts.

Alternative transcripts of a gene differ

according to alternative splicing (see, eg,

gene LIPT1 and Figure 2a), alternative

termination (see, eg, gene FBLN1) or

alternative first exons (see, eg, gene

CYP19A1 and Figure 2b). This flexibility

of alternative transcripts reflects the

various biological contexts in which a

gene might be functionally involved.

Since publications in general describe

only genes and not their transcript/

regulatory regions, it is not possible to

identify the functional context by text-

mining methods. On the other hand, the

flexibility of alternative transcripts needs

to be understood to truly comprehend

disease processes, especially for

individualised diagnostics of chronic

diseases and cancer. Only this knowledge

will allow addressing the correct genetic

mechanism in the pertinent context. The

aromatase gene, which encodes the terminal

enzyme responsible for oestrogen

biosynthesis in mammals, provides a good

example to illustrate this point. Aromatase

has at least six different alternative

promoters that regulate the production of

the same gene product (exons II-X always

remain the same). Dysregulation of

aromatase promoters is found in severe

diseases, especially breast cancer.32

Aromatase in normal breast tissues is mainly regulated by promoter 1.4. In breast cancer tissues, over-activation of promoters 1.3 and 1.2 is often observed (Figure 3). This shift in promoter usage does not affect the coding region at all; all transcripts encode exactly the same protein. Therefore, full understanding of this disease mechanism and subsequent design of an effective therapy require detailed knowledge of transcription activation via alternative promoters.

Figure 1: Organisation of the human genome: 45 per cent of all genes have alternative transcripts and alternative promoters

Figure 2: Alternative splicing (a) and alternative promoters (b). Alternative transcripts occur by alternative splicing of one or several exons from a single primary transcript or by transcription starting from alternative promoters (P: promoter, E: exon)

Analysis of regulatory regions

Identification of regulatory mechanisms

by promoter analysis provides a crucial link

between the static nucleotide sequence of

the genome and the dynamic aspects of

gene regulation and expression.

Furthermore, it provides a unique way to

define functional context based on co-

regulation mechanisms that cannot be

derived by literature analysis.

Promoter modules define biological functions

Activation of transcription is triggered

by binding of transcription factors to the

promoter sequence of a transcript.

However, in mammalian systems this is

usually not achieved by individual

transcription factors but by characteristic

combinations of factors. Transcripts with

similar TFBS patterns in their promoters

are expressed in the same tissue

under similar conditions. Thus, the

organisation of promoter motifs represents

a footprint or framework of the

transcriptional regulatory mechanisms at

work in a specific biological context,

consequently providing information

about signal and tissue-specific control of

expression.

Software that allows detection and

characterisation of individual binding sites

is available from several sources, including

MatInspector,33,34 Signal Scan,35,36

MATRIX SEARCH37 or MATCH.38

A collection of functional binding sites

for high-quality prediction is derived

from the literature and included in

MatInspector.33 TRANSFAC is another

source which provides extensive

information about transcription factors

derived from literature.39,40
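Detection of individual binding sites is commonly based on a weight matrix built from known sites. The sketch below shows that general idea only. It is not the algorithm of MatInspector, MATCH or any other tool named above, and the example sites, the pseudocount, the uniform background and the threshold are assumptions chosen for illustration.

```python
import math


def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned sites of equal length."""
    matrix = []
    for pos in range(len(sites[0])):
        column = {}
        for base in "ACGT":
            count = sum(1 for site in sites if site[pos] == base)
            freq = (count + pseudocount) / (len(sites) + 4 * pseudocount)
            column[base] = math.log2(freq / background)
        matrix.append(column)
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Invented training sites and test sequence (forward strand only).
sites = ["ACGTAC", "ACGTAC", "ACGAAC"]
matrix = log_odds_matrix(sites)
for start, window, score in scan("TTACGTACTT", matrix, threshold=6.0):
    print(start, window, round(score, 2))
```

A production tool would also scan the reverse strand and calibrate the threshold per matrix; both are omitted here to keep the sketch short.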

Although binding site detection is

important in higher organisms, it is

generally not sufficient for the elucidation

of promoter function since, in more

complex systems, the functional TFBSs

within promoters are organised

hierarchically41,42 (Figure 4). This

hierarchical organisation increases the

specificity and selectivity of gene

regulation via TFBSs.43,44 Combinatorial

biology appears to be the key to

understand regulation in higher

organisms, where promoter function is

determined to a large extent by the

functional context within which the

binding sites are located.

The smallest entities on the level of

TFBS combinations that can be assigned

to a particular biological function are

called promoter modules. Promoter

modules are defined as two or more

individual elements that act in a

coordinated way (either synergistically or

antagonistically) and are arranged within a

defined distance and in sequential order

(Figure 4).44 Work to date suggests that

Figure 3: Gene structure of human aromatase (Cyp19A1). Data modified from Clyne et al.32

Figure 4: Promoters in higher eukaryotes are organised hierarchically and elements that control a specific pattern of expression may also be found in other promoters expressed under similar circumstances

promoter modules can be pathway or cell

type specific41 and, in this regard, can

mediate the transcriptional response to

specific signal transduction pathways,45,46

cell type-specific expression, and events

central to developmental regulation.47 A

given promoter module may show a

robust stimulus-specific response in one

tissue, but may not be functional in

another cell type.
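The structural part of the module definition (two elements in sequential order within a defined distance) can be checked on a list of predicted binding sites. The following sketch assumes sites on a single strand with 0-based, half-open coordinates; the factor labels `TF_A` and `TF_B`, the coordinates and the distance limits are hypothetical, and the functional part of the definition (synergistic or antagonistic action) cannot be tested by such a search.

```python
def find_modules(sites, first, second, min_gap, max_gap):
    """Find ordered pairs of sites separated by an allowed distance.

    sites: list of (factor, start, end) tuples on one strand.
    A pair qualifies if a `first` site is followed by a `second` site and
    the gap between them (second start minus first end) lies in
    [min_gap, max_gap].
    """
    modules = []
    for factor_a, start_a, end_a in sites:
        if factor_a != first:
            continue
        for factor_b, start_b, end_b in sites:
            if factor_b == second and min_gap <= start_b - end_a <= max_gap:
                modules.append(
                    ((factor_a, start_a, end_a), (factor_b, start_b, end_b))
                )
    return modules


# Hypothetical predicted sites in one promoter.
sites = [
    ("TF_A", 10, 18),
    ("TF_B", 30, 40),
    ("TF_B", 200, 210),
    ("TF_A", 50, 58),
]
print(find_modules(sites, "TF_A", "TF_B", min_gap=5, max_gap=50))
```

Only the first `TF_A` site forms a module candidate: the second `TF_B` site is too far away, and the second `TF_A` site lies downstream of the nearby `TF_B` site, which violates the required order.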

Combination of promoter analysis and literature analysis

Although the inclusion of transcription

factor–gene relations in combination

with literature mining has been realised in

the BiblioSphere PathwayEdition,34 more

sophisticated promoter analyses have not

yet been implemented within text-mining

tools. However, given the biological

relevance outlined above, this can be

expected to be added in the near future to

biological text mining.

The power of such a combinatorial

approach is illustrated with a recent

example.48 During a microarray-based

analysis of genes involved in the response

of astrocytes to expression of the HIV-1

protein Nef, a set of nine genes was

identified as relevant (BCL2L1, CDC42,

HCK, Jak2, JNK, MAPK1, RAC1,

STAT3, Vav1). Unfortunately, most of

these genes are involved in cell cycle

regulation, resulting in an extraordinarily

large body of related literature. Figure 5

illustrates the strategy and the results of

our approach. Even a co-citation

network analysis restricting networks to

genes co-cited with at least five of the

nine genes can restrict the gene list only

from the initial 2,846 to 440 genes. In

contrast automatic promoter framework

analysis of the promoters of the nine

initial genes yielded a framework

consisting of four TFBSs, present in the

promoters of three of the nine genes

(BCL2L1, HCK, RAC1). This network

selected 159 promoters out of 36,000

human annotated promoters. The

molecular evidence (159) was then

crossed with the literature around the

nine initial genes (2,846), which resulted

in a network of 18 genes where all

connections could be verified as

functional. All of these 18 genes are thus

directly relevant for the initial query, the

response of astrocytes to the viral Nef

protein of HIV-1.
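The two filtering strategies compared in this example reduce to a threshold on co-citation and an intersection of gene sets. The sketch below uses invented identifiers (`S1` to `S3` for seed genes, `G1` to `G4` for candidates) and hypothetical function names; it reproduces the logic of the comparison, not the data or the software behind the published analysis.

```python
def cocited_with_at_least(cocitations, seeds, minimum):
    """Genes co-cited with at least `minimum` of the seed genes.

    cocitations: candidate gene -> set of seed genes it is co-cited with.
    """
    seeds = set(seeds)
    return {
        gene
        for gene, partners in cocitations.items()
        if gene not in seeds and len(set(partners) & seeds) >= minimum
    }


def cross_evidence(literature_genes, promoter_genes):
    """Genes supported by both the literature and the promoter analysis."""
    return set(literature_genes) & set(promoter_genes)


# Invented toy data.
seeds = ["S1", "S2", "S3"]
cocitations = {"G1": {"S1", "S2"}, "G2": {"S1"}, "G3": {"S1", "S2", "S3"}}
promoter_hits = {"G3", "G4"}  # genes whose promoters match the framework

all_literature = cocited_with_at_least(cocitations, seeds, minimum=1)
strict_literature = cocited_with_at_least(cocitations, seeds, minimum=2)
print(sorted(strict_literature))
print(sorted(cross_evidence(all_literature, promoter_hits)))
```

Raising the co-citation threshold shrinks the list using literature alone, whereas the intersection keeps only genes that are supported by a second, literature-independent line of evidence.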

In summary, promoter analysis

provides information on transcription

factor–gene relations on the molecular

level independent of literature data.

Relations between transcription factors

and genes can be associated in a context-

dependent manner. Moreover,

information on alternative promoters and

transcripts provides a detailed view on

different biological contexts in which a

gene can function. While this will clearly

reduce the recall from the literature, it

can dramatically increase the context-

dependent precision, which is the most

important parameter for the usefulness of

data mining in general.

Further approaches to integrate literature-independent data

Another approach to integrating experimental, non-literature-based information on relations between biological entities is to include information about protein–protein interactions. A number of databases contain such information, derived from experiments such as yeast or mammalian two-hybrid assays. An example is the DIP database.49 Here, information from a variety of sources is combined to create a single, consistent set of protein–protein interactions. The text-mining system Chilibot50 uses this source to integrate additional relations into its literature-mining results.

Figure 5: Scheme illustrating the combination of literature mining and promoter analysis, using the example of a group of nine genes initially identified in a microarray-based study of astrocyte response to the HIV-1 Nef gene48

Four approaches to characterise gene relations

Literature-mining systems today often

are used for the interpretation of

expression array data. However, the data

from expression arrays also provide gene

relations, defined by genes with similar

expression profiles under defined

experimental conditions (cell type,

treatment, etc). This source is not directly

correlated with gene regulation analysis

since co-expressed genes are not

necessarily co-regulated. Moreover,

effects from different biological

mechanisms such as post-transcriptional

regulation via microRNA, RNA stability,

etc are cumulated in the expression signal.

Gene Expression Omnibus51,52 offers a

large collection of documented results

from expression array experiments that

might be integrated in literature mining

systems in the future.
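Gene relations defined by similar expression profiles are usually derived from a similarity measure between profiles. The following sketch uses the Pearson correlation coefficient as one common choice; the measure, the threshold of 0.9, the function names and the three profiles are assumptions for illustration and are not prescribed by the text or by Gene Expression Omnibus.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return (gene_a, gene_b, r) for all pairs with correlation >= threshold."""
    pairs = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_a], profiles[gene_b])
        if r >= threshold:
            pairs.append((gene_a, gene_b, r))
    return pairs


# Invented profiles over four experimental conditions.
profiles = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.1, 3.9, 6.2, 8.0],
    "gene_3": [4.0, 3.0, 2.0, 1.0],
}
for gene_a, gene_b, r in coexpressed_pairs(profiles):
    print(gene_a, gene_b, round(r, 3))
```

As noted above, a pair reported here is co-expressed but not necessarily co-regulated, so such pairs form a separate line of evidence rather than a substitute for promoter analysis.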

CONCLUSION

Relations between biological entities are

conditional and may change when the

same genes are considered in a different

functional context. As a consequence,

every relation between entities must be

qualified with the functional context in

which the relation was observed.

Moreover, it is impossible to make

general statements whether a relation

detected by literature mining is a true or

a false relation without considering the

observed context.

This context-dependency of relations

also precludes any quantitative

comparison of the content of the various

databases underlying the discussed

methods. Pure numbers cannot answer

the crucial question how well such

relations are qualified with respect to their

biological context. It is safe to assume that

all of the methods have ample basic

information to build on. The only

external indicator we could identify at

least in a qualitative manner is the

assessment of how many different lines of

evidence are combined by the systems.

Although the quality of the final results

still depends crucially on how such

integration procedures are implemented,

the concept of multiple lines of evidence

at least allows for using the principle of

biological consistency discussed above.

There is no doubt that text-mining

methods are powerful tools to further

understand biological principles by

problem-oriented preselection of

publications about biological entities and

their relations. The main challenge in text

mining remains coping with free

unstructured text and its individual

properties, which are characterised by

Zipf's law. Text mining in molecular

biology and biomedicine is complicated

by multiple layers of problems. They

range from the identification of biological

entities over disambiguation (synonyms

and homonyms) and identification of

relations all the way to the interpretation

of the functional context. Numerous

approaches in the field of text mining

show that the identification and

disambiguation of biological entities

already give remarkable results in

precision and coverage, while the analysis

of sentences and text to discover relations

or biological concepts is still a challenge.

Fortunately, the connection of

molecular biology and biomedical

literature to biology not only complicates

the task, but also offers several

opportunities unique to the field. The

biological consistency also includes the

interaction of the cellular transcriptional

and translational machinery with the

genome and the transcriptome. Since we

do have access to several genome

sequences as well as considerable parts of

the transcriptomes (via cDNA/expressed

sequence tag approaches), biological

knowledge mining is no longer restricted

Further independent lines of evidence

to literature alone. Every phenomenon

described in the literature necessarily has a

molecular foundation within the genomic

sequence. Although our current

knowledge does not allow understanding

all of these molecular correlations, a great

deal of information can already be derived

from genomic sequences, especially about

transcriptional regulation as detailed

above. Moreover, as the biological

principles governing gene regulation seem

to be very general, it becomes possible to

use sequence analyses not only to confirm

knowledge from the literature, but also to

derive new relationships beyond current

knowledge.

Focusing on the aspect of confirming

results from text mining by other

biological data, the following picture

emerges. Starting from the co-occurrence

of biological entities in a text, four

approaches can be identified from the

available applications to confirm, classify

or discard the relation:

• In-depth analysis on sentence or phrase level. Approaches range from

the application of general rules

such as the co-occurrence of entities in

the same sentence to

in-depth analysis by syntactic and

semantic parsers. The results generally

show decreased coverage but

increased precision.

• Hand annotation of preselected sentences by curators. This approach is

applicable independently of the

scientific topic, but it is slow.

• Integration of hand-curated, structured data sources on gene classes

and text annotations.

• Integration of experimental results, either from laboratory or in-silico

analyses.
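
The trade-off named for the first approach (lower coverage, higher precision) can be illustrated with a small sketch that counts entity pairs at abstract level and at sentence level. The entity names, the example text, the naive sentence splitter and all function names are assumptions made for illustration; real systems use curated lexicons and proper parsers.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def entities_in(text, lexicon):
    """Return the lexicon names that occur as whole words (case-insensitive)."""
    found = set()
    for name in lexicon:
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            found.add(name)
    return found


def cooccurrences(text, lexicon, level="abstract"):
    """Collect entity pairs that share a text unit (whole abstract or sentence)."""
    units = [text] if level == "abstract" else split_sentences(text)
    pairs = set()
    for unit in units:
        names = sorted(entities_in(unit, lexicon))
        pairs.update(combinations(names, 2))
    return pairs


# Hypothetical abstract and lexicon.
abstract = "geneA activates geneB in monocytes. geneC was measured as a control."
lexicon = {"geneA", "geneB", "geneC"}

abstract_pairs = cooccurrences(abstract, lexicon, level="abstract")
sentence_pairs = cooccurrences(abstract, lexicon, level="sentence")
print(len(abstract_pairs), "pairs at abstract level")
print(len(sentence_pairs), "pairs at sentence level")
```

In this toy input the abstract-level count contains all three possible pairs, whereas the sentence-level count keeps only the geneA/geneB pair, dropping the two pairs that involve the control gene.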

While the first three methods are

mainly text-driven, the integration of

results from experiments and in-silico

analyses introduces literature-independent

data. These data provide an independent

means of defining the functional context in which to evaluate

relations between biological entities. It becomes

increasingly clear that text mining will be

only one tool for information retrieval

and management in biomedical research.

Only the combination with other

methods and information sources will lead

to the best possible structuring and

compilation of biological knowledge.

This does not come as a surprise because

the whole concept of systems biology is

based on this notion. The most immediate

consequence is that text mining of

biomedical literature cannot be

outsourced from biology to other

disciplines but has to be carried out in

tight interaction with biologists.




