The next generation of literature analysis: Integration of genomic analysis into text mining
Matthias Scherf, PhD
joined Genomatix Software
GmbH in 2000, where he is
Head of Discovery. He did his
postdoctoral work in the group
of Dr Werner at the GSF
where he developed the first
specific approach for genome
wide promoter prediction in
mammalian genomes. He has
over 15 years of experience in
pattern recognition, artificial
intelligence and medicine.
Anton Epple
received a Masters Degree in
Biology from LMU Munich and
completed postgraduate
studies in Computer Science at
the Technische Universität
München in 2001. His research
interests include the design of
software for natural language
processing, and in particular
information extraction
techniques for systems biology.
He is currently a scientist at
Genomatix GmbH.
Thomas Werner, PhD
is CEO and CSO of Genomatix
Software GmbH. Since 1998 he
has been a full-time
bioinformatics researcher at
the GSF-National Research
Centre for Environment and
Health in Neuherberg,
Germany, focusing on the
analysis of genomic sequences
with special emphasis on aspects
of the regulation of
transcription. He founded
Genomatix Software GmbH in
1997 and it has rapidly
developed a unique expertise
and advanced software for
genomic research.
Keywords: literature/text mining, gene regulation, promoter analysis, integrated analysis
Matthias Scherf,
Genomatix Software GmbH,
Landsberger Strasse 6,
Munich, D-80339, Germany
Tel: 49 89 5997660
Fax: 49 89 59976655
E-mail: [email protected]
M. Scherf, A. Epple and T. Werner
Date received (in revised form): 27th May 2005
Abstract
Text-mining systems are indispensable tools to reduce the increasing flux of information in
scientific literature to topics pertinent to a particular interest in focus. Most of the scientific
literature is published as unstructured free text, complicating the development of data
processing tools, which rely on structured information. To overcome the problems of free
text analysis, structured, hand-curated information derived from literature is integrated in
text-mining systems to improve precision and recall. In this paper several text-mining
approaches are reviewed and the next step in development of text-mining systems, which is
based on a concept of multiple lines of evidence, is described: results from literature analysis
are combined with evidence from experiments and genome analysis to improve the accuracy
of results and to generate additional knowledge beyond what is known solely from literature.
INTRODUCTION
The annual worldwide production of
information in publications is estimated to
be 8 terabytes in books, 25 terabytes in
newspapers, 20 terabytes in magazines and
2 terabytes in journals.1 It would take five
years to read the new scientific material
that is produced every day. This rapid
growth of information is observed in the
field of biomedicine as well: over 15
million entries maintained by the
National Library of Medicine2 are
available today in MedLine, which is the
primary source of free-text information
on the biomedical literature. Thousands
of new entries are added every day.
Consequently, information processing
systems must be applied to restrict the
available information to that fraction
which is pertinent to a particular topic or
more precisely even to a particular
context within a topic. A crucial
requirement for such systems is the ability
to analyse and extract information from
unstructured text. The systems available
today put their main focus on the analysis
of abstracts from scientific papers; abstracts
summarise the results of the scientific
work in a compact way and are the
predominant source available in electronic
form.
The challenge that must be addressed in
developing systems for the analysis of text
databases is rooted in Zipf's law.3 It
states that a few instances (words) cover
most of the text, while most instances
appear very seldom. Our own findings
confirm this statement. In more than 15
million abstracts contained in MedLine,
more than 40 per cent of the words occur
only once. This number illustrates the
lack of standards in unstructured text.
Obviously, even text on the same or
similar topics is not similar with respect to
the wording. As a consequence, general
rules are difficult to set up for the
classification of a text, eg by topic.
Information processing methods,
however, are based on finite sets of well-
defined instructions to accomplish a
certain task. Thus, the development of
algorithms to analyse unstructured text
remains challenging.
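The word statistics discussed above can be reproduced on any text collection. The following is a minimal sketch, assuming a naive tokeniser and two invented toy abstracts; it is not the procedure used for the MedLine figures quoted here.

```python
from collections import Counter
import re


def word_frequencies(texts):
    """Count lower-cased alphanumeric tokens across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def hapax_fraction(counts):
    """Fraction of distinct words that occur exactly once."""
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


def coverage_of_top(counts, k):
    """Share of all tokens covered by the k most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Toy input (invented for illustration only).
abstracts = [
    "the gene regulates the promoter",
    "the protein binds the promoter of the gene",
]
counts = word_frequencies(abstracts)
print(hapax_fraction(counts), coverage_of_top(counts, 1))
```

Even in this tiny example a single word covers a large share of the tokens while more than half of the distinct words occur only once, which is the pattern described by Zipf's law.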
© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 3. 287–297. SEPTEMBER 2005
The authors are not aware of any
generally applicable approach in the field
of biomedicine and molecular biology
where the number of relevant documents
found by text mining reaches the number
of detected documents (100 per cent
precision) and the number of relevant
documents detected reaches the total
number of relevant documents (100 per
cent recall). However, many different
approaches exist, focusing on specific tasks
and knowledge domains.
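The two measures defined in this paragraph can be written down directly. The sketch below assumes that documents are represented by identifiers and that the set of truly relevant documents is known; the identifiers are invented.

```python
def precision_recall(detected, relevant):
    """Return (precision, recall) for a set of detected documents.

    precision = relevant documents found / documents detected
    recall    = relevant documents found / all relevant documents
    """
    detected, relevant = set(detected), set(relevant)
    hits = detected & relevant
    precision = len(hits) / len(detected) if detected else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers.
detected = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc3", "doc5"}
p, r = precision_recall(detected, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four detected documents are relevant (precision 0.5), and two of the three relevant documents are detected (recall about 0.67).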
The next section starts with a brief
overview of the different tasks and
challenges in text mining of biomedical
literature. Our focus will be on the tasks
of identification, description and
classification of relations between
biological entities from free text. The
methods available for this task are
described. Next, we explain the basic biological
concepts that are used to improve and
classify text analysis results by
integrating content from structured
information resources derived from the
literature by human experts.
For further reading, the reader is referred
to Blaschke et al.,4 Shatkay and Feldman,5
Hirschman et al.,6 Dickman,7 De Bruijn
and Martin,8 Grivell,9 Andrade and
Bork10 and Schulze-Kremer11 for in-
depth reviews on methods in text mining
and natural language processing in the
domain of biomedicine and molecular
biology.
The subsequent sections will illustrate
how the principle of data integration in
text-mining systems can be extended to
genome analysis. Our focus will be on the
biological process of transcription
regulation as one example of genome
analysis resulting in gene relations.
TEXT MINING IN MOLECULAR BIOLOGY AND BIOMEDICINE
Text mining deals with the analysis of text
and the extraction of information. One
very common task of text mining in the
field of biomedicine and molecular
biology is to identify and analyse relations
between biological entities such as genes,
proteins or diseases from free text. The
process of text mining for this task involves
the following steps:
Identification of biological entities.
Identification of entity relations.
Classification of entity relations.
Every step of the text-mining process can
be addressed with several different
methods. Consequently, a large variety of
methods can be combined to solve the
various aspects of literature mining. The
combinatorial possibilities are also
reflected in the number of currently
available tools for literature analysis.
However, all of these tools address more
or less the same tasks of identification and
analysis of gene relations.
Since the strengths and weaknesses of
the different tools differ according to the
user queries, an objective and fair
comparison is hard to achieve on the level
of integrated tools. Thus we will focus on
the underlying methods which can be
applied for the three steps described above
to offer the reader the basis for a method-
oriented way of evaluation.
Step 1: Identification of biological entities
Biomedical literature contains a special
category of entities that refer to gene and
protein names, chemical compounds,
diseases, tissues, cellular components or
other predefined biological concepts.
Therefore, identification of such
biological entities in text is a crucial first
step and essential for any subsequent
analysis.
The major challenge in entity
identification is the synonym/homonym
problem: biological entities such as genes
not only have different names, ie
synonyms (eg CSEN, DREAM,
KCHIP3, MGC18289 and KCNIP3)
with different typographical variants (eg
Kcnip3, KCNIP-3 and KCNIP 3), but the
names can also be ambiguous. An
example of ambiguity is the abbreviation
NRL, which is used for natural rubber
latex as well as for the neural retina leucine
zipper gene.
Precision and recall are crucial measurements for literature analysis
The synonym/homonym problem
Owing to the synonym/homonym
problem, the task of entity recognition
requires an identification and
disambiguation step. Identification
strategies range from methods that use ad
hoc rules about typical syntactic structures
of entity identifiers12 to algorithms that
search identifiers of a given dictionary
with exact and inexact pattern matching
methods.13 Principally, dictionaries are
used in combination with pattern
recognition approaches. The dictionaries
are based on publicly available sources of
standardised, structured data annotated by
human experts. Examples of such sources are
the HUGO Gene Nomenclature
Committee,14,15 sequence annotation
databases such as LocusLink,16 protein
databases such as Swiss-Prot,17 the Gene
Ontology (GO)18 or, for medical and pathological
terms, the UMLS.19
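A dictionary-based identification step with tolerance for typographical variants can be sketched as follows. The dictionary holds only the synonym example given above; the normalisation rule, the token pattern and all function names are assumptions made for illustration and do not reproduce any of the cited methods.

```python
import re

# Mini-dictionary built from the synonym example in the text:
# canonical symbol -> known names.
SYNONYMS = {"KCNIP3": ["CSEN", "DREAM", "KCHIP3", "MGC18289", "KCNIP3"]}


def normalise(token):
    """Collapse typographical variants: ignore case, hyphens and spaces."""
    return re.sub(r"[\s\-]+", "", token).upper()


def build_index(synonyms):
    """Map every normalised name to its canonical symbol."""
    index = {}
    for canonical, names in synonyms.items():
        for name in names:
            index[normalise(name)] = canonical
    return index


def find_entities(text, index):
    """Return (matched text, canonical symbol) pairs found in the text."""
    found = []
    # A candidate is a word, optionally followed by '-' or ' ' and digits.
    for match in re.finditer(r"[A-Za-z][A-Za-z0-9]*(?:[- ]\d+)?", text):
        key = normalise(match.group())
        if key in index:
            found.append((match.group(), index[key]))
    return found


index = build_index(SYNONYMS)
print(find_entities("Kcnip3, KCNIP-3, KCNIP 3 and DREAM were tested.", index))
```

Because matching ignores case, the ordinary English word "dream" would also be reported as the gene, which is exactly the kind of ambiguity the disambiguation step has to resolve.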
Although a dictionary-based search
allows a robust algorithm-based
identification of entities, it is naturally
restricted to the terms included in the
dictionary. The disambiguation step
implies a classification method deciding
whether the text where the entity has
been identified refers to the expected
topic. A large variety of methods, ranging
from machine learning,20 support vector
machines21 and hidden Markov models22
to Bayesian learning and decision
trees,23,24 have been applied to this task.
It is difficult to compare methods for
identification and disambiguation, since
the published methods generally focus on
different kinds of biological entities and
are often trained and tested on preselected
text sets from certain domains. For
example, the authors of the PROPER
method12 reported over 90 per cent in
precision and recall but calculated these
values based on only 30 abstracts of papers
about the SH3 domain.
Regardless of the different methods
applied, most authors agree that
interaction with experts is indispensable
to obtain good results in identification
and disambiguation. These experts define
the dictionaries for entity identifiers and
create simple, generally applicable, rules
for disambiguation. Thus, the
performance of the various methods depends
to a significant degree on the excellence
of the biological experts involved rather
than on the algorithm employed to
encode and use the expert knowledge.
Step 2: Identification of relations between entities
The next goal after identification and
disambiguation of biological entities is the
detection of relations between the entities
as the text describes them. The most
straightforward approach for this task is to
assume relations between entities based on
co-occurrence in a text. The probability
of an established relation between entities
depends to some extent on the location of
the entities within a text. The weakest
evidence for a relation is the co-occurrence
of entities anywhere in a text.
If two entities co-occur within the same
sentence, a true relation becomes more
likely, while the coverage might decrease
simultaneously.
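The difference between co-occurrence anywhere in a text and co-occurrence within one sentence can be illustrated with a short sketch. It assumes that entity names have already been identified and uses a deliberately naive sentence splitter; the example abstract is invented.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrences(text, entities):
    """Return (pairs co-occurring in the text, pairs co-occurring in a sentence)."""
    def present(chunk):
        return {e for e in entities if re.search(rf"\b{re.escape(e)}\b", chunk)}

    doc_pairs = set(combinations(sorted(present(text)), 2))
    sent_pairs = set()
    for sentence in split_sentences(text):
        sent_pairs |= set(combinations(sorted(present(sentence)), 2))
    return doc_pairs, sent_pairs


# Invented example text.
abstract = "STAT3 is activated by JAK2. RAC1 was also measured."
doc_pairs, sent_pairs = cooccurrences(abstract, ["STAT3", "JAK2", "RAC1"])
print(sorted(doc_pairs))
print(sorted(sent_pairs))
```

The text-level criterion proposes three pairs, whereas the sentence-level criterion keeps only the JAK2/STAT3 pair: fewer candidate relations, but each one more likely to be meaningful.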
Sophisticated approaches try to further
improve the analysis on sentence level by
employing dictionaries and rule-based
analysis techniques.25–28 The dictionaries
contain words related to the description
of relations; the rules are designed for the
analysis of sentence or phrase structures.
These approaches lead to a better
precision of results but decrease the recall
owing to the restricted set of vocabulary
and sentence structures. While all of these
methods take the textual and sometimes
grammatical context into consideration,
none of them truly integrates the
biological context (other than what was
manually entered into the dictionaries).
An ideal system would consider biological
restrictions, eg rule out impossible
combinations of entities, such as relating
bacterial genes
involved in operon structures
to mammalian genes pertinent to
chromatin organisation.
Step 3: Classification of entity relations
Gene relations depend on their functional context
Relations between biological entities are
not fixed but change according to the
functional context in which an entity
applies. The biological mechanisms and
the environment in which the entity was
observed generally specify the functional
context of a biological entity.
Consequently, the description of a
functional context is usually distributed in
multiple sentences, figures and tables, and
in-depth expert knowledge is required to
decode the functional context from
publications. Text-mining systems might
support the identification of single aspects
of a functional context such as a tissue
type, but it is still impossible to
automatically elucidate the complex
dependencies between the components of
a functional context. A relation might
thus be described and correctly identified
by steps 1 and 2 from a text, but the
functional context in which the relation
was observed might not correspond to the
topic of interest. Prominent examples of
relations that change according to the
functional context are signal transduction
pathways such as the MAP kinase system,
which can trigger a number of different
transcriptional activators.29
The most common way to consider
functional context in text mining is the
introduction of structured, hand-curated
information about biological entities.
Available sources such as GO18 or
KEGG30 assign biological entities to
classes, eg of biological functions and
pathways. MESH31 assigns domains like
diseases or anatomy to publications.
Integrate independent lines of evidence
These sources can be used to establish
different lines of evidence for an entity
relation derived from text. A certain
disease can be assigned to a relation if the
paper in which the two entities were
identified has been assigned to the disease
via MESH. If both genes of an identified
relation belong to the same class in GO,
then this class is assigned to the relation.
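The two assignment rules just described can be expressed as a small function. This is a sketch under the assumption that GO classes are available per gene and MESH headings per publication; the data structures, gene names, publication identifier and class labels are placeholders rather than real annotations.

```python
def annotate_relation(gene_a, gene_b, pmid, go_classes, mesh_terms):
    """Attach curated context to a relation found in one publication.

    go_classes: dict mapping gene -> set of GO classes (hypothetical structure)
    mesh_terms: dict mapping publication id -> set of MESH headings
    """
    # Rule 1: classes shared by both genes are assigned to the relation.
    shared_go = go_classes.get(gene_a, set()) & go_classes.get(gene_b, set())
    # Rule 2: headings assigned to the publication are assigned to the relation.
    return {
        "genes": (gene_a, gene_b),
        "go": shared_go,
        "mesh": set(mesh_terms.get(pmid, set())),
    }


# Placeholder annotations, invented for illustration.
go_classes = {
    "GENE_A": {"class_signalling", "class_apoptosis"},
    "GENE_B": {"class_apoptosis"},
}
mesh_terms = {"PUB_1": {"disease_X"}}

relation = annotate_relation("GENE_A", "GENE_B", "PUB_1", go_classes, mesh_terms)
print(relation)
```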
The natural internal consistency of
biological facts and findings makes such an
approach possible. True connections of
biological entities are characterised by at
least two hallmarks: they do not conflict
with each other (within the correct
context!) and they are always present on
several levels. For example, two proteins
reported to interact functionally (such as
an enzyme and its substrate) can necessarily
also be shown to interact physically
(eg in yeast two-hybrid systems). In
addition, they are necessarily co-expressed
in at least one cell type. Often such co-
expression is also evident from common
regulatory structures in the corresponding
gene promoters. In short, isolated findings
that are not supported on other levels
(genome, transcriptome, proteome,
metabolome) or are in conflict with other
findings are usually much less likely to be
true than findings consistently supported
by several independent lines of evidence.
This is a very general and a very powerful
basic biological concept that can be used
to enhance the results of any information
retrieval system.
In the following section we will discuss
the extension of the line of evidence
approach towards literature-independent
data from genomic analysis.
COMBINATION OF TEXT MINING AND GENOMIC ANALYSIS
Sources such as GO or MESH contain
specific, high-quality annotations
primarily derived from the literature by
experts. Such annotations can be
complemented by literature-independent
data, eg from laboratory or in-silico
experiments, which confirm text-mining
results and assign additional functional,
cellular or molecular context to entity
relations. It should be noted that only
independent lines of evidence provide
support. If three groups report the same
results based on very similar experiments,
this is only incremental evidence of one
line. However, if a physical interaction
deduced from a functional assay is
supported by direct demonstration of
physical interactions (protein–protein or
protein–DNA/RNA), then two lines of
evidence are established. This is much
more difficult to realise based solely on
text mining, but can readily be achieved
when text mining is combined with other
sources such as genomics or proteomics-
based sequence analyses.
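The distinction between incremental evidence within one line and truly independent lines can be made explicit by counting distinct evidence types rather than reports. The following sketch assumes that each observation is labelled with an evidence type; the labels, relation names and source names are invented.

```python
from collections import defaultdict


def lines_of_evidence(observations):
    """Count independent lines of evidence per relation.

    observations: iterable of (relation, evidence_type, source) tuples.
    Several reports of the same evidence type count as a single line.
    """
    by_relation = defaultdict(set)
    for relation, evidence_type, _source in observations:
        by_relation[relation].add(evidence_type)
    return {relation: len(types) for relation, types in by_relation.items()}


# Invented observations: three groups report the same kind of experiment for
# A-B, while C-D is supported by two different kinds of evidence.
observations = [
    (("A", "B"), "functional_assay", "group_1"),
    (("A", "B"), "functional_assay", "group_2"),
    (("A", "B"), "functional_assay", "group_3"),
    (("C", "D"), "functional_assay", "group_1"),
    (("C", "D"), "physical_interaction", "group_4"),
]
print(lines_of_evidence(observations))
```

The relation A-B is reported three times but rests on one line of evidence, whereas C-D rests on two.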
Below, the integration of literature-
independent data is illustrated with several
examples. Special emphasis is placed on
the regulation of gene transcription since
it defines important protein–gene
relations at the molecular level.
Transcription regulation
Gene transcription is regulated in part by
nuclear factors (proteins) that recognise
short DNA sequence motifs, called
transcription factor binding sites (TFBSs).
TFBSs are in most cases located upstream
of the first exon of a transcript in so-called
promoter and enhancer regions.
Identification of TFBSs in regulatory
regions of transcripts can confirm
important relations between transcription
factors and genes and add considerably to
annotation of the biological context of
genes. As a consequence, the analysis of
transcription regulation first requires the
annotation of regulatory regions for a
gene and the identification of TFBSs in
the annotated regulatory regions.
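Identification of TFBSs in an annotated regulatory region is commonly done by scoring sequence windows against a weight matrix. The sketch below shows the general idea with a log-odds matrix; the count matrix, the pseudocount, the uniform background and the threshold are all assumptions chosen for illustration, only one strand is scanned, and the code does not reproduce the algorithm of any particular tool.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at or above the threshold."""
    hits = []
    width = len(matrix)
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical count matrix for a 4-bp toy motif with consensus TATA.
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
matrix = log_odds_matrix(COUNTS)
max_score = sum(max(column.values()) for column in matrix)
hits = scan("GGCTATAGGC", matrix, threshold=0.8 * max_score)
print(hits)
```

Expressing the threshold as a fraction of the maximum attainable score keeps it comparable across matrices of different width; this is a design choice of the sketch, not a recommendation from the cited software.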
Annotation of genes and their regulatory regions
Genes can have alternative promoters
The human genome currently is
annotated with 23,245 gene loci (NCBI
Build 34). For these loci 43,975
transcripts are known. About 45 per cent
(10,368) of the genes have between 2 and 40
alternative transcripts. In
addition, 6,418 of the annotated loci have
two or more promoters, ie alternative
promoters. Figure 1 summarises the
distribution of genes with alternative
transcripts.
Alternative transcripts of a gene differ
according to alternative splicing (see, eg,
gene LIPT1 and Figure 2a), alternative
termination (see, eg, gene FBLN1) or
alternative first exons (see, eg, gene
CYP19A1 and Figure 2b). This flexibility
of alternative transcripts reflects the
various biological contexts in which a
gene might be functionally involved.
Since publications in general describe
only genes and not their transcript/
regulatory regions, it is not possible to
identify the functional context by text-
mining methods. On the other hand, the
flexibility of alternative transcripts needs
to be understood to truly comprehend
disease processes, especially for
individualised diagnostics of chronic
diseases and cancer. Only this knowledge
will allow addressing the correct genetic
mechanism in the pertinent context. The
aromatase gene, which encodes the terminal
enzyme responsible for oestrogen
biosynthesis in mammals, provides a good
example to illustrate this point. Aromatase
has at least six different alternative
promoters that regulate the production of
the same gene product (exons II–X always
remain the same). Dysregulation of
aromatase promoters is found in severe
diseases, especially breast cancer.32
Aromatase in normal breast tissues is
mainly regulated by promoter 1.4. In
breast cancer tissues, over-activation of
promoter 1.3 and 1.2 is often observed
(Figure 3). This shift in promoter usage
does not affect the coding region at all; all
transcripts encode the exact same protein.
Therefore, full understanding of this
disease mechanism and subsequent design
of an effective therapy requires detailed
knowledge of transcription activation via
alternative promoters.
Figure 1: Organisation of the human genome: 45 per cent of all genes have alternative transcripts and alternative promoters
Figure 2: Alternative splicing (a) and alternative promoters (b). Alternative transcripts occur by alternative splicing of one or several exons from a single primary transcript or by transcription starting from alternative promoters (P: promoter, E: exon)
Analysis of regulatory regions
Identification of regulatory mechanisms
by promoter analysis provides a crucial link
between the static nucleotide sequence of
the genome and the dynamic aspects of
gene regulation and expression.
Furthermore, it provides a unique way to
define functional context based on co-
regulation mechanisms that cannot be
derived by literature analysis.
Promoter modules define biological functions
Activation of transcription is triggered
by binding of transcription factors to the
promoter sequence of a transcript.
However, in mammalian systems this is
usually not achieved by individual
transcription factors but by characteristic
combinations of factors. Similar TFBS
patterns are found within the promoters of
transcripts that are expressed in the same tissue
under similar conditions. Thus, the
organisation of promoter motifs represents
a footprint or framework of the
transcriptional regulatory mechanisms at
work in a specific biological context,
consequently providing information
about signal and tissue-specific control of
expression.
Software that allows detection and
characterisation of individual binding sites
is available from several sources, including
MatInspector,33,34 Signal Scan,35,36
MATRIX SEARCH37 or MATCH.38
A collection of functional binding sites
for high-quality prediction is derived
from the literature and included in
MatInspector.33 TRANSFAC is another
source which provides extensive
information about transcription factors
derived from literature.39,40
Although binding site detection is
important in higher organisms, it is
generally not sufficient for the elucidation
of promoter function since, in more
complex systems, the functional TFBSs
within promoters are organised
hierarchically41,42 (Figure 4). This
hierarchical organisation increases the
specificity and selectivity of gene
regulation via TFBSs.43,44 Combinatorial
biology appears to be the key to
understand regulation in higher
organisms, where promoter function is
determined to a large extent by the
functional context within which the
binding sites are located.
The smallest entities on the level of
TFBS combinations that can be assigned
to a particular biological function are
called promoter modules. Promoter
modules are defined as two or more
individual elements that act in a
coordinated way (either synergistically or
antagonistically) and are arranged within a
defined distance and in sequential order
(Figure 4).44 Work to date suggests that
promoter modules can be pathway or cell
type specific41 and, in this regard, can
mediate the transcriptional response to
specific signal transduction pathways,45,46
cell type-specific expression, and events
central to developmental regulation.47 A
given promoter module may show a
robust stimulus-specific response in one
tissue, but may not be functional in
another cell type.
Figure 3: Gene structure of human aromatase (Cyp19A1). Data modified from Clyne et al.32
Figure 4: Promoters in higher eukaryotes are organised hierarchically and elements that control a specific pattern of expression may also be found in other promoters expressed under similar circumstances
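The structural part of the promoter module definition (two elements in a fixed sequential order within a defined distance) can be checked mechanically once individual binding sites have been predicted. The sketch below assumes sites are given as factor name and position on one strand; the factor names, positions and distance limits are hypothetical, and whether the elements actually act in a coordinated way cannot be decided from positions alone.

```python
def find_modules(sites, first, second, min_dist, max_dist):
    """Find ordered site pairs that satisfy a two-element module definition.

    sites: list of (factor_name, position) tuples for one promoter strand.
    A pair qualifies when a `first` site lies upstream of a `second` site and
    the distance between them falls within [min_dist, max_dist].
    """
    firsts = sorted(pos for name, pos in sites if name == first)
    seconds = sorted(pos for name, pos in sites if name == second)
    return [
        (a, b)
        for a in firsts
        for b in seconds
        if min_dist <= b - a <= max_dist
    ]


# Hypothetical predicted binding sites in one promoter.
sites = [("TF_A", 120), ("TF_B", 150), ("TF_B", 40), ("TF_A", 400)]
modules = find_modules(sites, "TF_A", "TF_B", min_dist=10, max_dist=50)
print(modules)
```

Only the pair at positions 120 and 150 qualifies; the same two factors in reverse order or at a larger distance do not form the module.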
Combination of promoter analysis and literature analysis
Although the inclusion of transcription
factor–gene relations in combination
with literature mining has been realised in
the BiblioSphere PathwayEdition,34 more
sophisticated promoter analyses have not
yet been implemented within text-mining
tools. However, given the biological
relevance outlined above, this can be
expected to be added in the near future to
biological text mining.
The power of such a combinatorial
approach is illustrated with a recent
example.48 During a microarray-based
analysis of genes involved in the response
of astrocytes to expression of the HIV-1
protein Nef, a set of nine genes was
identified as relevant (BCL2L1, CDC42,
HCK, Jak2, JNK, MAPK1, RAC1,
STAT3, Vav1). Unfortunately, most of
these genes are involved in cell cycle
regulation, resulting in an extraordinarily
large body of related literature. Figure 5
illustrates the strategy and the results of
our approach. Even a co-citation
network analysis restricting networks to
genes co-cited with at least five of the
nine genes can restrict the gene list only
from the initial 2,846 to 440 genes. In
contrast, automatic promoter framework
analysis of the promoters of the nine
initial genes yielded a framework
consisting of four TFBSs, present in the
promoters of three of the nine genes
(BCL2L1, HCK, RAC1). This framework
selected 159 promoters out of 36,000
human annotated promoters. The
molecular evidence (159) was then
crossed with the literature around the
nine initial genes (2,846), which resulted
in a network of 18 genes where all
connections could be verified as
functional. All of these 18 genes are thus
directly relevant for the initial query, the
response of astrocytes to the viral Nef
protein of HIV-1.
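The two filtering strategies compared in this example, restriction by co-citation and crossing literature evidence with promoter evidence, reduce to simple set operations. The sketch below uses invented placeholder gene names and a hypothetical co-citation table; it illustrates the logic only and does not reproduce the published analysis or its numbers.

```python
def cocited_with_at_least(cocitations, seeds, k):
    """Genes (other than the seeds) co-cited with at least k of the seed genes.

    cocitations: dict mapping gene -> set of genes it is co-cited with.
    """
    seeds = set(seeds)
    return {
        gene
        for gene, partners in cocitations.items()
        if gene not in seeds and len(partners & seeds) >= k
    }


def cross_evidence(literature_genes, framework_genes):
    """Genes supported both by the literature and by the promoter framework."""
    return set(literature_genes) & set(framework_genes)


# Placeholder data, invented for illustration.
seeds = {"S1", "S2", "S3"}
cocitations = {
    "G1": {"S1", "S2"},
    "G2": {"S1"},
    "G3": {"S1", "S2", "S3"},
    "S1": {"S2"},
}
framework_genes = {"G3", "G4"}  # genes whose promoters match the framework

literature_genes = cocited_with_at_least(cocitations, seeds, k=2)
print(sorted(literature_genes))
print(sorted(cross_evidence(literature_genes, framework_genes)))
```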
In summary, promoter analysis
provides information on transcription
factor–gene relations on the molecular
level independent of literature data.
Relations between transcription factors
and genes can be associated in a context-
dependent manner. Moreover,
information on alternative promoters and
transcripts provides a detailed view on
different biological contexts in which a
gene can function. While this will clearly
reduce the recall from the literature, it
can dramatically increase the context-
dependent precision, which is the most
important parameter for the usefulness of
data mining in general.
Figure 5: Scheme illustrating the combination of literature mining and promoter analysis on the example of a group of nine genes initially identified in a microarray-based study of astrocyte response to the HIV-1 Nef gene48
Further approaches to integrate literature-independent data
Another approach to integrate
experimental, non-literature based
information on relations between
biological entities is to integrate
information about protein–protein
interactions. There are a number of
databases available containing such
information, derived from experiments
such as yeast/mammalian two hybrid. An
example is the DIP database.49 Here,
information from a variety of sources is
combined to create a single, consistent set
of protein–protein interactions. The text-
mining system Chilibot50 uses this source
to integrate additional relations in its
literature-mining results.
Four approaches to characterise gene relations
Literature-mining systems today often
are used for the interpretation of
expression array data. However, the data
from expression arrays also provide gene
relations, defined by genes with similar
expression profiles under defined
experimental conditions (cell type,
treatment, etc). This source is not directly
correlated with gene regulation analysis
since co-expressed genes are not
necessarily co-regulated. Moreover,
effects from different biological
mechanisms such as post-transcriptional
regulation via microRNA, RNA stability,
etc accumulate in the expression signal.
Gene Expression Omnibus51,52 offers a
large collection of documented results
from expression array experiments that
might be integrated in literature mining
systems in the future.
CONCLUSION
Relations between biological entities are
conditional and may change when the
same genes are considered in a different
functional context. As a consequence,
every relation between entities must be
qualified with the functional context in
which the relation was observed.
Moreover, it is impossible to make
general statements whether a relation
detected by literature mining is a true or
a false relation without considering the
observed context.
This context-dependency of relations
also precludes any quantitative
comparison of the content of the various
databases underlying the discussed
methods. Pure numbers cannot answer
the crucial question how well such
relations are qualified with respect to their
biological context. It is safe to assume that
all of the methods have ample basic
information to build on. The only
external indicator we could identify at
least in a qualitative manner is the
assessment of how many different lines of
evidence are combined by the systems.
Although the quality of the final results
still depends crucially on how such
integration procedures are implemented,
the concept of multiple lines of evidence
at least allows for using the principle of
biological consistency discussed above.
There is no doubt that text-mining
methods are powerful tools to further
understand biological principles by
problem-oriented preselection of
publications about biological entities and
their relations. The main challenge in text
mining remains coping with free
unstructured text and its individual
properties, which are characterised by
Zipf's law. Text mining in molecular
biology and biomedicine is complicated
by multiple layers of problems. They
range from the identification of biological
entities through disambiguation (synonyms
and homonyms) and identification of
relations all the way to the interpretation
of the functional context. Numerous
approaches in the field of text mining
show that the identification and
disambiguation of biological entities
already give remarkable results in
precision and coverage, while the analysis
of sentences and text to discover relations
or biological concepts is still a challenge.
Fortunately, the connection of
molecular biology and biomedical
literature to biology not only complicates
the task, but also offers several
opportunities unique to the field. The
biological consistency also includes the
interaction of the cellular transcriptional
and translational machinery with the
genome and the transcriptome. Since we
do have access to several genome
sequences as well as considerable parts of
the transcriptomes (via cDNA/expressed
sequence tag approaches), biological
knowledge mining is no longer restricted
to literature alone.
Further independent lines of evidence
Every phenomenon
described in the literature necessarily has a
molecular foundation within the genomic
sequence. Although our current
knowledge does not allow understanding
all of these molecular correlations, a great
deal of information can already be derived
from genomic sequences, especially about
transcriptional regulation as detailed
above. Moreover, as the biological
principles governing gene regulation seem
to be very general, it becomes possible to
use sequence analyses not only to confirm
knowledge from the literature, but also to
derive new relationships beyond current
knowledge.
Focusing on the aspect of confirming
results from text mining by other
biological data, the following picture
emerges. Starting from the co-occurrence
of biological entities in a text, four
approaches can be identified from the
available applications to confirm, classify
or discard the relation:
• In-depth analysis on sentence or phrase level. Approaches range from
  the application of general syntax rules
  such as co-occurrences of entities in
  the same sentences up to application
  of in-depth analysis by syntactic and
  semantic parsers. The results generally
  cause a decrease of coverage, with an
  increase in precision.
• Hand annotation of preselected sentences by curators. This approach is
  applicable independently of the
  scientific topic, but slow.
• Integration of hand-curated, structured data sources on gene classes
  and text annotations.
• Integration of experimental results either from laboratory or in-silico
  analyses.
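The simplest form of the first approach, co-occurrence of entities within the same sentence, can be sketched as follows. This is only a minimal illustration, not the method of any system discussed here: the function name, the naive sentence splitter, the exact-name matching and the toy text are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrences(text, entities):
    """Count entity pairs that are mentioned together in the same sentence."""
    # Naive sentence splitter (assumption): break after '.', '!' or '?'
    # followed by whitespace. Real abstracts need a more robust splitter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    counts = Counter()
    for sentence in sentences:
        # Whole-word, case-sensitive match on the exact entity name;
        # synonyms and homonyms are deliberately not handled here.
        found = sorted(
            e for e in entities
            if re.search(r"\b" + re.escape(e) + r"\b", sentence)
        )
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


# Toy text (hypothetical sentences written for this example).
text = (
    "LRH-1 regulates aromatase expression in adipose tissue. "
    "Aromatase is also expressed in ovary. "
    "RANTES is induced by LPS in monocytic cells."
)
entities = {"LRH-1", "aromatase", "RANTES", "LPS"}
pairs = sentence_cooccurrences(text, entities)
```

Because matching is case-sensitive and restricted to exact names, the second sentence contributes no pair. This illustrates on a small scale why entity identification and disambiguation have to precede any analysis of relations.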
While the first three methods are
mainly text-driven, the integration of
results from experiments and in-silico
analyses introduces literature-independent
data. This represents an independent level
of defining functional context to evaluate
relations of biological entities. It becomes
increasingly clear that text mining will be
only one tool for information retrieval
and management in biomedical research.
Only the combination with other
methods and information sources will lead
to the best possible structuring and
compilation of biological knowledge.
This does not come as a surprise because
the whole concept of systems biology is
based on this notion. The most immediate
consequence is that text mining of
biomedical literature cannot be
outsourced from biology to other
disciplines but has to be carried out in
tight interaction with biologists.
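How text-derived co-occurrence might be combined with a literature-independent line of evidence to confirm, classify or discard a candidate relation can be sketched as below. The class, the field names, the category labels and the threshold are hypothetical choices made for this illustration; they are not taken from any of the applications discussed.

```python
from dataclasses import dataclass


@dataclass
class CandidateRelation:
    gene_a: str
    gene_b: str
    cooccurrence_count: int     # text-derived: sentences mentioning both entities
    independent_evidence: bool  # literature-independent: experiment or in-silico analysis


def classify(rel, min_cooccurrences=2):
    """Assign a candidate relation to one of four illustrative categories."""
    # Text and independent data agree: the relation is confirmed.
    if rel.independent_evidence and rel.cooccurrence_count >= 1:
        return "confirmed"
    # Supported by the literature only (threshold is an assumption).
    if rel.cooccurrence_count >= min_cooccurrences:
        return "text-only"
    # Supported by data only: a possible relation beyond current literature.
    if rel.independent_evidence:
        return "data-only"
    return "discarded"


# Hypothetical candidates with placeholder gene names.
candidates = [
    CandidateRelation("geneA", "geneB", 3, True),
    CandidateRelation("geneA", "geneC", 4, False),
    CandidateRelation("geneB", "geneD", 0, True),
    CandidateRelation("geneC", "geneD", 1, False),
]
labels = [classify(c) for c in candidates]
```

The "data-only" category corresponds to relationships derived from sequence analysis or experiments that are not yet described in the literature.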
References
1. Lyman, P. and Hal, R. V. (2003), How much information? (URL: http://www.sims.berkeley.edu/how-much-info-2003/).
2. Wheeler, D. L., Chappey, C., Lash, A. E. et al. (2003), Database resources of the National Center of Biotechnology, Nucleic Acids Res., Vol. 31, pp. 193–195.
3. Li, W. (1992), Random texts exhibit Zipf's-law-like word frequency distribution, IEEE Trans Information Theory, Vol. 38(6), pp. 1842–1845.
4. Blaschke, C., Hirschman, L. and Valencia, A. (2002), Information extraction in molecular biology, Brief. Bioinformatics, Vol. 3, pp. 154–165.
5. Shatkay, H. and Feldman, R. (2003), Mining the biomedical literature in the genomic era: An overview, J. Comput. Biol., Vol. 10, pp. 821–855.
6. Hirschman, L., Park, J. C., Tsujii, J. et al. (2002), Accomplishments and challenges in literature data mining for biology, Bioinformatics, Vol. 18, pp. 1553–1561.
7. Dickman, S. (2003), Tough mining: The challenges of searching the scientific literature, PLoS Biol., Vol. 1, p. E48.
8. de Bruijn, B. and Martin, J. (2002), Getting to the (c)ore of knowledge: Mining biomedical literature, Int. J. Med. Inf., Vol. 67, pp. 7–18.
9. Grivell, L. (2002), Mining the bibliome: Searching for a needle in a haystack? New computing tools are needed to effectively scan the growing amount of scientific literature for useful information, EMBO Rep., Vol. 3, pp. 200–203.
10. Andrade, M. A. and Bork, P. (2000), Automated extraction of information in molecular biology, FEBS Lett., Vol. 476, pp. 12–17.
11. Schulze-Kremer, S. (2002), Ontologies for molecular biology and bioinformatics, In Silico Biol., Vol. 2, pp. 179–193.
12. Fukuda, K., Tamura, A., Tsunoda, T. and Takagi, T. (1998), Toward information extraction: identifying protein names from biological papers, in Proceedings of the 3rd Pacific Symposium on Biocomputing, 4th–9th January, Hawaii, pp. 705–716.
13. Krauthammer, M., Rzhetsky, A., Morozov, P. and Friedman, C. (2000), Using BLAST for identifying gene and protein names in journal articles, Gene, Vol. 256, pp. 245–252.
14. Wain, H. M., Lush, M. J., Ducluzeau, F. et al. (2004), Genew: The Human Gene Nomenclature Database, 2004 updates, Nucleic Acids Res., Vol. 32 (Database issue), pp. D255–257.
15. Wain, H. M., Bruford, E. A., Lovering, R. C. et al. (2002), Guidelines for Human Gene Nomenclature, Genomics, Vol. 79(4), pp. 464–470.
16. Maglott, D. R., Katz, K. S., Sicotte, H. and Pruitt, K. D. (2000), NCBI's LocusLink and RefSeq, Nucleic Acids Res., Vol. 28(1), pp. 126–128.
17. Bairoch, A. and Apweiler, R. (1998), The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998, Nucleic Acids Res., Vol. 26, pp. 38–42.
18. Ashburner, M., Ball, C. A., Blake, J. A. et al. (2000), Gene ontology: Tool for the unification of biology, The Gene Ontology Consortium, Nat. Genet., Vol. 25, pp. 25–29.
19. Bodenreider, O. (2004), The Unified Medical Language System (UMLS): Integrating biomedical terminology, Nucleic Acids Res., Vol. 32, pp. 267–270.
20. Tanabe, L. and Wilbur, W. J. (2002), Tagging gene and protein names in biomedical text, Bioinformatics, Vol. 18, pp. 1124–1132.
21. Kazama, J., Makino, T., Ohta, Y. and Tsujii, J. (2002), Tuning support vector machines for biomedical named entity recognition, in Proceedings of the Natural Language Processing in the Biomedical Domain, Association for Computational Linguistics, Philadelphia, pp. 1–8.
22. Nobata, C., Collier, N. and Tsujii, J. (1999), Automatic term identification and classification in biology texts, in Proceedings of the Natural Language Pacific Rim Symposium, Beijing, November, pp. 369–375.
23. Hatzivassiloglou, V., Duboue, P. A. and Rzhetsky, A. (2001), Disambiguating proteins, genes, and RNA in text: A machine learning approach, Bioinformatics, Vol. 17, pp. S97–S106.
24. Novichkova, S., Egorov, S. and Daraselia, N. (2003), MedScan, a natural language processing engine for MEDLINE abstracts, Bioinformatics, Vol. 19(13), pp. 1699–1706.
25. Friedman, C., Kra, P., Yu, H. et al. (2001), GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles, Bioinformatics, Vol. 17, pp. 74–82.
26. Park, J. C., Kim, H. S. and Kim, J. J. (2001), Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar, in Proceedings of the 6th Pacific Symposium on Biocomputing, 3rd–7th January, Hawaii, pp. 396–407.
27. Pustejovsky, J., Castano, J., Zhang, J. et al. (2002), Robust relational parsing over biomedical literature: extracting inhibit relations, in Proceedings of the 7th Pacific Symposium, 3rd–7th January, Hawaii, pp. 362–373.
28. Novichkova, S., Egorov, S. and Daraselia, N. (2003), MedScan, a natural language processing engine for MEDLINE abstracts, Bioinformatics, Vol. 19, pp. 1699–1706.
29. Kolch, W., Calder, M. and Gilbert, D. (2005), When kinases meet mathematics: The systems biology of MAPK signalling, FEBS Lett., Vol. 579(8), pp. 1891–1895.
30. Ogata, H., Goto, S., Sato, K. et al. (1999), KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Res., Vol. 27, pp. 29–34.
31. Golbeck, J. (2003), The National Cancer Institute's thesaurus and ontology, J. Web Semantics, Vol. 1, pp. 75–80.
32. Clyne, C. D., Kovacic, A., Speed, C. J. et al. (2004), Regulation of aromatase expression by the nuclear receptor LRH-1 in adipose tissue, Mol. Cell Endocrinol., Vol. 215(1–2), pp. 39–44.
33. Quandt, K., Frech, K., Karas, H. et al. (1995), MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data, Nucleic Acids Res., Vol. 23, pp. 4878–4884.
34. URL: http://www.genomatix.de/
35. Prestridge, D. S. (2000), Computer software for eukaryotic promoter analysis, Methods Mol. Biol., Vol. 130, pp. 265–295.
36. URL: http://bimas.dcrt.nih.gov/molbio/signal
37. Chen, Q. K., Hertz, G. Z. and Stormo, G. D. (1995), MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices, Comput. Appl. Biosci., Vol. 11, pp. 563–566.
38. Kel, A. E., Gossling, E., Reuter, I. et al. (2003), MATCH™: A tool for searching transcription factor binding sites in DNA sequences, Nucleic Acids Res., Vol. 31, pp. 3576–3579.
39. Heinemeyer, T., Chen, X., Karas, H. et al. (1999), Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms, Nucleic Acids Res., Vol. 27, pp. 318–322.
40. URL: http://transfac.gbf.de/TRANSFAC/
41. Klingenhoff, A., Frech, K., Quandt, K. and Werner, T. (1999), Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity, Bioinformatics, Vol. 15, pp. 180–186.
42. Klingenhoff, A., Frech, K. and Werner, T. (2002), Regulatory modules shared within gene classes as well as across gene classes can be detected by the same in silico approach, In Silico Biol., Vol. 2, pp. S17–S26.
43. Werner, T. (1999), Models for prediction and recognition of eukaryotic promoters, Mamm. Genome, Vol. 10, pp. 168–175.
44. Firulli, A. B. and Olson, E. N. (1997), Modular regulation of muscle gene transcription: A mechanism for muscle cell diversity, Trends Genet., Vol. 13, pp. 364–369.
45. Boehlk, S., Fessele, S., Mojaat, A. et al. (2000), ATF and Jun transcription factors, acting through an Ets/CRE promoter module, mediate lipopolysaccharide inducibility of the chemokine RANTES in monocytic Mono Mac 6 cells, Eur. J. Immunol., Vol. 30, pp. 1102–1112.
46. Fessele, S., Boehlk, S., Mojaat, A. et al. (2001), Molecular and in silico characterization of a promoter module and C/EBP element that mediate LPS-induced RANTES/CCL5 expression in monocytic cells, FASEB J., Vol. 15, pp. 577–579.
47. Wang, Q., Sigmund, C. D. and Lin, J. J. (2000), Identification of cis elements in the cardiac troponin T gene conferring specific expression in cardiac muscle of transgenic mice, Circ. Res., Vol. 86, pp. 478–484.
48. Kramer-Hammerle, S., Hahn, A., Brack-Werner, R. and Werner, T. (2005), Elucidating effects of long-term expression of HIV-1 Nef on astrocytes by microarray, promoter, and literature analyses, Gene, June 13, available online: PMID: 15958282.
49. Xenarios, I., Salwinski, L., Duan, X. J. et al. (2002), DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions, Nucleic Acids Res., Vol. 30, pp. 303–305.
50. Hao, C. and Burt, M. S. (2004), Content-rich biological network constructed by mining PubMed abstracts, BMC Bioinformatics, Vol. 5, p. 147.
51. Barrett, T., Suzek, T. O., Troup, D. B. et al. (2005), NCBI GEO: Mining millions of expression profiles database and tools, Nucleic Acids Res., Vol. 33 (Database issue), pp. D562–566.
52. Edgar, R., Domrachev, M. and Lash, A. E. (2002), Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Res., Vol. 30(1), pp. 207–210.