Text Mining 1

Running head: TEXT MINING

Text Mining

Mark Sharp

Rutgers University, School of Communication, Information and Library Studies

Abstract

The general idea of text mining – getting small "nuggets" of desired information out of

"mountains" of textual data without having to read it all – is nearly as old as information retrieval

(IR) itself. Currently text mining is enjoying a surge of interest fueled by the popularity of the

Internet, the success of bioinformatics, and a rebirth of computational linguistics. It can be

viewed as one of a class of nontraditional IR strategies which attempt to treat entire text

collections holistically, avoid the bias of human queries, objectify the IR process with principled

algorithms, and "let the data speak for itself." These strategies share many techniques such as

semantic parsing and statistical clustering, and the boundaries between them are fuzzy.

Therefore in this paper several related concepts are briefly reviewed in addition to text mining

proper, including data mining, machine learning, natural language processing, text

summarization, template mining, theme finding, text categorization, clustering, filtering, text

visualization, and text compression. Current text mining systems per se appear to be fairly

primitive, but they share the following goals, which may serve as a useful definition to distinguish

text mining from other IR concepts: (1) to operate on large, natural language text collections; (2)

to use principled algorithms more than heuristics and manual filtering; (3) to extract

phenomenological units of information (e.g., patterns) rather than or in addition to documents;

(4) to discover new knowledge. Interest in text mining for biomedical research purposes is

especially pervasive and can be viewed as a major new frontier in bioinformatics. Text mining

systems designed for use with science and technology text databases such as MEDLINE

currently seem to have an undue emphasis on expert human filtering which contradicts goal (2).

Whether this represents premature surrender to difficulty or a necessary temporary expedient

remains to be seen.

Text Mining

Why Text Mining?

It has become a cliché to describe information space and the challenge of navigating it in

dramatic, even histrionic terms ("explosion," "avalanche," "flood," and the like), especially with

regard to scientific, technical, and scholarly literature. We moderns may like to think we are the

first to face this problem, but scientists have always complained about keeping up with their

literature (Saracevic, 2001). The promise of better science through better information technology

has been a major theme in information science since Vannevar Bush (1945) proposed his famous

Memex machine to deal with the "growing mountain of research."

Text mining is data mining applied to textual data. Text is "unstructured, amorphous, and

difficult to deal with" but also "the most common vehicle for formal exchange of information."

Therefore, the "motivation for trying to extract information from it is compelling – even if success

is only partial …. Whereas data mining belongs in the corporate world because that's where most

databases are, text mining promises to move machine learning technology out of the companies

and into the home" as an increasingly necessary Internet adjunct (Witten & Frank, 2000) – i.e., as

"web data mining" (Hearst, 1997). Laender, Ribeiro-Neto, da Silva, and Teixeira (2001) provide a

current review of web data extraction tools.

Text mining is one of a class of what I will call "nontraditional information retrieval (IR)

strategies." The goal of these strategies is to reduce the effort required of users to obtain useful

information from large computerized text data sources. Traditional IR often simultaneously

retrieves both "too little" information and "too much" text (Humphreys, Demetriou, & Gaizauskas,

2000). The nontraditional strategies represent a "broader definition of IR" and the view that "a

truly useful system must go beyond simple retrieval" (Liddy, 2000). I see them as treating the

entire database or collection more holistically, recognizing that the selectivity of anthropogenic

queries has a downside or bias which can be counterproductive to obtaining the best information,

and attempting to "objectify" the IR process with principled algorithms.1 I like to think that they

try to "let the data speak for itself."

When I started to research this paper I made a list of all the IR concepts (traditional and

non-) that were explicitly related to text mining by the first wave of authorities I identified. It was

a daunting list (Table 1), but I thought it would be possible to rule them all either "in" or "out" and

thus define their boundaries and hierarchical relationships to text mining. However, it soon

became clear that the boundaries were fuzzy, the hierarchy was a mass of convoluted loops, and

even seemingly outlandish claims to text mining relevance had, on closer inspection, a grain of

truth.2 Therefore I decided to try to cover them all instead of focusing on text mining proper,

whatever that turned out to be. Fortunately, time and literature resource limitations intervened to

significantly curtail this plan. Hopefully the result will serve as a sensible compromise.

History of Text Mining

H. P. Luhn (1958), in a seminal paper on automatic abstracting, noted "the resolving power

of significant words" in primary text. Lauren B. Doyle (1961) also captured the spirit of text

mining and related methods when he said that "natural characterization and organization of

information can come from analysis of frequencies and distributions of words in libraries"

("libraries" representing what we would now more generally call collections or corpora). Text

mining per se may be new, but the dream of training a computer to extract information from

"mountains" of textual data is nearly as old as IR itself.

1 E.g., "'Objectivity' [means] the results solely depend on the outcome of the linguistic processing algorithms and statistical calculations" (Dorre, Gerstl, & Seiffert, 1999). I recognize that such computational exotica, stripped of their mathematical mystique, "can be regarded as a form of transformed cognitive structure" (Ingwersen & Willett, 1995) and are therefore ultimately just as human and arbitrary as the traditional methods. But I also believe that there can be degrees of objectivity (operationally defined as general validity or utility) and that in general abstract computational approaches will tend to be more objective.

2 There is one website, however, that goes too far. Greenfield (2001) lists virtually every text processing and database technology I have ever heard of under the title "Text Mining." As a kind of rite of passage into the subject, Patrick Perrin asked me to look at it and tell him if all of that was really text mining, so apparently it's somewhat notorious in the field.

Don R. Swanson (1988) articulated the idea that the scientific literature should be regarded

as a natural phenomenon worthy of "exploration, correlation, and synthesis." He contrasted

scientists' attitudes toward information usage with those of intelligence analysts.

'To the working scientist or engineer, time spent gathering information or writing reports is often regarded as a wasteful encroachment on time that would otherwise be spent producing results that he believes to be new' [Weinberg et al, 1963] …. The intelligence analyst, by contrast, is much more intimate with the available base of recorded information. New knowledge, or finished intelligence, is seen as emerging from large numbers of individually unimportant but carefully hoarded fragments that were not necessarily recognized as related to one another at the time they were acquired. Use of stored data is intensively interactive; "information retrieval" is an inadequate and even misleading metaphor. The analyst is continually interacting with units of stored data as though they were pieces selected from a thousand scrambled jigsaw puzzles. Relevant patterns, not relevant documents, are sought.

Swanson called upon scientists to be more like intelligence analysts; to "take seriously the idea that

new knowledge is to be gained from the library as well as the laboratory [and] to develop attitudes

toward information indistinguishable from attitudes toward research itself."

Not content to lecture scientists from a theoretical pedestal, by the time these words were

published Swanson had already put the idea into practice by developing a system to discover

meaningful new knowledge in the biomedical literature (see references in Swanson & Smalheiser,

1999). Software now called ARROWSMITH and freely available on the web

(http://kiwi.uchicago.edu) helps by finding common keywords and phrases in "complementary and

noninteractive" sets of articles or "literatures" and juxtaposing representative citations likely to

reveal interesting co-occurrences. Two literatures are "complementary if together they can reveal

useful information not apparent in the two sets considered separately" – e.g., one may reveal a

natural relationship between A and B, and the other a relationship between B and C, so that

together they suggest a relationship between A and C. The two literatures are "noninteractive" if

their articles do not cross-cite and are not co-cited elsewhere in the literature. Swanson has

discovered at least three biomedically important relationships using this system: between fish oil

and Raynaud's syndrome, magnesium and migraines and epilepsy, and arginine and somatomedin

C (Lindsay & Gordon, 1999). Most recently he has used it to identify several dozen viruses as

potential bioweapons (Swanson, Smalheiser, & Bookstein, 2001).
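
The A-B-C reasoning behind complementary literatures can be sketched in a few lines of Python. This is only an illustration of the idea, not ARROWSMITH's actual implementation: the toy titles, the stopword list, and the function names are all invented for the example.

```python
import re

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "on", "the", "to", "with"}


def terms(title):
    """Lowercase a title and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS}


def linking_terms(literature_a, literature_c, topic_a, topic_c):
    """Return candidate B terms found in both literatures, with supporting titles.

    Words from the two search topics themselves are excluded, since they
    cannot serve as an intermediate link between A and C.
    """
    exclude = terms(topic_a) | terms(topic_c)
    index_a, index_c = {}, {}
    for index, literature in ((index_a, literature_a), (index_c, literature_c)):
        for title in literature:
            for term in terms(title) - exclude:
                index.setdefault(term, []).append(title)
    shared = sorted(index_a.keys() & index_c.keys())
    return {b: (index_a[b], index_c[b]) for b in shared}


# Toy titles invented for illustration only.
fish_oil = [
    "Fish oil reduces blood viscosity in healthy volunteers",
    "Dietary fish oil and platelet aggregation",
]
raynaud = [
    "Blood viscosity in patients with Raynaud syndrome",
    "Platelet aggregation is increased in Raynaud syndrome",
]

links = linking_terms(fish_oil, raynaud, "fish oil", "Raynaud syndrome")
for b, (from_a, from_c) in links.items():
    # Juxtapose one title from each literature for the human reader to judge.
    print(f"{b}: {from_a[0]!r} <-> {from_c[0]!r}")
```

Even in this toy form the output is only a list of candidate links; deciding which juxtapositions are scientifically interesting is still left to the human reader, which is the limitation discussed below.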

Swanson's system remains far from fully automated, it is highly medical domain-specific,

and to my knowledge Swanson has never referred to it as text mining. But I believe it meets the

criteria at least partially (see below), and Swanson has been recognized as an early pioneer by self-

described text mining practitioners Marti Hearst (1999) and Ronald Kostoff (1999). I would like

to go further and propose that, because of the ideas he expressed in his 1988 JASIS paper,

Swanson is the father of modern text mining.

What is Text Mining?

Text mining per se is new and is still defining itself. It "has the peculiar distinction of

having a name and a fair amount of hype but as yet almost no practitioners" (Hearst, 1999), and

most of the information about it on the web is "misleading" (Perrin, 2001). The mining metaphor

"implies extracting precious nuggets of ore from otherwise worthless rock" (Hearst, 1999), "gold

hidden in … mountains of textual data" (Dorre, Gerstl, & Seiffert, 1999), or the idea that "the

computer rediscovers information that was encoded in the text by its author" (IBM, 1998b).

Hearst (1997, 1999) has argued for a narrow definition of text mining which distinguishes

it from "information access" (traditional IR). Traditional IR is concerned primarily with the

retrieval of documents (perhaps it should be called "DR"!) relevant to a user's information need,

but getting the desired information out of the documents is left entirely up to the user. According

to Hearst, data mining (of which text mining is a subtype, see below) not only deals directly with

the information, it tries to discover or derive new information from the data (text) which was

previously unknown even to the author(s) of the data (text[s]). She says "data mining is

opportunistic, whereas information access is goal-driven" and that IR tricks such as clustering,

finding terms for query expansion, and co-citation analysis are not text mining, although they can

aid it by improving the target dataset. Thus, IR can be viewed as a complementary technique

supporting text mining, rather than its broader term.

Text mining always involves (a) getting some texts relevant to the domain of interest

(traditional IR); (b) representing the content of the text in some medium useful for processing

(natural language processing, statistical modeling, etc.); and (c) doing something with the

representation (finding associations, dominant themes, etc.) (Perrin, 2001).

IBM is marketing a product named "Intelligent Miner for Text" (IBM, 1998a,b; Dorre et

al, 1999). It is a set of tools which "can be seen as information extractors which enrich

documents with information about their contents" in the form of structured metadata. "Features"

are classes of data which can be extracted, such as the language of the text, proper names, dates,

currency amounts, abbreviations, and "multiword terms" (significant phrases). The feature

extraction component is "fully automatic – the vocabulary is not predefined." It may operate on

single documents or on collections of documents. Word counts are based on normalization to

canonical forms (e.g., surgeries, surgical, and surgically might all be normalized to surgery).

The phrase extractor "uses a set of simple heuristics… based on a dictionary containing part-of-

speech information for English words [and] simple pattern matching to find expressions having

the noun phrase structures characteristic of technical terms. This process is much faster than

alternative approaches." There is also a clustering tool, a classification tool, and a search

engine/web crawler. The clustering similarity measure is based on "lexical affinities" –

correlated groups of words which appear frequently within a short distance of each other and

which can be used to label the clusters.
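
The notion of a lexical affinity, words that repeatedly occur within a short distance of each other, can be made concrete with a small sketch. It is not IBM's algorithm, whose details are not given here; the window size, the frequency threshold, and the function name are assumptions chosen for illustration.

```python
import re
from collections import Counter


def lexical_affinities(text, window=5, min_count=2):
    """Count unordered word pairs that co-occur within `window` tokens.

    Sentence boundaries are ignored for simplicity, so a pair may span
    two adjacent sentences.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens to the right of `left`.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    # Keep only pairs seen often enough to count as an affinity.
    return {pair: n for pair, n in pairs.items() if n >= min_count}


# Sample text invented for illustration only.
sample = (
    "Text mining extracts patterns. Data mining finds patterns in databases. "
    "Text mining differs from data mining."
)
affinities = lexical_affinities(sample, window=2, min_count=2)
for pair, count in sorted(affinities.items(), key=lambda item: -item[1]):
    print(pair, count)
```

Pairs that survive the threshold could then serve as cluster labels in the way the IBM documentation describes, although a practical system would also need stopword removal and normalization to canonical forms.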

Lindsay and Gordon (1999) and Kostoff (1999) have extended Swanson's approach

without calling it text mining, but Kostoff's other work explicitly uses that label and so he serves

as a kind of bridge. Swanson's system is essentially as follows: MEDLINE searches are done on

two subjects (say, magnesium and migraines) and the results (titles or abstracts) are dumped into

ARROWSMITH, which generates a list of all significant words and phrases common to the two

result sets, and uses this information to "juxtapose pairs of text passages for the user to consider

as possibly complementary" (Swanson & Smalheiser, 1999). Lindsay and Gordon (1999) added

lexical frequency statistics (tf*idf) to rank the common words and phrases by probable

discriminatory value, but their system, like Swanson's, still requires "human filters" at several

points.
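
Ranking the shared vocabulary of two result sets by tf*idf can be sketched as follows. The exact weighting used by Lindsay and Gordon is not described here, so the code assumes a textbook variant (pooled term frequency times the log of inverse document frequency); the function names and toy titles are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a title or abstract into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_shared_terms(set_a, set_c):
    """Rank terms present in both result sets by a pooled tf*idf score."""
    docs = [tokenize(d) for d in set_a + set_c]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vocab_a = {t for d in set_a for t in tokenize(d)}
    vocab_c = {t for d in set_c for t in tokenize(d)}
    scores = {}
    for term in vocab_a & vocab_c:
        tf = sum(doc.count(term) for doc in docs)
        scores[term] = tf * math.log(n_docs / df[term])
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy search results invented for illustration only.
magnesium = [
    "magnesium deficiency and vascular tone in the rat",
    "magnesium and the vascular response",
]
migraine = [
    "vascular tone in migraine",
    "serotonin in the migraine attack",
]

ranked = rank_shared_terms(magnesium, migraine)
for term, score in ranked:
    print(f"{term}\t{score:.3f}")
```

With so few documents the idf component discriminates only weakly, which hints at why such rankings still need a human filter before they are useful.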

Kostoff and co-workers have published several papers on the Web describing various text

mining systems and applications. Losiewicz, Oard, and Kostoff (2000) describe a "TDM [text

data mining] architecture that unifies information retrieval from text collections, information

extraction from individual texts, knowledge discovery in databases, knowledge management in

organizations, and visualization of data and information." What they mean by "unifies" is

unclear, but this statement clearly betokens a broad view of text mining, almost as a synonym for

the entire family of nontraditional IR strategies. The "TDM architecture" they describe includes

subsystems for data collection (source selection and text retrieval), data warehousing

(information extraction and data storage), and data exploitation (data mining and presentation).

It thus appears to be a system for extracting and analyzing metadata. The authors discuss

linguistic analysis and numerous exotic pattern-finding techniques, but these appear to be long-

range goals. Current work focuses on the more pedestrian challenges of relevance feedback

("simulated nucleation"), bibliometrics, and phrase extraction and statistics. The system is "time

and labor intensive" by the authors' own admission, "requires the close involvement of technical

domain expert(s)" at every level of processing, and aims for a "main output [consisting of]

technical experts who have had their horizon and perspectives broadened substantially through

participation in the data mining process. The data mining tools, techniques and tangible products

are of secondary importance…"

Kostoff, Toothman, Eberhart, and Humenik (2000) connect text mining to "database

tomography," a system for phrase extraction and proximity analysis. The authors capture the

spirit of text mining when they say "techniques that identify, select, gather, cull, and interpret

large amounts of technological information semi-autonomously can expand greatly the

capabilities of human beings…" The idea of "tomography" also evokes text visualization, an

important nontraditional IR strategy related to text mining (see below). The authors cite

unpublished studies showing that in "real-world text mining applications" there is a "strong de-

coupling of the text mining research performer from the text mining user. The performer tended

to focus on exotic automated techniques, to the relative exclusion of the components of judgment

necessary for user credibility and acceptance." Users tended to favor simpler techniques, even if

it meant "reading copious numbers of articles." Database tomography aims to couple text mining

research and technology more closely with the user through "heavy involvement of topical

domain experts (either users or their proxies)" in the development of "strategic database maps"

on the "front end." "The authors believe that this is the proper use of automated techniques for

text mining: to augment and amplify the capabilities of the expert by providing insights to the

database structure and contents, not to replace the experts by a combination of machines and

non-experts."

Kostoff and DeMarco (2001) define science and technology text mining as "the

extraction of information from technical literature." It has three components: information

retrieval (gathering relevant documents), information processing, and information integration.

"Information processing is the extraction of patterns from the retrieved records" by bibliometrics,

computational linguistics, and clustering. "Information integration is the synergistic combination

of the information processing computer output with the [human] reading of the retrieved relevant

records. The information processing output serves as a framework for the analysis, and the

insights from reading the records enhance the skeleton structure to provide a logical integrated

product." Again, "substantial manual labor" is noted, and technical details are not given, leaving

doubt as to what kind of and how much "computational linguistics" and "clustering" were

actually implemented. This work was also published under the title "Citation mining: Integrating

text mining and bibliometrics for research user profiling" by Kostoff, del Rio, Humenik, Garcia,

and Ramirez (2001).

In all of Kostoff's articles, there is a disturbingly high ratio of shifting, florid, technical

jargon and speculation to actual accomplishment. He seems to be re-inventing several well

established techniques such as relevance feedback, co-citation analysis, and phrase extraction,

giving them flashy new names, and failing to cite prior work by others. It is often unclear where

the boundary is between the computer and human filtering, particularly in Kostoff's phrase

extraction process. Given the authors' constant emphasis on the importance of human judgment

it seems likely that they have not automated the phrase selection process at all, and therefore

have not added anything to classical word proximity analysis for phrase identification.

Unrestricted human filtering or intervention in what are supposed to be algorithmic processes is,

in some sense, a form of "fudging" or "cheating." It is antithetical to the goals of standardizing

and objectifying the IR process, and it is hard to see how it contributes anything progressive to

text mining research. This is not to disagree with Kostoff about the importance of domain

expertise and user credibility and acceptance, only to caution against using such concerns as a

figleaf for excessively primitive IR technology.

Based on the foregoing, I propose the following criteria for a true text mining system.

The keywords are highlighted.

(1) It must operate on large, natural language text collections.

(2) It must use principled algorithms more than heuristics and manual filtering.

(3) It must extract phenomenological units of information (e.g., patterns) rather than or in

addition to documents.

(4) It must discover new knowledge.

It is to be expected that different systems will meet these criteria to different extents.

Currently Swanson's and Kostoff's systems are on shaky ground on at least the first two, possibly

three. Perhaps text mining, by these criteria, is still more dream than reality. So let's look at

some related concepts.

Data Mining

It seems fairly noncontroversial that text mining is a subdiscipline of the broader and

slightly older field of data mining, the subdiscipline which deals with textual data. An

intermediate evolutionary lexical form, in fact, is "text data mining" (Hearst, 1999; Losiewicz et al,

2000). The mining metaphor implying "extracting precious nuggets of ore from otherwise

worthless rock" is actually more appropriate for text mining than for data mining, which tends to

deal with trends and patterns across whole databases (Hearst, 1999).

Data mining is considered a synonym for "knowledge discovery in databases" (KDD) by

some writers (e.g. Hearst, 1999) and as a narrower term by others (e.g. Liddy, 2000). The most

cited definition of KDD is that given by Fayyad, Piatetsky-Shapiro, and Smyth (1996, cited by Qin,

2000, and Hearst, 1997): the nontrivial process of identifying valid, novel, potentially useful, and

ultimately understandable patterns in data. "Information archaeology" is a synonym for both data

mining and KDD, according to Hearst (1999). Two unusually practical, down-to-earth books on

data mining are Witten and Frank (2000) and Han and Kamber (2001) (Perrin, 2001).

Data mining usually deals with structured data, but text is usually fairly unstructured. The

crux of the text mining problem, then, can be viewed as imposing structure on text to make it

amenable to the analytic techniques of data mining. This is often conceptualized as extracting

metadata from text (Losiewicz et al, 2000).
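
What "imposing structure on text" means in practice can be shown with a deliberately simple sketch that turns a sentence into one flat metadata record. The regular expressions, field names, and sample sentence are all assumptions made for illustration; real feature extractors are far more elaborate.

```python
import re

# Hypothetical, deliberately simple patterns.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def extract_metadata(doc_id, text):
    """Turn free text into one flat record that tabular tools can consume."""
    return {
        "doc_id": doc_id,
        "years": sorted(set(YEAR.findall(text))),
        "amounts": MONEY.findall(text),
        "acronyms": sorted(set(ACRONYM.findall(text))),
        "n_words": len(text.split()),
    }


# Invented sentence; the organizations and figures are placeholders.
record = extract_metadata("d1", "In 1998 ACME sold three licenses for $1,500.50 to NIH.")
print(record)
```

Once every document has been reduced to a record of this kind, the collection can be treated as an ordinary table and handed to conventional data mining techniques.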

Machine Learning

Data mining is based on a variety of computational techniques, some of which fall under

the rubric of machine learning. Examples are decision trees, neural networks, and association rules

(clustering). In this context, machine learning involves "the acquisition of structural descriptions

from examples [which] can be used for prediction, explanation, and understanding." When the

description can be used to classify the examples, all three are enabled, unlike purely statistical

modeling which only supports prediction. By some views, however, machine learning is little

more than practical statistics as it evolved in the field of computer science; i.e., with an emphasis

on searching "through a space of possible concept descriptions for one that fits the data" (Witten &

Frank, 2000).

From a broader artificial intelligence (AI) perspective, machine learning is one of the four

capabilities needed for an AI system such as a robot to pass the "Turing test" – that is, to appear

logical, rational, and intelligent to an intelligent human interrogator. In this context machine

learning involves the ability "to adapt to new circumstances and to detect and extrapolate patterns"

(Russell & Norvig, 1995).

From a biomedical research perspective, Mjolsness and DeCoste (2001) define machine

learning as "the study of computer algorithms capable of learning to improve their performance of

a task on the basis of their own previous experience" primarily through pattern recognition and

statistical inference. They see a legitimate future role for it in "every element of scientific method,

from hypothesis generation to model construction to decisive experimentation." Text mining

could help with the "high data volumes" involved in literature searching. However, most work to

date has focused on experimental data reduction such as visualization of high-dimensional vector

data resulting from gene expression microarray studies (see footnote 6, p. 25).

Natural Language Processing

Natural language processing (NLP) or understanding (NLU) is the branch of linguistics

which deals with computational models of language. A brief history is given by Bates (1995).

Its motivations are both scientific (to better understand language) and practical (to build

intelligent computer systems). NLP has several levels of analysis: phonological (speech),

morphological (word structure), syntactic (grammar), semantic (meaning of multiword

structures, especially sentences), pragmatic (sentence interpretation), discourse (meaning of

multi-sentence structures), and world (how general knowledge affects language usage) (Allen,

1995). When applied to IR, NLP could in principle combine the computational (Boolean, vector

space, and probabilistic) models' practicality with the cognitive model's willingness to wrestle

with meaning. NLP can differentiate how words are used, for example by sentence parsing and

part-of-speech tagging, and thereby might add discriminatory power to statistical text analysis.

Clearly, NLP could be a powerful tool for text mining. Interest in it for that purpose is

widespread but the jury remains out.

Rau (1988) described an early NLP system named SCISOR which was developed by

General Electric. Limited applicability to "constrained domains" was emphasized; SCISOR was

programmed to deal only with information on corporate mergers. Input (news stories, etc.) was

described as being converted to "conceptual format" permitting natural language interrogation

(i.e., question answering) and summarization. SCISOR employed a parallel strategy of top-down

(expectation-driven conceptual analysis) and bottom-up (partial linguistic analysis) parsing.

Parsing is the identification of subjects, verbs, objects, phrases, modifiers, etc., within sentences.

Computerized parsing of free text "is an extremely difficult and challenging problem," according

to Rau. The two parsers in SCISOR interacted with a domain-specific knowledge base

containing grammatical and lexical information. The double parsing strategy of SCISOR

allowed flexibility to perform in-depth analysis when complete grammatical and lexical

knowledge is available, and superficial analysis when unknown words and syntax are

encountered, giving the system robustness. The top-down parser could also be used for text

skimming (looking for particular pieces of information).

However, semantic analysis "is very expensive and furthermore depends on a lot of

domain-dependent knowledge that has to be constructed manually or obtained from other sources"

(IBM, 1998a). Early NLP's image also suffered from the poor performance of phrase-based

indexing in comparison with stemmed single words in the Cranfield and SMART tests (Salton,

1992). Interest in NLP revived when request-oriented (as opposed to document-oriented) IR came

of age and it was realized that the limitations of the linguistic techniques did not prevent them from

being effective within restricted subject domains (Ingwersen & Willett, 1995). Unlike its more

successful sibling field of speech recognition, NLP has the severe disadvantages of diffuse goals

and lack of robust machine learning algorithms (Bates, 1995). There seems to be wide consensus

that NLP is still not competitive with statistical approaches to traditional IR, but that it may be

practical and even critical for applications such as phrase extraction and text summarization. Even

Salton, the godfather of statistical IR, said, "In the absence of deep linguistic analysis methods that

are applicable to unrestricted subject areas, it is not possible to build intellectually satisfactory text

summaries" (Salton, Allan, Buckley, & Singhal, 1994).

Liz Liddy (2000, 2001) has become a prominent advocate for NLP in text mining. Her

definition of the goal of text mining, in fact, is "capturing semantic information" as tabular

metadata amenable to statistical data mining techniques. In her work, NLP includes stemming

(morphological level), part-of-speech tagging (syntactic level), phrase and proper name

extraction (semantic level), and disambiguation (discourse level). Goals include automating text

mark-up for hypertext linkages in digital libraries, and machine learning algorithms for text

classification (see below).

A "reverse flow" of purely statistical methods to NLP has been going on since about

1990 and has made "substantial contributions" (Kantor, 2001), increasing interest in hybrid

approaches (Marcus, 1995; Losee, 2001a; Perrin, 2001). Statistical enrichment has been shown

to significantly improve the accuracy of proper name classification, part-of-speech tagging, word

sense disambiguation, and parsing under certain conditions (Marcus, 1995), and tagging and

disambiguation improve probabilistic document retrieval ranking discrimination by some parts of

speech (Losee, 2001a). Ultimately, lexical statistics are a reflection of term dependencies which

in turn reflect natural languages' relation to "naturally occurring dependencies in the physical

world" (Losee, 2001b). However, higher-level NLP proved far inferior to "shallow" tricks like

stemming and query expansion in improving the performance of an advanced IR system under

rigorous test conditions (Perez-Carballo & Strzalkowski, 2000).

Computational linguistics is used as a synonym for NLP by some writers and as a

narrower term by others. According to Hearst (1999), it is the branch of NLP which deals with

finding statistical patterns in large text collections to inform algorithms for NLP techniques such

as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation; i.e.,

computational linguistics is a form of text mining. Thus, to Hearst and Liddy, text mining

subserves NLP, rather than the reverse. Both Hearst and Liddy refer often to metadata as being

the bridge between NLP and statistics. They both envision text mining as a component of a full-

featured information access system which also includes source detection, content retrieval, and

analytical aids such as text visualization (see below).

A major problem in text analysis is "dangling anaphors" – pronouns and demonstratives

(this, that, the latter, etc.) which refer back to other sentences (Johnson, Paice, Black, & Neal,

1993). Therefore a good job for NLP would be to detect anaphors and search backwards to

resolve their referents. In the language of logic, this might be called identifying the point in the

text where each significant new proposition begins. In 1993, that was beyond available text

processing capabilities, so the authors had to exclude anaphoric sentences from further analysis

regardless of their information content.

In summary, all this activity and interest raise hopes, but NLP still "has not delivered the

goods" (Saracevic, 2001) and so the jury remains out.

Text Summarization

An obvious example of text mining would be to find previously unknown natural

correlations by looking at co-occurrences of themes in a corpus of texts. Before one can do that,

of course, one must identify the themes. A theme being a form of summary, automated theme-

finding is a form of automatic text summarization (or automatic abstracting), a proud old IR

tradition.

Johnson, Paice, Black, and Neal (1993) trace the history of automatic abstract generation

from Luhn (1958), who proposed extracting sentences based on their computed word content

weights, and Baxendale (1958, cited by Johnson et al., 1993), who drew attention to the

importance of the first and last sentences of paragraphs. Edmundson (1969, cited by Johnson et

al., 1993) found that both of these methods were inferior to extraction on the basis of cues (bonus

words and stigma words). Paice (1981, cited by Johnson et al., 1993) sharpened Edmundson's

idea of cues to "indicator constructs" such as "In this paper we show that…"
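The two extraction ideas just described, word-content weights and cue words, can be sketched in a few lines. This is only a minimal illustration under my own assumptions: the stop list, the bonus and stigma word lists, and the scoring formula (average content-word frequency plus a cue adjustment) are invented for the example and are much cruder than the procedures Luhn and Edmundson actually described.

```python
import re
from collections import Counter

# Illustrative word lists; all three are assumptions, not taken from the cited papers.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are",
              "that", "this", "we", "it", "on", "for"}
BONUS_WORDS = {"significant", "show", "conclude"}
STIGMA_WORDS = {"hardly", "perhaps"}


def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(text):
    """Return (score, sentence) pairs in original text order."""
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    scored = []
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        content = [w for w in words if w not in STOP_WORDS]
        # Word-content weight: summed corpus frequency of content words,
        # normalized by sentence length.
        weight = sum(freq[w] for w in content) / max(len(words), 1)
        # Cue adjustment: +1 per bonus word, -1 per stigma word.
        cue = sum(w in BONUS_WORDS for w in words) - sum(w in STIGMA_WORDS for w in words)
        scored.append((weight + cue, sentence))
    return scored


def extract(text, n=1):
    """Keep the n highest-scoring sentences, presented in text order."""
    scored = score_sentences(text)
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]
    chosen = {sentence for _, sentence in top}
    return [sentence for _, sentence in scored if sentence in chosen]
```

Calling `extract(document_text, n=3)` would return a three-sentence extract; the choice of `n` plays the role of the desired degree of condensation.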

Johnson et al. (1993) built an NLP-based auto-abstracting system which selected non-anaphoric,

indicator-containing sentences and ran them through a bottom-up parser, dictionary-based

part-of-speech tagger (noun, verb, etc.) and morphology-based tagger (-ly = adverb, etc.).

Each word was then indexed by its sentence number, position within the sentence, part of speech,

verb tense if applicable, and whether it was plural or singular. The result was then "cleaned

up" by a set of corrective heuristics and a grammar-based tag disambiguator3. A global parser

then identified noun phrases based on definitive cues such as being separated by a preposition

(e.g., the primary factor in public health), and then parsed the sentence. The resulting sample

abstract was "far from perfect" as the authors admitted, but it was a plausible condensation down

to 22% of the original text size. Since 22% is an inadequate degree of data reduction for most

text summarization needs, the next step might be to take a page from statistical IR and develop

ways of ranking the selected sentences.

Template Mining

SCISOR's (Rau, 1988) text summarization capabilities were based on filling in values

specified by domain-dependent, manually formulated "scripts" – e.g., company A offered B

dollars per share in a takeover bid for company C on date D. The values were extracted from

raw text by parsing and stored in relational data tables. Then summaries of the parsed data

values could be written by a natural language generator. This seems to be a form of template

mining, where the script or metadata table field structure constitutes the template.
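To make the idea of a script-as-template concrete, here is a minimal sketch that fills the four slots of the takeover example (A, B, C, D) from one sentence. It rests on loud assumptions: SCISOR extracted values by parsing, whereas this sketch substitutes a single hypothetical regular expression, and the field names (`bidder`, `price`, `target`, `date`) and the company names in the usage note are invented for illustration.

```python
import re

# Hypothetical surface pattern for one takeover "script". A system like SCISOR
# parsed the text; a lone regular expression is a stand-in for illustration only.
TAKEOVER = re.compile(
    r"(?P<bidder>[A-Z][\w&. ]*?) offered \$(?P<price>\d+(?:\.\d+)?) per share "
    r"in a takeover bid for (?P<target>[A-Z][\w&. ]*?) "
    r"on (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"
)


def fill_template(sentence):
    """Return one table row (a dict of slot values), or None if the template does not match."""
    match = TAKEOVER.search(sentence)
    if match is None:
        return None
    row = match.groupdict()
    row["price"] = float(row["price"])  # store the offer as a number, not text
    return row
```

For a made-up sentence such as "Acme Corp. offered $42.50 per share in a takeover bid for Widget Inc. on March 3, 1988." the function returns a row with the four slot values, which could then be stored in a relational table; sentences that do not fit the template yield `None`.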

Chowdhury (1999) describes template mining as a form of information extraction using

NLP "to extract data directly from the text if either the data and/or text surrounding the data form

recognizable patterns. When text matches a template, the system extracts data according to the

instructions associated with that template." Chowdhury traces its history from the mid-1960s

Linguistic String Project at New York University, where "fact retrieval" was conducted against

template data mined from natural language text, up to its current (1999) use in the AltaVista and

Ask Jeeves web search engines. He cites some of the same work I reviewed under NLP and

below (the Rau, Paice, and Gaizauskas groups), perhaps implying that template mining is a

general term for NLP-based metadata approaches to text mining. He also cites Croft (1995) in

reference to the U.S. Advanced Research Projects Agency (ARPA) initiative in this area, the

Message Understanding Conferences (MUCs).

3 An example of a sentence with intractable tag ambiguity would be "Rice flies like sand," which could refer to the behavior of grain or insects (Allen, 1995, p. 13). Such a sentence would require higher (pragmatic and discourse) levels of analysis to disambiguate.

To facilitate template mining, Chowdhury recommends "standardization in the

presentation and layout of information within digital documents" through the use of templates for

document creation. But this is contrary to the spirit of text mining, which is to liberate both the

creators and the users of text from as much tedium and artificiality as possible. Like Kostoff's

unrestricted reliance on human filters, it represents a form of surrender in the face of difficulty –

hopefully premature!

Theme Finding

Salton, Allan, Buckley, and Singhal (1994) looked at how traditional IR models can be

applied to theme generation and text summarization. The authors derived the notion of passage

retrieval from the problem of ranking vector matches when the vectors are of different lengths,

e.g. very short queries against long documents, or clustering documents of different sizes. One

solution is to decompose the documents into subunits of roughly equal size, called "passages." A

common passage unit is a paragraph.

The passages may be converted to normalized vectors and compared. Those with

similarities above a certain threshold (which may be chosen to deliver a desired degree of

abstraction) are considered connected. If the documents are plotted as arcs on the circumference

of a circle and their component passages connected by straight lines in accordance with their

vector similarities, the resulting starburst pattern can convey themes within and between

documents. These themes can be focused by expressing each triangle of passage similarities

as a centroid and doing similarity calculations on the centroids.
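The passage-linking step can be sketched as follows. This is a simplified illustration, not the procedure of Salton et al.: I assume raw term-frequency vectors (a real system would apply term weighting), cosine similarity as the normalized comparison, and an arbitrary default threshold of 0.3; the function names are my own.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(passage):
    """Raw term-frequency vector (assumption: no stemming, stop list, or tf-idf weighting)."""
    return Counter(re.findall(r"[a-z]+", passage.lower()))


def cosine(u, v):
    """Cosine similarity, i.e. the inner product of the length-normalized vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def passage_links(passages, threshold=0.3):
    """Return (i, j, similarity) for every passage pair whose similarity exceeds the threshold."""
    vectors = [term_vector(p) for p in passages]
    links = []
    for i, j in combinations(range(len(passages)), 2):
        similarity = cosine(vectors[i], vectors[j])
        if similarity > threshold:
            links.append((i, j, round(similarity, 3)))
    return links
```

Raising `threshold` leaves fewer, stronger links and so a more abstract picture of the themes; lowering it has the opposite effect.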

One may want to compute an estimate of the "most important" passages for the purpose

of selective text traversal ("skimming") or text summarization. Such passages might be

identified as (a) having a large number of above-threshold similarity connections, (b) strategic

position (e.g., the first paragraph in each section), or (c) high similarity to some reference node.

The last criterion (c) is called "depth first" selection. In practice, all three of these criteria can be

combined; e.g., start with some desired passage (as in "more like this"), go to the most similar

sectional heading passage, then go to its strongest link, then select the other densely connected

nodes in that cluster in chronological order. For text summarization, repetition can be edited out

on the basis of similarities between sentences or other subunits which are "too high."
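Criterion (a) lends itself to a very short sketch. Assuming the above-threshold similarity links are already available as pairs of passage indices (the input format and the function names are my own, hypothetical choices), the best-connected passages can be picked and then read in text order. Criteria (b) and (c) are not implemented here.

```python
from collections import defaultdict


def rank_passages(n_passages, links):
    """Order passage indices by number of similarity links; ties are broken by text order."""
    degree = defaultdict(int)
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    return sorted(range(n_passages), key=lambda k: (-degree[k], k))


def skim_path(n_passages, links, keep):
    """Select the `keep` best-connected passages and return them in original text order."""
    return sorted(rank_passages(n_passages, links)[:keep])
```

With five passages and links `[(0, 1), (0, 2), (0, 3), (2, 3)]`, `skim_path(5, links, 3)` selects passages 0, 2, and 3 for skimming.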

Text Categorization

Text categorization should not be considered a form of text mining because it is a

"boiling down" of document content to "pre-defined labels" which "does not lead to discovery of

new information" since "presumably the person who wrote the document knew what it was

about," according to Hearst (1999). Presumably she would also rule out text summarization and

auto-indexing for the same reason. She makes exceptions, however, for cases where the goal of

categorization is to find "unexpected patterns" or "new events" because these "tell us something

about the world, outside of the text collection itself" and therefore qualify as new information.

I would argue, however, that it is not so easy to predict where "new information" will

come from, that novelty is in the eye of the beholder, and that any form of text data reduction is a

form of separating "precious nuggets" from "worthless rock" according to the human

idiosyncrasies of whoever is doing the separating, be it a traditional library cataloguer/indexer or

a vector space modeler. This is not to say that cataloguing, indexing, and other IR tools are all

text mining, but just to highlight the fuzziness of the boundaries between them.

Clustering

Clustering can be used to classify texts or passages in natural categories that arise from

statistical, lexical, and semantic analysis rather than the arbitrarily pre-determined categories of

traditional manual indexing systems. In the context of text mining, it is the derivation of the

categories which is of interest, since this is a form of theme finding and therefore text

summarization. Once the texts are clustered on the basis of common themes, it may also be useful

to correlate their divergent themes, a la Swanson. Texts may also be clustered on the basis of

length, cost, date, etc. (IBM, 1998b), or bibliographic data such as author, institution, or country of

origin (Kostoff, 1999). Computational aspects of clustering are reviewed by Witten and Frank

(2000, Section 6.6).

Filtering

E-mail filtering is often mentioned as an example of text mining (e.g., Witten and Frank,

2000). The relevance of related techniques such as name recognition, theme finding, and text

categorization is obvious, and it is even possible to imagine software which modifies its own

filtering criteria by discovering new patterns in the whole e-mail stream. However, I was unable

to find reports of any actual work on such a system.

Belkin and Croft (1992) built a model of information filtering (IF) based on Belkin's

famous anomalous states of knowledge (ASK) model of IR. In a side-by-side comparison, the

two (IF and IR) appear strikingly similar, the biggest difference being the "stable, long-term…

regular information interests" of IF compared to the "periodic… information need or ASK" of

IR. Extending the side-by-side modeling to Bayesian inference networks, the authors arrive at

another striking comparison: the IF network looks exactly like an upside-down IR network! That

is, in IR multiple documents are percolating down to a single user, while in IF each single

incoming document is percolating down to multiple users. However, the authors reject this

analogy for reasons not entirely clear to me.4

Text Visualization

Text visualization shares text mining's goals of using computational transformations to

reduce the cognitive effort of dealing with large text corpora, highlight patterns across

documents, and help discover new knowledge. Text mining implies homing in on "precious

nuggets" whereas text visualization seems to be concerned with the "big picture," but in practice

both may be regarded as elements of a holistic approach to multi-text corpora. The text mining

systems of Hearst, Kostoff, and Liddy all have explicit text visualization components.

Wise (1999) developed a text visualization paradigm for intelligence analysis named

Spatial Paradigm for Information Retrieval and Exploration (SPIRE) "to find a means of

‘visualizing text’ in order to reduce information processing load and to improve productivity" by

representing large numbers of documents to permit "rapid retrieval, categorization, abstraction,

and comparison, without the requirement to read them all." The theory behind SPIRE was that

humans’ most highly evolved perceptual abilities are those involved in interpreting "visual

features of the natural world." Therefore the goal was to represent text as natural, ecological

images from our early hominid past which require no "prolonged training to appreciate and use"

such as star fields or landscapes (Figure 1). This transformation was accomplished using

standard vector space algorithms and involved clustering and text summarization. SPIRE is an

excellent example of how a cognitive theory can be helpful in inspiring IR innovation and

guiding system development, despite its apparent lack of commercial success.5

4 They seem to feel that "P(oj|pi)", the probability that the incoming document will satisfy the information need given a user's filtering profile, is poorly understood compared to the conventional Bayesian need-query-document relationships, but I'm not sure the latter are so well-understood, either.

Text Compression

As mentioned at the beginning, I started this paper by trying to narrow the definition and

scope of text mining by differentiating it from other nontraditional IR strategies (Table 1). One

by one, however, the other strategies refused to be cleanly differentiated, and the foregoing

polyglot review is the result. The only concept I thought I had succeeded in banishing from the

scope of text mining was data compression, which showed up in the title of a single citation in a

literature search performed for me by Melissa Yonteck. Data compression, a la PKZIP, was

surely not related in any meaningful way to text mining, Yonteck and I agreed. Here at last was

something I could confidently rule out.

But on page 334, Witten and Frank (2000), in discussing statistical character-based

models for token classification (names, dates, money amounts, etc.), note that "there is a close

connection with prediction and compression: the number of bits required to compress an item

with respect to a model can be interpreted as the negative logarithm of the probability with which

that item is produced by the model." That is, text compression algorithms might function as

token classifiers in reverse! So I give up. Text mining appears to be related to just about

everything on my original list.

5 Cartia, Inc., which was marketing the ThemeScape™ software (Figure 2, downloaded Fall 2000), no longer has any detectable presence on the Web.
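The connection quoted from Witten and Frank can be illustrated with a toy sketch: if an item's code length under a model is the negative base-2 logarithm of its probability, then the class whose model compresses a token into the fewest bits is also the class most likely to have produced it. Everything specific below is my own assumption for illustration: the character unigram model, the add-one smoothing, the `ALPHABET`, and the function names. Witten and Frank discuss richer character-based models.

```python
import math
import string
from collections import Counter

# Assumed alphabet: the 100 printable ASCII characters. Tokens must use only these.
ALPHABET = string.printable


def train_char_model(samples, alphabet=ALPHABET):
    """Character unigram model with add-one smoothing over a fixed alphabet."""
    counts = Counter(ch for sample in samples for ch in sample)
    total = sum(counts[ch] for ch in alphabet) + len(alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}


def code_length_bits(item, model):
    """Ideal code length in bits: -log2 of the probability the model assigns to the item."""
    return -sum(math.log2(model[ch]) for ch in item)


def classify(item, models):
    """Assign the class whose model would compress the item into the fewest bits."""
    return min(models, key=lambda label: code_length_bits(item, models[label]))
```

Trained on a handful of example dates and surnames, `classify` would label an unseen token by whichever model encodes it more cheaply, which is the sense in which a compressor can act as a token classifier.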

Biomedical Applications

My interest in text mining is motivated primarily by the belief that it can be fruitfully

applied to biomedical literature, specifically the MEDLINE database, to discover new knowledge.

I see text analysis as a major new frontier in bioinformatics, whose smashing success in the area of

gene sequence analysis is based, after all, on nothing more than algorithms for finding and

comparing patterns in the four-letter language of DNA. Swanson's work has focused on

MEDLINE, and Hearst (1999) has also declared a research interest in "automating the discovery of

the function of newly sequenced genes" by determining which novel genes are "co-expressed with

already understood genes which are known to be involved in disease."

Humphreys, Demetriou, and Gaizauskas (2000) used information extraction, defined as

"extracting information about predefined classes of entities and relationships from natural

language texts and placing this information into a structured representation called a template" [is it

therefore template mining?], to build a database of information about enzymes, metabolic

pathways, and protein structure from full text biomedical research articles. The LaSIE (Large

Scale Information Extraction) system includes modules for datatype recognition (names, dates,

etc.), co-reference resolution (pronouns, anaphors, metonyms, etc.), and different types of template

filling. It does linguistic analysis at all levels up to discourse using lexical knowledge,

morphology, and grammars to identify significant words. The enzyme and metabolic pathway

variant of LaSIE is called (of course) EMPathIE and fills the following template fields: enzyme

name, EC (Enzyme Commission) number, organism, pathway, compounds involved and their roles

(substrate, product, cofactor, etc.), and, interestingly, compounds not involved. Optional fields

include concentration and temperature. The PASTA variant deals with protein structure

information such as which amino acid residues occupy given positions, active and binding sites,

secondary structure, subunits, interactions with other molecules, source organism, and SCOP

category. The prototype has been tested on only six journal papers, so it is far from satisfying the

large text corpus requirement for true text mining, but the authors make no such claim.

The U.S. National Institutes of Health (NIH) have also gotten involved. Tanabe, Scherf,

Smith, Lee, Hunter, and Weinstein (1999) developed a system named MedMiner to help them sort

out the thousands of gene expression correlations resulting from microarray experiments6 to

separate "interesting biological stories" from mere epiphenomena and statistical coincidences. The

first module gathers the relevant texts by querying PubMed (MEDLINE) and GeneCards (an

Israeli gene information database) on the expressed genes. [Gene names generally make good

search words because they are different from normal English words, e.g. "JAK3".] The second

module filters the retrieved texts by user-specifiable relevance criteria based on classical proximity

or term frequency scores (NLP criteria being regarded as too computationally expensive). The

third module is a "carefully designed user interface" to facilitate access to the most likely-to-be-

interesting documents.

Despite the name, then, MedMiner is not a true text mining system, but rather a search and

display enhancement to PubMed (which offers only flat Boolean search logic, unranked retrieval,

and no integration with GeneCards, although it is integrated with other gene and protein

databases). Like Kostoff's system, it is designed to deal with highly technical information by

assisting expert users in their traditional IR tasks rather than attempting to automate them

completely. MedMiner is freely available online at http://discover.nci.nih.gov.

6 Basically, a square chip coated with an array of known DNA sequences at known locations on the chip is dipped into a broth containing the expressed messenger RNA (mRNA) from cells under given conditions. The mRNA is labeled so that when it binds to its complementary DNA on the chip the gene expression pattern is revealed. Gifford (2001) briefly reviewed the direct application of data visualization to gene expression data not involving any text.

Another NIH group, Rindflesch, Hunter, and Aronson (1999), developed a true NLP

system named ARBITER for mining molecular binding terms from MEDLINE. ARBITER

attempts to identify noun phrases representing molecular entities such as drugs, receptors,

enzymes, toxins, genes, messenger molecules, etc., and their structural features (box, chain,

sequence, subunit, etc.) likely to be involved in binding. ARBITER makes use of MeSH indexing,

the lexical and semantic knowledge bases of the Unified Medical Language System (UMLS) and

GenBank, co-word adjacency to forms of bind, and a variety of linguistic strategies to deal with

acronyms, anaphors, modifiers, coordinated phrases, and nested phrases (e.g., "…a previously

unrecognized coiled-coil domain within the C terminus of the PKD1 gene product, polycystin, and

demonstrate…"). A test on a small sample (116 abstracts containing a form of bind, one month's

worth from MEDLINE) yielded 72% recall and 79% precision of manually marked binding terms.

While terminology extraction might be considered a fairly trivial form of text mining, it is

obviously a logical step toward the mining of binding relationships (A binds B) which would have

enormous potential for knowledge discovery.
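For readers less familiar with the two evaluation measures reported for ARBITER, the following sketch shows how they are computed from a set of extracted terms and a set of manually marked ("gold") terms. The function name and the terms in the usage note are hypothetical; they are not ARBITER's data.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / manually marked."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

If a system extracts three terms, two of which are among four manually marked terms, precision is 2/3 and recall is 2/4.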

Stapley and Benoit (2000) developed a system named “BioBiblioMetrics” (Stapley,

2000) which uses text visualization to suggest functional clusters of genes from the yeast

Saccharomyces cerevisiae. The system uses a subset of MEDLINE records containing the

yeast's name, a lexical knowledge base of all the known, nontrivial yeast genes and their aliases

from the SGD (Saccharomyces Gene Database), and a matrix of gene name pair co-occurrence

statistics. When one does a search on a gene name or function (e.g. "DNA replication"), the co-

occurring genes are displayed in a graph with “nodes” representing genes and edge lengths

between the nodes representing biological proximity (Figure 2). Nodes are hypertext-linked to

sequence databases, and edges to those MEDLINE documents that generated them, creating a

biomedical information “landscape” and inference network. BioBiblioMetrics is freely available

online at http://www.bmm.icnet.uk/~stapleyb/biobib/.
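The core data structure here, a matrix of gene name pair co-occurrence statistics built with the help of an alias lexicon, can be sketched as below. This is my own minimal reconstruction of the idea, not the BioBiblioMetrics code: the tokenizer, the `aliases` mapping, and the function names are assumptions, and the graph layout that turns counts into edge lengths is omitted.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(records, aliases):
    """Count, for each gene pair, the number of records that mention both genes.

    `aliases` maps every lower-case surface form to a canonical gene name.
    """
    pair_counts = Counter()
    for text in records:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        genes = sorted({aliases[t] for t in tokens if t in aliases})
        pair_counts.update(combinations(genes, 2))
    return pair_counts


def neighbours(gene, pair_counts):
    """Genes that co-occur with `gene`, most frequent first."""
    hits = Counter()
    for (a, b), n in pair_counts.items():
        if gene == a:
            hits[b] += n
        elif gene == b:
            hits[a] += n
    return hits.most_common()
```

A search on one gene then amounts to calling `neighbours` and drawing the returned genes as nodes, with the more frequently co-occurring genes placed closer to the query gene.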

Other MEDLINE text mining papers which I did not have a chance to review in full

involve dictionary-controlled natural language processing for extraction of drug-gene relationships

(Rindflesch, Tanabe, Weinstein, & Hunter, 2000); statistical term strength analysis (Wilbur &

Yang, 1996); statistical text classification and a relational machine-learning method (Craven &

Kumlien, 1999); statistical identification of key phrases against an evolutionary protein family

background (Andrade & Valencia, 1997, 1998); pre-specified protein names and a limited set of

action verbs (Blaschke, Andrade, Ouzounis, & Valencia, 1999); and a proprietary information

extraction system (Thomas, Milward, Ouzounis, Pulman, & Carroll, 2000). Futrelle (2001a)

provides online full-text access to many biomedical text mining papers, including those from the

hard-to-get 2000 and 2001 Pacific Symposia on Biocomputing.

Bob Futrelle (2001a,b) has organized a large "bio-NLP" information network and

enunciated a radical vision which includes several of the themes of this paper, such as the

analogy between text and genome analysis, and the long history of information extraction in its

many guises. He sees the challenge as "understanding the nature of biological text, whatever that

turns out to be, linguistic theories not withstanding." He seems to feel that the traditional rules

and grammars of Chomskian linguistics are more hindrance than help.

Frankly, a fresh new approach is needed, fueled by the conviction that language is a biological phenomenon, not a logical phenomenon. By this we mean that the nature of language is as messy as the genome. The data and observed phenomena in all their richness and variety are dominant and cannot subsumed by any elegant theories. This means that in many ways, biologists have far better hopes of cracking the NLP problem than the computational linguists, who are focused on mathematics and logic. Even when they look at data, it is primarily as grist for their math mills.

Futrelle recommends, for example, building visualization tools such as a protein noun phrase

highlighter which could be used to "assemble a large collection of the standard textual

expression forms [and] map these onto the query forms for which they are the answers."

But Futrelle also goes beyond immediate practical needs. Like Wise (1999), he has a

coherent theory based on the biological nature of language.

By this I mean that language is a communicative capability of living organisms that has evolved from deep biological roots and from social interactions over millions, and ultimately, billions of years. I claim that language is not logical and mathematical, because that's not the nature of the organism (us) that exhibits the language capability. An example of this is found in our vocabularies. A technically skilled adult will have a vocabulary of over 100,000 words, basically all memorized. The meaning of "bear" or "ship" does not follow from the characters that make them up. We simply commit them to memory. Linguists would like us to believe that our natural ability to "parse" is radically different and can be explained as a rule-based system.

My radical view is that we understand language not by generalization to abstract rules as much as by retaining examples and generalizing from them as needed. This is quite within our capacity, given our 100,000 word vocabularies. We also do reason. I would claim, again in the biological view, that this is done more by "imagined life" than by logic. Humans have superb abilities to remember events and to build detailed mental plans for future activities …. So we need to build this type of reasoning into our systems.

The analogy to genomics is clear. The coding of a particular protein by a particular

sequence of DNA bases is just an accident of evolution. Whatever rules now appear to prevail

(such as "zinc fingers" for DNA-binding proteins) can only be derived empirically, by looking

for patterns within the data. Purely logical approaches must wait for a richer knowledge base.

Only now, after the massive effort of half a century of molecular genetic research, sequencing

whole genomes, and building databases and tools such as GenBank, Gene Cards, and Proteome,

can we begin to think about prediction of protein structure and function from sequence data

alone. Biological linguistics now stands at the beginning of a comparably arduous journey.

These considerations put Swanson's, Kostoff's, Tanabe's, and Chowdhury's reliance on

human expertise and manual filtering in a better light. Perhaps they do not represent premature

surrender to difficulty so much as a necessary but hopefully temporary expedient. Perhaps they

are keeping "the human in the loop" (Kantor) only long enough to "study the human to learn

what to put in the machine" (Saracevic, 2001). This surprising interface between biomedical text

mining and the cognitive tradition in IR would make a worthy topic for another paper.

References

Allen, J. (1995). Natural Language Understanding, Second Edition. Redwood City, CA:

Benjamin/Cummings.

Andrade, M. A., & Valencia, A. (1997). Automatic annotation for biological sequences

by extraction of keywords from MEDLINE abstracts. Development of a prototype system.

Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5, 25-32.

Andrade, M. A., & Valencia, A. (1998). Automatic extraction of keywords from

scientific text: application to the knowledge domain of protein families. Bioinformatics,

14(7), 600-607.

Bates, M. (1995). Models of natural language understanding. Proceedings of the

National Academy of Sciences, 92, 9977-9982.

Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval:

Two sides of the same coin? Communications of the ACM, 35, 29-38.

Blaschke, C., Andrade, M. A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction

of biological information from scientific text: protein-protein interactions. Proceedings of

the International Conference on Intelligent Systems for Molecular Biology, pp. 60-67.

Bush, V. (1945). As We May Think. Atlantic Monthly, 176 (11), 101-108.

Cartia, Inc. (2000). ThemeScape product suite. Formerly online:

http://www.cartia.com/products/index.html [no longer accessible].

Chowdhury, G. G. (1999). Template mining for information extraction from digital

documents. Library Trends, 48, 182-208.

Craven, M., & Kumlien, J. (1999). Constructing biological knowledge bases by

extracting information from text sources. Proceedings of the International Conference on

Intelligent Systems for Molecular Biology, pp. 77-86.

Dorre, J., Gerstl, P., & Seiffert, R. (1999). Text mining: Finding nuggets in mountains of

textual data. KDD-99, Association for Computing Machinery.

Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the

Association for Computing Machinery, 8, 223-239.

Fan, W. (2001). Text mining, web mining, information retrieval and extraction from the

WWW references. Online: http://www-personal.umich.edu/~wfan/text_mining.html

Futrelle, R. P. (2001a). Natural language processing of biology texts. Online:

http://www.ccs.neu.edu/home/futrelle/bionlp/

Futrelle, R. P. (2001b). The past, present and future of biology text understanding.

Presented at the Conference on Biological Research with Information Extraction (BRIE), Tivoli

Gardens, Copenhagen, Denmark, July 26. Online:

http://www.ccs.neu.edu/home/futrelle/brie2001/index.html

Gifford, D. K. (2001). Blazing pathways through genetic mountains. Science, 293,

2049-2051.

Greenfield, L. (2001). Text mining. Online: http://www.dwinfocenter.org/docum.html

Hearst, M. (1997). Distinguishing between web data mining and information access.

Presentation for the Panel on Web Data Mining, KDD 97, August 16, Newport Beach, CA.

Online: http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm

Hearst, M. (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th

Annual Meeting of the Association for Computational Linguistics, University of Maryland, June

20-26, 1999 (invited paper). Online: http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html

Hearst, M. (2001). About TextTiling. Online:

http://www.sims.berkeley.edu/~hearst/tiling-about.html

Humphreys, K., Demetriou, G., & Gaizauskas, R. (2000). Bioinformatics applications of

information extraction for scientific journal articles. Journal of Information Science, 26, 75-85.

IBM (1998a). Text analysis tools. Slide #8 of Intelligent Miner for Text Overview.

Online: http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23over/im4t23over8.htm

IBM (1998b). Text mining technology: Turning information into knowledge: A white

paper from IBM. Daniel Tkach (Ed.). Online:

http://www-4.ibm.com/software/data/iminer/fortext/download/whiteweb.pdf

Ingwersen, P., & Willett, P. (1995). An introduction to algorithmic and cognitive

approaches for information retrieval. Libri, 45, 160-177.

Johnson, F. C., Paice, C. D., Black, W. J., & Neal, A. P. (1993). The application of

linguistic processing to automatic abstract generation. Journal of Document and Text

Management, 1, 215-241.

Kantor, P. B. (2001). Lecture K: Natural language concepts. Information Retrieval class,

Rutgers University, School of Communication, Information, and Library Studies, New

Brunswick, NJ.

Kostoff, R. N. (1999). Science and technology innovation. Technovation, 19. Online:

http://www.dtic.mil/dtic/kostoff/Swanson2.txt

Kostoff, R. N., & DeMarco, R. A. (2001). Information extraction from scientific

literature with text mining. Analytical Chemistry (in press). Online:

http://www.onr.navy.mil/sci_tech/special/technowatch/kdocs/anchem2/txt

Kostoff, R. N., del Rio, J. A., Humenik, J. A., Garcia, E. O., & Ramirez, A. M. (2001).

Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of

the American Society for Information Science, 52, 1148-1156.

Kostoff, R. N., Toothman, D. R., Eberhart, H. J., & Humenik, J. A. (2000). Text mining

using database tomography and bibliometrics: A review. Online:

http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm

KRDL (2001). Text mining: transforming raw text into actionable knowledge (white

paper). Kent Ridge Digital Labs. Online: http://textmining.krdl.org.sg/

Laender, A. H. F., Ribeiro-Neto, B., da Silva, A. S., & Teixeira, J. S. (2001). A brief

survey of web data extraction tools. In press.

Liddy, E. D. (2000). Text mining. Bulletin of the American Society for Information

Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/liddy.html

Liddy, E. D. (2001). Data mining, meta-data, and digital libraries. DIMACS Workshop

on Data Analysis and Digital Libraries, May 17, Center for Discrete Mathematics and

Theoretical Computer Science, Rutgers University, New Brunswick, NJ.

Lindsay, R. K., & Gordon, M. D. (1999). Literature-based discovery by lexical statistics.

Journal of the American Society for Information Science, 50, 574-587.

Losee, R. M. (2001a). Natural language processing in support of decision-making:

phrases and part-of-speech tagging. Information Processing and Management, 37, 769-787.

Losee, R. M. (2001b). Term dependence: A basis for Luhn and Zipf models. Journal of

the American Society for Information Science and Technology, 52, 1019-1025.

Losiewicz, P., Oard, D. W., & Kostoff, R. N. (2000). Textual data mining to support

science and technology management. Online:

http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of

Research and Development, 2, 159-165.

Marcus, M. (1995). New trends in natural language processing: Statistical natural

language processing. Proceedings of the National Academy of Sciences, 92, 10052-10059.

Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State of the art and

future prospects. Science, 293, 2051-2055.

Perez-Carballo, J., & Strzalkowski, T. (2000). Natural language information retrieval:

Progress report. Information Processing and Management, 36, 155-178.

Perrin, P. (2001). Personal communication, Molecular Systems research group, Merck &

Co., Inc., Rahway, NJ.

Qin, J. (2000). Working with data: Discovering knowledge through mining and analysis.

Bulletin of the American Society for Information Science, 27. Online:

http://www.asis.org/Bulletin/Oct-00/qin.html

Rau, L. F. (1988). Conceptual information extraction and retrieval from natural language

input. In RIAO 88, pp. 424-437. Paris: Centre des Hautes Etudes Internationales d'Informatique

Documentaire, 1997, General Electric, USA.

Rindflesch, T. C., Hunter, L., & Aronson, A. R. (1999). Mining molecular binding

terminology from biomedical text. Proceedings of the American Medical Informatics

Association Symposium, 1999, 127-131. Online:

http://www.amia.org/pubs/symposia/D005564.PDF

Rindflesch, T. C., Tanabe, L., Weinstein, J. N., & Hunter, L. (2000). EDGAR: extraction

of drugs, genes and relations from the biomedical literature. Pacific Symposium on

Biocomputing, 2000, 517-528.

Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper

Saddle River, NJ: Prentice Hall.

Salton, G. (1992). The state of retrieval systems evaluation. Information Processing and

Management, 28, 441-449.

Salton, G., Allan, J., Buckley, C., & Singhal, A. (1994). Automatic analysis, theme

generation, and summarization of machine-readable texts. Science, 264, 1421-1426.

Saracevic, T. (2001). Personal communication and class discussions, Seminar in

Information Studies, Rutgers University, School of Communication, Information and Library

Studies, New Brunswick, NJ.

SDM (2001). Text mining 2002 [workshop prospectus]. Second SIAM International

Conference on Data Mining, Arlington, VA, April 13, 2002. Online:

http://www.cs.utk.edu/tmw02/

Sneiderman, C. A., Rindflesch, T. C., & Aronson, A. R. (1996). Finding the findings:

identification of findings in medical literature using restricted natural language processing.

Proceedings of the American Medical Informatics Association Annual Fall Symposium, 1996,

239-243.

Stapley, B. J. (2000). BioBiblioMetrics. Online:

http://www.bmm.icnet.uk/~stapleyb/biobib/

Stapley, B. J., & Benoit, G. (2000). Biobibliometrics: information retrieval and

visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on

Biocomputing, 2000, 529-540.

Swanson, D. R. (1988). Historical note: Information retrieval and the future of an

illusion. Journal of the American Society for Information Science, 39, 92-98.

Swanson, D. R., & Smalheiser, N. R. (1997). An interactive system for finding

complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91, 183-203.

Swanson, D. R., & Smalheiser, N. R. (1999). Implicit text linkages between Medline

records: Using Arrowsmith as an aid to scientific discovery. Library Trends, 48, 48-51.

Swanson, D. R., Smalheiser, N. R., & Bookstein, A. (2001). Information discovery from

complementary literatures: Categorizing viruses as potential weapons. Journal of the American

Society for Information Science and Technology, 52, 797-812.

Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L., & Weinstein, J. N. (1999).

MedMiner: An Internet text-mining tool for biomedical information, with application to gene

expression profiling. BioTechniques, 27, 1210-1217.

Thomas, J., Milward, D., Ouzounis, C., Pulman, S., & Carroll, M. (2000). Automatic

extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing,

2000, 541-552.

Wilbur, W. J., & Yang, Y. (1996). An analysis of statistical term strength and its use in

the indexing and retrieval of molecular biology texts. Computers in Biology and Medicine,

26, 209-222.

Wise, J. A. (1999). The ecological approach to text visualization. Journal of the

American Society for Information Science, 50, 1224-1233.

Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and

Techniques with Java Implementations. San Francisco: Morgan Kaufmann (Academic Press).

Table 1.

Initial List of Information Retrieval (IR) Concepts Related to Text Mining.

IR Concept                            Authority (see References)
Artificial Intelligence               Fan; Perrin
Bioinformatics                        Futrelle; Perrin
Citation Mining                       Kostoff
Computational Linguistics             Fan; Hearst
Conceptual Graphs                     KRDL
Data Abstraction                      Fan
Data Mining                           Fan; Perrin; SDM
Database Tomography                   Kostoff
Document Mining                       Fan
Domain Knowledge                      KRDL
Electronic Commerce                   Fan
Factor Analysis                       SDM
Information Access                    Hearst
Information Extraction                Chowdhury; Fan; Futrelle; Kostoff; Perrin
Information Filtering                 Fan
Information Integration               Fan
Information Retrieval                 Fan; Perrin
Information Visualization/Mapping     Futrelle; Fan; SDM
Intelligent Agents ("bots")           Fan
Knowledge Discovery                   Fan
Knowledge Extraction                  Perrin
Knowledge Representation              Perrin
Language Identification               IBM
Machine Learning                      Fan; Futrelle; Perrin
Metadata Generation                   SDM
Natural Language Processing           Fan; Futrelle; Perrin; Rindflesch; Saracevic
Ontologies/Vocabularies/Lexicons      Futrelle
Phrase Extraction                     Fan
Question Answering                    Futrelle
Resource Discovery                    Fan
Resource Indexing                     Fan
Semantic Modeling                     Perrin; SDM
Semantic Processing                   Rindflesch
Statistical Language Modeling         Fan
Stemming                              SDM
Syntactic Processing                  Saracevic
Template Mining                       Chowdhury; KRDL
Text Analysis                         Futrelle; IBM
Text Classification/Categorization    Fan; Hearst (distinct); IBM; SDM
Text Clustering                       Fan; IBM
Text Data Mining                      Hearst; Kostoff
Text Parsing                          SDM
Text Purification                     SDM
Text Segmentation/"TextTiling"        Hearst; SDM
Text Summarization                    Futrelle; IBM; Saracevic; SDM
Text Understanding                    Futrelle; Fan
Web Data Mining                       Hearst
Web Mining                            Fan
Web Utilization Mining                Fan

Figure 1. ThemeScape™ visualization of a collection of 4,314 Y2K debate forum documents (Cartia, 2000, expired website).

Figure 2. BioBiblioMetrics retrieval from a search on “DNA repair” and “recombination” (Stapley, 2000).