The Detection of Emerging Concepts in Constructive ...cimel/designDocs/textmining.doc · Web viewTechnology forecasting is another example with numerous applications of both academic

Detecting Emerging Concepts in Textual Data Mining

William M. Pottenger, Ph.D. and David R. Gevry

Lehigh University

Recent advances in computer technology are fueling radical changes in the nature of information

management. Increasing computational capacities coupled with the ubiquity of networking have

resulted in widespread digitization of information, thereby creating fundamentally new

possibilities for managing information. One such opportunity lies in the budding area of textual

data mining. With roots in the fields of statistics, machine learning and information theory, data

mining is emerging as a field of study in its own right. The marriage of data mining techniques

to applications in textual information management has created unprecedented opportunity for the

development of automatic approaches to tasks heretofore considered intractable. This document

briefly summarizes our research to date in the automatic identification of emerging trends in

textual data. We also discuss the integration of trend detection in the development of

constructive, inquiry-based multimedia courseware.

The process of detecting emerging conceptual content that we envision is analogous to the

operation of a radar system. A radar system assists in the differentiation of mobile vs. stationary

objects, effectively screening out uninteresting reflections from stationary objects and preserving

interesting reflections from moving objects. In the same way, our proposed techniques will

identify regions of semantic locality in a set of collections and ‘screen out’ topic areas that are

stationary in a semantic sense with respect to time. As with a radar screen, the user of our

proposed prototype must then query the identified ‘hot topic’ regions of semantic locality and

determine their characteristics by studying the underlying literature automatically associated with

each such ‘hot topic’ region.

Applications of trend detection in textual data are numerous: the detection of such trends in

warranty repair claims, for example, is of genuine interest to industry. Technology forecasting is

another example with numerous applications of both academic and practical interest. In general,

trending analysis of textual data can be performed in any domain that involves written records of

human endeavors whether scientific or artistic in nature.

Trending of this nature is primarily based on human-expert analysis of sources (e.g., patent,

trade, and technical literature) combined with bibliometric techniques that employ both semi- and

fully automatic methods [White and McCain 1989]. Automatic approaches have not focused on

the actual content of the literature primarily due to the complexity of dealing with large numbers

of words and word relationships. With advances in computer communications, computational

capabilities, and storage infrastructure, however, the stage is set to explore complex

interrelationships in content as well as links (e.g., citations) in the detection of time-sensitive

patterns in distributed textual repositories.

Semantics are, however, difficult to identify unambiguously. Computer algorithms deal with a

digital representation of language – we do not have a precise interpretation of the semantics. The

challenge thus lies in mapping from this digital domain to the semantic domain in a temporally

sensitive environment. Our approach to solving this problem attaches semantics to a

statistical abstraction of relationships that change with time in literature databases.

Our research objective is to design, implement, and validate a prototype for the detection of

emerging content through the automatic analysis of large repositories of textual data. In this

project in particular, we are interested in applying trend detection algorithms as a textual data

mining tool that will aid students in learning through constructive exercises.

The following steps are involved in the process: concept identification/extraction; concept co-

occurrence matrix formation; knowledge base creation; identification of regions of semantic

locality; the detection of emerging conceptual content; and a visualization depicting the flow of

topics through time. For details on our approach, please see [Pottenger and Yang 2000] and

[Bouskila and Pottenger 2000].
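As a rough illustration of two of the steps above (co-occurrence matrix formation and the detection of emerging content), the following is a minimal sketch only. It is not the algorithm from the cited papers. It assumes concepts have already been extracted into one list per document, and the function names, the growth threshold `min_growth`, and the add-one smoothing are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count how often each unordered pair of concepts shares a document."""
    counts = Counter()
    for concepts in docs:
        # De-duplicate and sort so each pair is counted once per document
        # and always appears in the same order.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def emerging_pairs(earlier_docs, later_docs, min_growth=2.0):
    """Return concept pairs whose co-occurrence grew by at least min_growth."""
    before = cooccurrence(earlier_docs)
    after = cooccurrence(later_docs)
    hot = []
    for pair, n_after in after.items():
        n_before = before.get(pair, 0)
        # Add-one smoothing so pairs unseen in the earlier period
        # do not cause a division by zero.
        if (n_after + 1) / (n_before + 1) >= min_growth:
            hot.append(pair)
    return sorted(hot)


# Illustrative data only: two time slices of already-extracted concepts.
earlier = [["data mining", "statistics"], ["data mining", "statistics"]]
later = [["data mining", "statistics"],
         ["data mining", "text"],
         ["data mining", "text"]]
hot_topics = emerging_pairs(earlier, later)
```

In this toy data the pair that appears only in the later slice is flagged, while the pair whose count falls is screened out, which mirrors the radar analogy of ignoring 'stationary' topics.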

The integration of our Hot Topics Data Mining System in constructive, inquiry-based

multimedia requires sophisticated lesson tracking and context construction mechanisms that are

described in more detail below.

Lesson Tracking and Context Enhancement

The research that is being done in this area is twofold. The first focus of this project is to track

users as they move through the lessons and determine how individual users as well as users as a

group approach the lessons. The goal is to use individual users’ contexts to enhance their

performance when conducting constructive, inquiry-based learning exercises that employ the Hot

Topics Textual Data Mining System to uncover trends in a given field of study.

The motivation for this tracking research comes from our current work with user profiling based

on temporal aspects of web access: how often a user visits a page and how long they stay on that

page. The goal of the research is to link users’ temporal data with the semantic data of the

documents that they view. This temporal link will allow us to automatically filter a model of the

user’s interests based on their history of access to the material. The first step in this research is

thus to gather source data for individual user access.
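The two temporal measures mentioned above (how often a user visits a page and how long they stay) can be sketched as a simple aggregation. This is an illustrative sketch under stated assumptions: the event layout `(user, page, enter_seconds, leave_seconds)` and the function name `temporal_profile` are hypothetical and are not part of the system described here.

```python
from collections import defaultdict


def temporal_profile(events):
    """Aggregate visit counts and total dwell time per (user, page).

    events: iterable of (user, page, enter_seconds, leave_seconds) tuples.
    """
    profile = defaultdict(lambda: {"visits": 0, "seconds": 0.0})
    for user, page, enter, leave in events:
        entry = profile[(user, page)]
        entry["visits"] += 1
        # Guard against out-of-order timestamps producing negative time.
        entry["seconds"] += max(0.0, leave - enter)
    return dict(profile)


# Illustrative data only.
sample_events = [
    ("u1", "lesson1", 0.0, 40.0),
    ("u1", "lesson1", 100.0, 130.0),
    ("u1", "lesson2", 130.0, 150.0),
]
profile = temporal_profile(sample_events)
```

A profile of this shape could then be joined with the semantic data of each page, which is the link the research aims to establish.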

Initially this user profiling research began by examining server web logs in order to profile

individual users – unfortunately this data did not work for our purposes. The logs that we were

using did not contain enough information to distinguish individual users. For example, given an

IP address it is hard to determine whether the user is a distinct person or a number of users who

are using the same address (e.g., a proxy server). In the logs we studied¹ the reported value for

the operating system changed for an individual address in many cases. IP address look-up of

these addresses revealed that the majority of the addresses were proxy servers or similar

gateways, hence invalidating them as individual users for our purposes.
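The check described above, where one address reports more than one operating system, can be sketched as follows. This is a minimal sketch and not the analysis actually performed: it assumes the log has already been parsed into `(ip_address, reported_os)` pairs, and the function name is hypothetical. The addresses in the example come from the reserved documentation range.

```python
from collections import defaultdict


def suspected_shared_addresses(records):
    """Return addresses that reported more than one operating system.

    records: iterable of (ip_address, reported_os) pairs.
    """
    seen = defaultdict(set)
    for ip, os_name in records:
        seen[ip].add(os_name)
    # More than one distinct OS suggests a proxy or gateway rather than
    # a single user, so the address cannot be treated as an individual.
    return sorted(ip for ip, systems in seen.items() if len(systems) > 1)


# Illustrative data only.
sample_records = [
    ("192.0.2.1", "Windows"),
    ("192.0.2.1", "MacOS"),
    ("192.0.2.7", "Linux"),
    ("192.0.2.7", "Linux"),
]
shared = suspected_shared_addresses(sample_records)
```

A single reported operating system does not prove that an address belongs to one person, so a heuristic like this can only rule addresses out, not confirm them.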

Another reason these log files were not useful for the research was the sparseness of

individual user access. Users did not seem to frequent the site for very long, and during the two-

week period of time we chose, few users made repeat visits to the site. Below is a chart that

depicts user access in a continuous five-day period. In order for temporal user profiling to be

effective there must be sufficient data to characterize the user’s browsing activities. The web

server logs, however, cannot provide us with this type of data because users do not frequent the

site enough to yield adequate temporal data.

1 Our logs were drawn from a two-week period of access to www.ncsa.uiuc.edu

These factors, compounded by the uncertainty of identifying individual users, caused us to abandon

these logs as a viable source of data for our research.

In response to this issue we devised an approach to track the usage of multimedia courseware.

We believe that tracking lessons in this way will yield a larger source of individual user

data. The nature of the lessons themselves will promote user access, and we will be able to track

these individual users as they progress through the lessons. Although this data will not be

representative of individual user web access, it will provide us with interesting representations of

how students use and access the data as well as possible temporal relationships with user interest.

Additionally, it will assist in locating spots inside the lessons themselves where individuals or

groups of users spend a significant amount of time, and this will allow us to determine possible

points of interest or confusion in understanding the material. The users will have round-the-

clock access to the lessons and therefore tracking individual access will give us a better picture

of a user’s interest and what parts of the lessons the user found useful in studying the material,

working homework exercises, studying for exams, etc.

Our focus will be to generate temporally sensitive contexts specific to individual users, and to

boost the performance of the Hot Topics Textual Data Mining System using these contexts. By

tracking the user we plan to build the context within which they are conducting constructive,

inquiry-based learning exercises that employ the Hot Topics Textual Data Mining System. The

data will be used to generate time-sensitive contexts to focus the nature of the detection of

emerging topics in the field being studied. Time-sensitive contexts will be compared to an

unmodified general context for the course to see if focusing on what the user has examined in a

given timeframe is more effective in identifying ‘hot topics’ relevant to the constructive, inquiry-

based exercises. To aid the Hot Topics Textual Data Mining System we will generate a

repository of documents related to the topic area under study. This will give a more focused

conceptual space from which to draw when performing ‘hot topics’ detection.
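One possible way to picture the comparison between a time-sensitive context and the unmodified general context is sketched below. This is an assumption-laden illustration, not the planned design: weighting concepts by dwell time, the `(page, viewed_at, seconds)` view layout, the `page_concepts` mapping, and both function names are hypothetical.

```python
from collections import Counter


def time_sensitive_context(views, page_concepts, start, end):
    """Weight concepts by seconds spent on pages viewed in [start, end).

    views: iterable of (page, viewed_at, seconds) for a single user.
    page_concepts: mapping from page to the concepts it covers.
    """
    context = Counter()
    for page, viewed_at, seconds in views:
        if start <= viewed_at < end:
            for concept in page_concepts.get(page, ()):
                context[concept] += seconds
    return context


def general_context(page_concepts):
    """Count every concept in the course, independent of any user or time."""
    context = Counter()
    for concepts in page_concepts.values():
        context.update(concepts)
    return context


# Illustrative data only.
concepts_by_page = {
    "lesson1": ["stack", "queue"],
    "lesson2": ["queue", "tree"],
    "lesson3": ["graph"],
}
user_views = [("lesson1", 10.0, 30.0), ("lesson2", 50.0, 20.0),
              ("lesson1", 200.0, 5.0)]
focused = time_sensitive_context(user_views, concepts_by_page, 0.0, 100.0)
baseline = general_context(concepts_by_page)
```

Either context could then be used to restrict or re-rank the concepts considered during 'hot topics' detection, which is the comparison the text proposes to evaluate.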

Finally, though not directly related to this project, an individual proxy system will be

implemented to gain a second, more general dataset of user profiles. This will involve routing

participants’ web browsers through a proxy server placed on their machine. The log files

generated by this system will have the benefit of being specific to a user and will give a better

picture of user browsing patterns and interests in a temporal sense.

Given below is a use-case diagram for the Lesson Tracking and Hot Topics Textual Data Mining

System as well as a timeline for our research. The use-case diagram shows the pathways

through the proposed system we will design. The user’s actions will be tracked through

JavaScript functions that will communicate to our database through a CGI script. The database

will also contain additional content information for the lessons and this will be combined with

the temporal data to form a temporally sensitive contextual model that will be used by the Hot

Topics Textual Data Mining System to augment its performance. The timeline is broken into

separate timelines for the Lesson Tracking and Hot Topics Textual Data Mining System, proxy

tracking, and collection development and management for the repository encompassing the field

of study.
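The tracking path described above (JavaScript functions reporting to a database through a CGI script) could look roughly like this on the server side. This is a sketch under stated assumptions rather than the actual implementation: the choice of SQLite, the table layout, the query-string field names (`user_id`, `page_id`, `event`, `seen_at`), and the function names are all hypothetical. A real CGI script would read the query string from its environment instead of taking it as an argument.

```python
import sqlite3
from urllib.parse import parse_qs


def open_store(path=":memory:"):
    """Open (or create) the hypothetical tracking database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(user_id TEXT, page_id TEXT, event TEXT, seen_at REAL)"
    )
    return db


def record_event(db, query_string):
    """Parse one tracking request and store it as a row."""
    fields = parse_qs(query_string)
    row = (
        fields["user_id"][0],
        fields["page_id"][0],
        fields["event"][0],
        float(fields["seen_at"][0]),
    )
    # Parameter binding keeps user-supplied values out of the SQL text.
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)", row)
    db.commit()
    return row


# Illustrative request, as the browser-side script might send it.
store = open_store()
stored = record_event(
    store, "user_id=u1&page_id=lesson3&event=enter&seen_at=12.5"
)
```

Pairing 'enter' and 'leave' events per user and page would then give the dwell times needed for the temporally sensitive contextual model.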
