International Journal of Exploring Emerging Trends in Engineering (IJEETE)
Vol. 02, Issue 01, JAN, 2015 WWW.IJEETE.COM
ISSN 2394-0573 All Rights Reserved 2014 IJEETE Page 5
TOPIC DETECTION AND TRACKING USING WEB MINING
Nain Kanwal Kaur
Assistant Professor, Department of Computer Science and Engineering,
Continental Institute of Engineering and Technology, Jalvehra, Punjab, India
Abstract:- Web mining is the application
of data mining techniques to discover patterns
from the Web. Topic tracking is one of the
technologies that have been developed for use
in the text mining process. The main
purpose of topic tracking is to identify and
follow events presented in multiple news
sources, including newswires, radio and TV
broadcasts. This paper presents a survey of
topic tracking techniques.
Keywords- Text mining, topic detection, topic
tracking
I INTRODUCTION
The World Wide Web (WWW) is a popular and
interactive medium, and the amount of data and
information available on it is growing rapidly.
It is a collection of
documents, text files, images, and other forms
of data in structured, semi-structured and
unstructured form. The primary aim of web
mining is to extract useful information and
knowledge from the Web.
[Diagram: Raw Data -> Patterns -> Knowledge]
Web mining is used to capture relevant
information, create new knowledge out of
relevant data, personalize information
and learn about consumers or individual
users, among other purposes. Web mining can be
divided into three categories depending on the
type of data mined:
(i) Web usage mining,
(ii) Web content mining and
(iii) Web structure mining.
II WEB CONTENT MINING
Web content mining is the mining, extraction
and integration of useful data, information and
knowledge from Web page content. Heterogeneity
and a lack of structure permeate many of the
ever-expanding information sources on the
World Wide Web. Research activities in this
field also involve techniques from other
disciplines such as Information Retrieval (IR)
and Natural Language Processing (NLP) [12].
III TEXT MINING
Text mining is a new area of computer science
that has strong connections with natural
language processing, data mining, machine
learning, information retrieval and knowledge
management. Several approaches exist for the
identification of patterns, including automated
classification and clustering [14]. The field of
text mining has received a lot of attention due to
the ever-increasing need to manage the
information that resides in the vast number of
available documents [16].
Figure 1. Typical text mining process [6]
IV TOPIC DETECTION AND TRACKING (TDT)
Topic detection and tracking (TDT) applications
aim to organize the temporally ordered stories
of a news stream according to the events they
report. A topic tracking system works by keeping
user profiles and, based on the documents the user
views, predicting other documents of interest to
the user [5]. The task of topic tracking is to
monitor a stream of news stories and find
those that discuss the same topic as a few
positive sample stories [20]. It gathers dispersed
information together and makes it easy for the user
to get a general understanding [11]. Topic
tracking can be applied in many areas
of industry. It can be used to alert companies
whenever a competitor is in the news. It could
also be used in the medical industry by doctors
and other people looking for new treatments.
The tasks of TDT can be summarized as:
1. The Topic Tracking Task: The TDT topic
tracking task is defined to be the task of
associating incoming stories with topics that
are known to the system. A topic is
known by its association with stories that
discuss it. Thus each target topic is defined
by one or more stories that are on the topic.
To support this task, a small set of on-topic
training stories is identified for each topic to
be tracked.
2. The Supervised Adaptive Tracking Task: An
optional variant of the topic tracking
task is supervised adaptive tracking. This
task is identical to the topic tracking task
except that, for each story judged to be
on-topic, the relevance judgment for that story
is then made available, allowing supervised
adaptation during tracking.
3. The New Event Detection Task: The TDT new
event detection task is defined to be the
task of detecting, in a chronologically
ordered stream of stories from multiple
sources (and in multiple languages), the first
story that discusses an event.
4. The Link Detection Task: The TDT link
detection task is defined to be the task of
determining whether two stories discuss the
same topic. Thus, the system must embody
an understanding of what a topic is, and this
understanding must be independent of topic
specifics.
Figure 2. Architecture of a topic tracking system [13]
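The topic tracking task described above can be illustrated with a minimal sketch. It assumes a bag-of-words representation, cosine similarity and a fixed decision threshold; none of these choices comes from the surveyed systems, and the function names (term_vector, build_topic_profile, track) and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one story (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_topic_profile(training_stories):
    """Sum the term vectors of the on-topic training stories into one profile."""
    profile = Counter()
    for story in training_stories:
        profile.update(term_vector(story))
    return profile


def track(profile, story, threshold=0.2):
    """Return True (on-topic) when the story is close enough to the profile.

    The threshold of 0.2 is an arbitrary illustrative value.
    """
    return cosine(profile, term_vector(story)) >= threshold


# A topic is defined by a small set of on-topic training stories.
training = [
    "Earthquake strikes coastal city, rescue teams search for survivors",
    "Rescue teams continue search after earthquake damages coastal city",
]
profile = build_topic_profile(training)

on_topic = track(profile, "Survivors found as earthquake rescue search continues")
off_topic = track(profile, "Central bank raises interest rates again")
print(on_topic, off_topic)
```

In the supervised adaptive variant, each story judged on-topic could additionally be folded into the profile with profile.update(term_vector(story)) once its relevance judgment becomes available.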
V LITERATURE REVIEW
The event detection problem is a part of topic
detection and tracking (TDT). A topic is a
seminal activity or event together with all
associated events. An event is an occurrence
reported at a particular time and place, along
with its consequences. It is defined by a list of
stories that discuss that single event. A new-event
story is one that discusses an event that has
not been reported in previous stories.
Real-time detection of events and discovery
of their evolution should be explored to present
news stories more effectively.
James Allan, Ron Papka and Victor Lavrenko
[1] performed event detection using a clustering
algorithm and a threshold model. The major
components of the model are the properties of
an event. For event tracking, filtering methods
are deployed. Event detection strictly follows an
online setting, i.e., it processes one news
story at a time. The proposed work
encompasses properties of event identity, which
determine whether two events are the same. A
system incorporating the event identity
properties performs new event detection by
comparing the newly arrived story in the stream
with the existing ones. The algorithm used for
new event detection is a modified version of
single-pass clustering.
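As an illustration, plain single-pass clustering with a similarity threshold can be sketched as follows. This is the basic technique only, not the modified version used in [1]; the bag-of-words representation, the cosine measure, the threshold value and the function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one story."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_new_events(stories, threshold=0.3):
    """Process stories one at a time, in arrival order.

    A story is flagged as a new event when it is not similar enough to any
    existing cluster; otherwise it is merged into its closest cluster.
    The threshold of 0.3 is an arbitrary illustrative value.
    """
    clusters = []  # each cluster is the summed term vector of its stories
    flags = []
    for story in stories:
        vec = term_vector(story)
        best_index, best_score = -1, 0.0
        for index, centroid in enumerate(clusters):
            score = cosine(vec, centroid)
            if score > best_score:
                best_index, best_score = index, score
        if best_index < 0 or best_score < threshold:
            clusters.append(Counter(vec))  # start a new cluster
            flags.append(True)
        else:
            clusters[best_index].update(vec)  # absorb into existing cluster
            flags.append(False)
    return flags


stream = [
    "Flood waters rise in river town",
    "River town evacuated as flood waters rise",
    "Parliament votes on new budget bill",
]
flags = detect_new_events(stream)
print(flags)
```

Because each story is compared only with clusters built from earlier stories, the procedure respects the online setting of one story at a time.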
Yiming Yang, Jaime Q. Carbonell, Ralf D.
Brown, Thomas Pierce, Brian Archibald, and
Xin Liu [19] proposed topic detection and
tracking (TDT) methods to devise an intelligent
system that automatically detects novel events in
large volumes of news stories. The system
accepts news stories from various TV channels
and radio broadcasts as input. The subtasks of
TDT include segmentation of speech-recognized
input into news stories, detection of
events in the segmented news streams, and tracking
of events of interest to the user. The event
detection task is unsupervised and takes two forms
[18]:
1. Retrospective detection.
2. Online detection.
Hassan Sayyadi, Matthew Hurst and Alexey
Maykov [15] presented an algorithm for new
event detection, which detects events by
creating a keyword graph and applying community
detection methods. Events are characterized by
a set of keywords. Keywords, which include named
entities, are extracted from the news articles.
The key factor is the dependency
between the extracted keywords. More than one
event can be denoted by the same set of terms,
causing ambiguity. Thus a graph, called the key
graph, is constructed from the extracted keywords.
Each node in the graph represents a keyword,
whereas an edge represents co-occurrence of
two keywords in multiple documents. The
proposed algorithm performs three tasks,
namely building the key graph, community
detection and document clustering. For each
keyword, term frequency (TF), document
frequency (DF) and inverse document
frequency (IDF) values are computed to
determine its relevance and association with
other keywords. A node is removed if the
keyword has low document frequency. An edge
is removed if the co-occurrence of its keywords is
below some threshold value.
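The graph-building and pruning steps can be sketched roughly as below. The sketch assumes keywords have already been extracted per document, uses hypothetical thresholds (min_df, min_cooccurrence), and substitutes connected components for a real community detection method, so it should be read as an illustration of the idea rather than the algorithm of [15]. The document clustering task is not covered.

```python
from collections import Counter
from itertools import combinations


def build_key_graph(doc_keywords, min_df=2, min_cooccurrence=2):
    """Build an undirected keyword graph as {keyword: set of neighbours}.

    Nodes with document frequency below min_df are dropped, and an edge is
    kept only if its two keywords co-occur in at least min_cooccurrence
    documents. Both thresholds are illustrative assumptions.
    """
    doc_sets = [set(keywords) for keywords in doc_keywords]
    df = Counter(keyword for doc in doc_sets for keyword in doc)
    kept = {keyword for keyword, count in df.items() if count >= min_df}

    pair_counts = Counter()
    for doc in doc_sets:
        for a, b in combinations(sorted(doc & kept), 2):
            pair_counts[(a, b)] += 1

    graph = {keyword: set() for keyword in kept}
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group keywords into components (a crude stand-in for communities)."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


docs = [
    ["earthquake", "rescue", "city"],
    ["earthquake", "rescue", "aid"],
    ["election", "vote", "city"],
    ["election", "vote", "ballot"],
]
graph = build_key_graph(docs)
# Keep only groups of at least two keywords as candidate events.
communities = [c for c in connected_components(graph) if len(c) > 1]
print(communities)
```

In this toy input the keyword "city" appears in both events; its edges fall below the co-occurrence threshold, so it ends up isolated instead of merging the two keyword groups.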
Wei Chen, Chun Chen, Li-Jun Zhang, Can
Wang and Jia-Jun Bu [3] monitor the news
stream for a predefined duration to identify
bursty events. An event is represented using
features (i.e., keywords). A bursty event comprises
bursty features, whose frequency increases when the
corresponding event occurs. The steps involved
are identifying bursty features in the current
window for different periods, grouping the
detected bursty features to form bursty
events, each associated with a power
value corresponding to its bursty level, and
discovering the evolution of events. Bursty
features are identified using an online
multi-resolution burst detection (OMRBD) algorithm.
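To make the notion of a bursty feature concrete, the following sketch flags a keyword as bursty when its count in the current window clearly exceeds its historical average. This is a simple mean-plus-deviation rule chosen for illustration; it is not the OMRBD algorithm, and the parameters k and min_count as well as the "power" score are assumptions.

```python
from statistics import mean, pstdev


def bursty_features(history, current, k=2.0, min_count=3):
    """Return {feature: burst strength} for the current window.

    history maps each feature to its counts in earlier windows; current maps
    each feature to its count in the current window. A feature is bursty when
    its current count is at least min_count and exceeds the historical mean
    by more than k standard deviations (both values are illustrative).
    """
    bursty = {}
    for feature, count in current.items():
        past = history.get(feature, [])
        if past:
            baseline, spread = mean(past), pstdev(past)
        else:
            baseline, spread = 0.0, 0.0  # unseen feature: no baseline yet
        if count >= min_count and count > baseline + k * spread:
            # Illustrative strength: how far the count rises above baseline.
            bursty[feature] = count - baseline
    return bursty


history = {"earthquake": [1, 0, 2, 1], "weather": [5, 6, 5, 6]}
current = {"earthquake": 12, "weather": 6, "tsunami": 4}
bursts = bursty_features(history, current)
print(bursts)
```

A frequent but stable keyword such as "weather" is not flagged, while a keyword whose frequency jumps in the current window is.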
Giridhar Kumaran and James Allan [8] perform
new event detection (NED), which involves
monitoring the news stream to identify stories
that report on a new event. In this work, NED is
treated as a binary classification problem. Each
news story has three representations built on the
basis of named entities. Named entities are used
because the occurrence of a new
event does not follow a pattern and is almost
instantaneous. Named
entities such as person, location and organization
are identified. When two stories depict the same
event, their named entities and topic terms
will be similar.
W. Lam, H. M. L. Meng, K. L. Wong and J. C. H.
Yen [9] presented a method called contextual
analysis for event detection in a continuous
stream of newswire stories. The proposed
method does not depend only on keywords for
describing an event, but also takes into account
concept terms, named entities such as person,
location and organization, and story terms.
The information obtained from these terms,
along with their weights, is used for event
detection. The event detection model is composed
of three components:
1. Similarity calculation component.
2. Grouping the relevant elements by means of
agglomerative clustering.
3. Event identification.
James Allan, Victor Lavrenko and Margaret E.
Connell [2] described the purpose of new event
detection as finding the point where the system
must decide to start a new cluster. The new
event evaluation focuses entirely on whether or
not a system finds the triggers of new topics and
ignores what happens within the topics. The
approach evaluates the tasks within topic
detection and tracking (TDT) using a signal
detection methodology. A TDT system
produces binary YES/NO judgments for every
story in a stream.
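Under a signal detection view, the YES/NO judgments are compared with the true labels and summarized by two error rates: misses (targets answered NO) and false alarms (non-targets answered YES). The sketch below computes both; the function name and the toy data are assumptions for illustration and are not taken from [2].

```python
def detection_error_rates(decisions, truth):
    """Return (miss rate, false alarm rate) for binary YES/NO decisions.

    decisions and truth are parallel lists of booleans, where True means
    YES for a decision and "is a target" for the ground truth.
    """
    if len(decisions) != len(truth):
        raise ValueError("decisions and truth must have the same length")
    targets = sum(truth)
    non_targets = len(truth) - targets
    misses = sum(1 for d, t in zip(decisions, truth) if t and not d)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_fa


# Toy stream of five stories: two are true first stories of new events.
truth = [True, False, False, True, False]
decisions = [True, True, False, False, False]
p_miss, p_fa = detection_error_rates(decisions, truth)
print(p_miss, p_fa)
```

Reporting the two rates separately makes the trade-off visible: lowering the decision threshold reduces misses at the price of more false alarms.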
Wang Xiaowei, Jiang Longbin, Ma Jialin and
Jiangyan proposed an improved
approach for topic tracking [17]. They proposed
a multi-vector model that extracts NER (named
entity recognition) features from the text and
places them in a separate vector. It
first selects the features and classifies them in
accordance with the characteristics of different
tasks, then calculates the vectors, and finally
selects the combination of models and optimizes
the parameters.
Xianfei Zhang, Zhigang Guo and Bicheng Li
proposed a new method for news topic tracking
[20]. The LSI-SVM (Latent Semantic Indexing-
Support Vector Machine) method makes an in-depth
analysis of the co-occurrence of words
and provides a way of dealing with synonymy
automatically. It is based on the assumption that
there is an underlying, or latent, structure in the
pattern of word usage across documents.
Figure 3. NER System Architecture [17]
The keyword extraction technique can be used
for tracking topics over time. Keywords are
the set of significant words in an article that
give a high-level description of its contents to
readers. However, manual keyword extraction is an
extremely difficult and time-consuming task.
This problem has been addressed by Sungjick
Lee and Han-joon Kim [10], who proposed an
automatic unsupervised keyword extraction
technique. The conventional model evaluates
the degree of importance of a word in a single
document, whereas the proposed variants evaluate
the degree of importance of a word in a whole
document collection.
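For orientation, a per-document weighting of the conventional kind can be sketched with a standard TF-IDF score. The sketch assumes that the conventional model is a TF-IDF-style weighting; it does not reproduce the variants proposed in [10], and the tokenizer, the function names and the top_n parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphabetic tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, top_n=3):
    """Return the top_n highest-scoring terms for each document.

    Score = (term count / document length) * log(N / document frequency),
    a standard TF-IDF formulation. Ties are broken alphabetically.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


documents = [
    "the earthquake damaged the city and the earthquake rescue began",
    "the election results surprised the city",
    "the storm passed and the city reopened",
]
keywords = tfidf_keywords(documents)
print(keywords)
```

Words that occur in every document, such as "the" and "city" here, receive a score of zero and therefore never outrank terms that are specific to one story.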
Kamaldeep Kaur and Vishal Gupta [6] proposed
event detection and topic tracking for Punjabi
news streams. Punjabi is a highly inflectional and
agglutinating language, providing one of the
richest and most challenging sets of linguistic
and statistical features and resulting in long and
complex word forms. Topic tracking for
the Punjabi language has been tried with
two approaches:
1. NER based approach
2. Keyword extraction approach
CONCLUSION
Topic tracking monitors a stream of news
stories and finds those that discuss the same topic
as a few positive sample stories. In this
paper, two approaches are studied:
NER feature extraction and keyword
extraction. Language-dependent and
language-independent features are formed and
analyzed. Named entities such as date/time,
location, person name, organization and
designation, and keywords from the title, cue
phrases and high-frequency nouns, are extracted.
REFERENCES:
[1] Allan, James, Papka, Ron and Lavrenko, Victor, 1998, Online new event detection and tracking. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery Special Interest Group on Information Retrieval, p. 37-45.
[2] Allan, James, Lavrenko, Victor and Connell, Margaret E., A Month to Topic Detection and Tracking in Hindi.
[3] Chen, Wei, Chen, Chun, Zhang, Li-Jun, Wang, Can, Bu, Jia-Jun, 2010, Online detection of bursty events and their evolution in news streams, J. Zhejiang Univ.-Sci C, 340-355.
[4] Dadgar, Omid, Topic Detection and Tracking, Available: www.tcnj.edu/~mmmartin/.../TDT/TopicDetectionTracking04.ppt
[5] Gupta, Vishal, Lehal, G. S., 2009, A Survey of Text Mining Techniques and Applications, in Journal of Emerging Technologies in Web Intelligence.
[6] Kaur, Kamaldeep and Gupta, Vishal, 2011, Topic Tracking for Punjabi Language, Computer Science & Engineering: An International Journal (CSEIJ), Vol. 1, No. 3.
[7] Kolya, Anup Kumar, Ekbal, Asif, Bandyopadhyay, Sivaji, 2009, A Simple Approach for Monolingual Event Tracking System in Bengali, 8th International Symposium on Natural Language Processing, IEEE.
[8] Kumaran, Giridhar and Allan, James, Using names and topics for new event detection.
[9] Lam, W., Meng, H. M. L., Wong, K. L., Yen, J. C. H., Event detection using contextual analysis. Int. J. Intell. Syst., 16 (4): 525-546. [doi: 10.1002/int.1022]
[10] Lee, Sungjick, Kim, Han-joon, 2008, News Keyword Extraction for Topic Tracking, 4th International Conference on Networked Computing and Advanced Information Management, IEEE.
[11] Liu, Yan, Lv, Nan, Luo, Junyong, Yang, Huijie, 2009, Subtopic Based Topic Evolution Analysis, International Conference on Web Information Systems and Mining, IEEE.
[12] Navathe, Shamkant B. and Elmasri, Ramez, 2000, Data Warehousing and Data Mining, in Fundamentals of Database Systems, Pearson Education Pvt Inc, Singapore, 841-872.
[13] Qin, Xiangju, Zhang, Yang, 2008, Improving the Performance of Topic Tracking System by Ensemble, International Conference on Computer Science and Software Engineering, IEEE.
[14] Radovanovic, Milos, Ivanovic, Mirjana, 2008, Text Mining: Approaches and Applications, Novi Sad J. Math, Vol. 38, No. 3: 227-234.
[15] Sayyadi, Hassan, Hurst, Matthew and Maykov, Alexey, 2009, Event detection and tracking in social streams, 3rd Intl AAAI Conference on Weblogs and Social Media, ICWSM 09, AAAI.
[16] Stavrianou, Anna, Andritsos, Periklis, Nicoloyannis, Nicolas, 2007, Overview and Semantic Issues of Text Mining, Sigmod Record, Vol. 36, No. 3.
[17] Xiaowei, Wang, Longbin, Jiang, Jialin, Ma, Jiangyan, 2008, Use of NER Information for Improved Topic Tracking, Eighth International Conference on
Intelligent Systems Design and Applications, IEEE.
[18] Yang, Y., Pierce, T., Carbonell, J., 1998, A study on retrospective and on-line event detection. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[19] Yang, Yiming, Carbonell, Jaime Q., Brown, Ralf D., Pierce, Thomas, Archibald, Brian and Liu, Xin, 1999, Learning approaches for detecting and tracking news events, IEEE Intell. Syst., 32-43.
[20] Zhang, Xianfei, Guo, Zhigang, Li, Bicheng, 2009, An Effective Algorithm of News Topic Tracking, Global Congress on Intelligent Systems, IEEE.
AUTHOR'S BIOGRAPHY
Nain Kanwal Kaur is currently
working as Assistant Professor,
Department of Computer
Science and Engineering,
Continental Institute of
Engineering and Technology,
Jalvehra, Punjab, India.