Discovering high-level topics from social streams is important for many downstream applications. However, traditional text mining methods that rely on the bag-of-words model are insufficient to uncover the rich semantics and temporal aspects of topics in Twitter. In particular, topics in Twitter are inherently dynamic and often focus on specific entities, such as people or organizations. In this paper, we therefore propose a method for mining multi-faceted topics from Twitter streams. We introduce the Multi-Faceted Topic Model (MfTM), which jointly models latent semantics among terms and entities and captures the temporal characteristics of each topic. We develop an efficient online inference method for MfTM, which enables the model to be applied to large-scale and streaming data. Our experimental evaluation shows the effectiveness and efficiency of our model compared with state-of-the-art baselines. We further demonstrate the effectiveness of our framework in the context of tweet clustering. More info: http://www.cse.ust.hk/~jvosecky/
Dynamic Multi-Faceted Topic Discovery in Twitter
Jan Vosecky
Di Jiang
Kenneth Wai-Ting Leung
Wilfred Ng
Representation
• Vector space model
  – Term vector sparseness issue
• Topic models
  – Latent topic vector better than VSM?
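The sparseness issue can be illustrated with a minimal sketch. The two example tweets and the helper names `term_vector` and `cosine` are made up for illustration and do not come from the slides. Two short texts about the same events may share no terms at all, so their bag-of-words similarity is zero.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical tweets on the same subject with no words in common.
tweet_a = term_vector("protests spread across tripoli")
tweet_b = term_vector("libya uprising continues")
similarity = cosine(tweet_a, tweet_b)  # no shared terms, so the dot product is zero
```

A latent topic representation is one way to relate such texts, since both could receive high weight on the same topic even without overlapping vocabulary.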
Topic Models
A latent topic in LDA
“Arab revolutions”
Libya    0.00040
Force    0.00020
Human    0.00010
Abuse    0.00010
Protect  0.00009
Secure   0.00008
War      0.00005
Execute  0.00004
A topic in Twitter?
• Not just words
• People talk about entities
[Figure labels: Persons, Organizations, Locations, Time, …]
Multi-faceted Topic Model
• Each topic consists of n facets
  – Elements of each facet ~ multinomial distribution
• Each document d is a distribution over topics
  – General terms, named entities and timestamp drawn from the respective facet of topic z
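The generative view described above can be sketched roughly as follows. This is an illustration under several assumptions that the slides do not confirm: symmetric Dirichlet priors (as in LDA), a topic drawn separately for every element, and timestamps discretized into bins so they can be drawn from a multinomial like the other facets. All vocabularies, names and sizes are hypothetical.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_facets, doc_topics, facet_lengths, rng):
    """For each element: pick a topic z from the document's topic distribution,
    then pick the element from the matching facet distribution of topic z."""
    doc = {}
    for facet, length in facet_lengths.items():
        elements = []
        for _ in range(length):
            z = rng.choices(range(len(doc_topics)), weights=doc_topics)[0]
            vocab, probs = topic_facets[z][facet]
            elements.append(rng.choices(vocab, weights=probs)[0])
        doc[facet] = elements
    return doc

rng = random.Random(0)
# Hypothetical facet vocabularies; "time" uses discretized bins (an assumption).
vocabularies = {
    "terms": ["force", "protect", "war", "secure"],
    "persons": ["person_a", "person_b"],
    "locations": ["location_a", "location_b"],
    "organizations": ["org_a", "org_b"],
    "time": ["bin_0", "bin_1"],
}
num_topics = 3
# One multinomial per facet per topic.
topic_facets = [
    {f: (v, sample_dirichlet(0.5, len(v), rng)) for f, v in vocabularies.items()}
    for _ in range(num_topics)
]
doc_topics = sample_dirichlet(0.5, num_topics, rng)  # the document's topic distribution
facet_lengths = {"terms": 5, "persons": 1, "locations": 1, "organizations": 1, "time": 1}
doc = generate_document(topic_facets, doc_topics, facet_lengths, rng)
```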
Multi-faceted Topic Model
Multi-faceted latent topic “Arab revolutions”
Facets: General terms | Persons | Locations | Organizations | Time
Parameter Inference
• Scalability
  – Gibbs sampling and variational inference process data in a batch
• Online inference
  – Stochastic variational inference to process streaming data
  – Model continuously updated
  – Constant time to process a new doc
[Figure: successive groups of documents from the stream, each followed by an inference step]
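The global update at the heart of stochastic variational inference can be sketched as below. This is the generic update used for online topic models, not the MfTM-specific derivation, which the slides do not show. The local (per-document) step is omitted, and `batch_counts` stands in for the expected counts it would produce. The names `learning_rate` and `online_update`, the default hyperparameters, and the nominal `corpus_size` for a stream are all assumptions.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t)^(-kappa).
    kappa in (0.5, 1] satisfies the usual stochastic-approximation conditions."""
    return (tau0 + t) ** (-kappa)

def online_update(lam, batch_counts, t, corpus_size, batch_size, eta=0.01):
    """Blend the current global parameters with an estimate from one mini-batch.

    lam          -- current variational parameters, one row per topic
    batch_counts -- expected counts from the mini-batch, same shape as lam
    corpus_size  -- nominal number of documents (an assumed constant for a stream)
    """
    rho = learning_rate(t)
    scale = corpus_size / batch_size  # treat the batch as if it were the whole corpus
    return [
        [(1.0 - rho) * l + rho * (eta + scale * c) for l, c in zip(lam_row, count_row)]
        for lam_row, count_row in zip(lam, batch_counts)
    ]

# Hypothetical 2-topic, 2-word example.
lam = [[1.0, 1.0], [1.0, 1.0]]
batch_counts = [[2.0, 0.0], [0.0, 1.0]]
updated = online_update(lam, batch_counts, t=0, corpus_size=100, batch_size=10)
```

The cost of one update depends on the mini-batch and the size of the parameter table, not on how many documents have been seen so far, which is consistent with the constant per-document cost noted above.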
Perplexity comparison: Online inference vs. Gibbs sampling
[Figure panels: K = 50 and K = 200]
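For reference, perplexity is commonly computed as the exponential of the negative average per-token log-likelihood on held-out data, with lower values being better. The slides do not state the exact formula used in the evaluation, so the sketch below shows only this standard definition; the function name and the toy numbers are illustrative.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(total log-likelihood) / (total token count)); lower is better."""
    total_tokens = sum(doc_lengths)
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)

# Sanity check: a model assigning probability 1/4 to every token
# has a perplexity equal to the effective vocabulary size, 4.
uniform = perplexity([3 * math.log(0.25), 5 * math.log(0.25)], [3, 5])
```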
Tweet Clustering
[Figure: (a) Manually-labeled dataset, (b) Hashtag-labeled dataset; clustering methods in each panel: DBSCAN, K-means, Direct; legend includes Vector space model (TF-IDF)]
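The slides do not define the "Direct" method. One plausible reading, offered here purely as an assumption, is that each tweet is assigned to the cluster of its highest-weighted topic, with no separate clustering algorithm. A minimal sketch of that reading, with made-up topic vectors:

```python
from collections import defaultdict

def direct_clusters(doc_topic_vectors):
    """Group documents by the index of their highest-weighted topic.
    This interprets "Direct" as argmax assignment (an assumption)."""
    clusters = defaultdict(list)
    for doc_id, vec in doc_topic_vectors.items():
        top_topic = max(range(len(vec)), key=vec.__getitem__)
        clusters[top_topic].append(doc_id)
    return dict(clusters)

# Hypothetical per-tweet topic distributions.
vectors = {
    "t1": [0.7, 0.2, 0.1],
    "t2": [0.1, 0.8, 0.1],
    "t3": [0.6, 0.3, 0.1],
}
clusters = direct_clusters(vectors)
```

By contrast, K-means or DBSCAN would take the same topic vectors (or TF-IDF vectors) as input features and group them by distance.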
Summary
• Model multi-faceted topics in microblogs
  – Entity-oriented and dynamic
• Online inference method
• Beneficial for downstream applications
Thank You!