Dynamic Multi-Faceted Topic Discovery in Twitter

Jan Vosecky

Di Jiang

Kenneth Wai-Ting Leung

Wilfred Ng

Twitter

Representation

• Vector space model
  – Term vector sparseness issue

• Topic models
  – Latent topic vector better than VSM?

Topic Models

A latent topic in LDA

“Arab revolutions”

Word      Probability
Libya     0.00040
Force     0.00020
Human     0.00010
Abuse     0.00010
Protect   0.00009
Secure    0.00008
War       0.00005
Execute   0.00004

A topic in Twitter?

• Not just words
• People talk about entities
  – Persons, Organizations, Locations, Time, …

Multi-faceted Topic Model

• Each topic consists of n facets
  – Elements of each facet ~ multinomial distribution

• Each document d is a distribution over topics
  – General terms, named entities and timestamp drawn from the respective facet of topic z
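The generative story above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy topics, their probabilities, the facet names, the choice to draw a topic for each element, and the treatment of time as a discrete bin are all assumptions made for the example.

```python
import random

# Hypothetical toy model: two topics, each with one multinomial per facet.
# All probabilities below are illustrative, not values from the slides.
TOPICS = {
    "arab_revolutions": {
        "terms": {"force": 0.5, "protect": 0.3, "war": 0.2},
        "locations": {"Libya": 0.8, "Egypt": 0.2},
        "time": {"2011-02": 0.6, "2011-03": 0.4},
    },
    "football": {
        "terms": {"goal": 0.6, "match": 0.4},
        "locations": {"Madrid": 0.5, "London": 0.5},
        "time": {"2011-05": 1.0},
    },
}


def draw(dist, rng):
    """Draw one item from a {item: probability} multinomial."""
    items = list(dist)
    return rng.choices(items, weights=[dist[i] for i in items], k=1)[0]


def generate_doc(theta, facet_sizes, rng):
    """Generate one document from its topic distribution theta.

    Assumption: every element gets its own topic z drawn from theta,
    and is then drawn from the matching facet of topic z.
    """
    doc = {}
    for facet, size in facet_sizes.items():
        doc[facet] = []
        for _ in range(size):
            z = draw(theta, rng)
            doc[facet].append(draw(TOPICS[z][facet], rng))
    return doc


rng = random.Random(0)
theta = {"arab_revolutions": 0.9, "football": 0.1}
doc = generate_doc(theta, {"terms": 3, "locations": 1, "time": 1}, rng)
```

Inference runs this story in reverse: given the observed terms, entities and timestamps, it estimates theta for each document and the per-facet multinomials for each topic.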

Multi-faceted Topic Model

[Figure: multi-faceted latent topic “Arab revolutions”, with one panel per facet: General terms, Persons, Locations, Organizations, Time]

Parameter Inference

• Scalability
  – Gibbs sampling and variational inference process data in a batch

• Online inference
  – Stochastic variational inference to process streaming data
  – Model continuously updated
  – Constant time to process a new doc

[Figure: the doc stream is split into consecutive groups of docs, with one inference step per group]
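The online scheme can be illustrated with the generic stochastic variational inference update, in which the global parameters are blended with an estimate computed from the newest mini-batch. The sketch below is a simplified stand-in rather than the authors' algorithm: it tracks a single word distribution, uses scaled word counts in place of the real local inference step, and the names `step_size`, `minibatch_estimate`, `online_update`, `tau0`, `kappa`, `eta` and `corpus_size` are assumptions.

```python
from collections import Counter


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying learning rate rho_t = (tau0 + t) ** -kappa."""
    return (tau0 + t) ** (-kappa)


def minibatch_estimate(batch, corpus_size, eta=0.01):
    """Stand-in for local inference: prior plus word counts,
    scaled as if the whole corpus looked like this mini-batch."""
    counts = Counter(word for doc in batch for word in doc)
    scale = corpus_size / len(batch)
    return {word: eta + scale * c for word, c in counts.items()}


def online_update(lam, batch, t, corpus_size, eta=0.01):
    """Blend current global parameters lam with the mini-batch estimate."""
    rho = step_size(t)
    est = minibatch_estimate(batch, corpus_size, eta)
    vocab = set(lam) | set(est)
    return {
        w: (1 - rho) * lam.get(w, eta) + rho * est.get(w, eta)
        for w in vocab
    }


# Process a small stream, one mini-batch at a time.
stream = [
    [["libya", "war"], ["libya"]],
    [["force", "libya"], ["protect"]],
]
lam = {}
for t, batch in enumerate(stream):
    lam = online_update(lam, batch, t, corpus_size=4)
```

Each update touches only the current mini-batch, so the work per incoming document does not grow with the number of documents already seen.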

Perplexity comparison: Online inference vs. Gibbs sampling

[Figure: two panels, K = 50 and K = 200]
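For reference, perplexity is the exponentiated negative average log-likelihood per token on held-out text, so lower is better. The sketch below uses the standard definition for a topic mixture; it is not necessarily the exact evaluation setup behind the comparison, and `mixture_prob`, `perplexity` and the smoothing floor are names and choices assumed for the example.

```python
import math


def mixture_prob(theta, beta, word, floor=1e-12):
    """p(word | doc) = sum_k theta[k] * beta[k][word].

    Unseen words get a small floor probability (an assumption)
    so that the logarithm is always defined.
    """
    return sum(theta[k] * beta[k].get(word, floor) for k in theta)


def perplexity(docs, word_prob):
    """exp(-(sum of log p(w | d)) / number of tokens).

    word_prob(doc, word) is any callable returning p(word | doc).
    """
    log_lik = 0.0
    n_tokens = 0
    for doc in docs:
        for word in doc:
            log_lik += math.log(word_prob(doc, word))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Sanity check: a uniform model over 4 words has perplexity 4.
uniform = perplexity([["a", "b"], ["c"]], lambda doc, word: 0.25)
```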

Tweet Clustering

[Figure: clustering results on (a) manually-labeled dataset and (b) hashtag-labeled dataset; methods in each panel: DBSCAN, K-means, Direct; legend includes Vector space model (TF-IDF)]
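The slides do not define the "Direct" method. One plausible reading, assumed here, is that each tweet goes straight to the cluster of its most probable topic, with no separate clustering algorithm. The function name `direct_clusters` and the toy topic vectors are hypothetical.

```python
from collections import defaultdict


def direct_clusters(doc_topics):
    """Group documents by their highest-probability topic.

    doc_topics maps a document id to its {topic: probability} vector.
    """
    clusters = defaultdict(list)
    for doc_id, theta in doc_topics.items():
        best_topic = max(theta, key=theta.get)
        clusters[best_topic].append(doc_id)
    return dict(clusters)


# Illustrative topic vectors for three tweets.
doc_topics = {
    "t1": {0: 0.8, 1: 0.2},
    "t2": {0: 0.1, 1: 0.9},
    "t3": {0: 0.6, 1: 0.4},
}
clusters = direct_clusters(doc_topics)
```

DBSCAN and K-means instead take a vector per tweet, either the topic vector or the TF-IDF vector, and cluster those vectors by distance.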

Summary

• Model multi-faceted topics in microblogs
  – Entity-oriented and dynamic

• Online inference method

• Beneficial for downstream applications

Thank You!
