Semantic History Embedding in Online Generative Topic Models
Pu Wang (presenter)
Authors: Loulwah AlSumait ([email protected]), Daniel Barbará ([email protected]), Carlotta Domeniconi ([email protected])
Department of Computer Science, George Mason University
SDM 2009




Page 1:

Semantic History Embedding in Online Generative Topic Models

Pu Wang (presenter)

Authors: Loulwah AlSumait ([email protected]), Daniel Barbará ([email protected]), Carlotta Domeniconi ([email protected])

Department of Computer Science, George Mason University

SDM 2009

Page 2:

Outline

- Introduction and related work
- Online LDA (OLDA)
- Parameter generation
  - Sliding history window
  - Contribution weights
- Experiments
- Conclusion and future work

Page 3:

Introduction

- When a topic is observed at a certain time, it is more likely to appear in the future.
- Previously discovered topics hold important information about the underlying structure of the data.
- Incorporating such information into future knowledge discovery can enhance the inferred topics.

Page 4:

Related Work

- Q. Sun, R. Li et al., ACL 2008: an LDA-based Fisher kernel to measure the semantic similarity between blocks of text in LDA documents.
- X. Wang et al., ICDM 2007: a Topical N-Gram model that automatically identifies feasible n-grams based on the context that surrounds them.
- X. Phan et al., IW3C2 2008: a classifier trained on a small set of labeled documents in addition to an LDA topic model estimated from Wikipedia.

Page 5:

Tracking Topics

[Figure: the Online LDA (OLDA) pipeline. Plate diagrams of LDA at time t and at time t+1 (M^t documents, N_d words, K topics; variables z_i^t and w_i^t; stream S^t; the time between t and t+1 is ε) are connected by topic evolution tracking, priors construction, and emerging topic detection, which produces an emerging topic list.]

Page 6:

Inference Process

Parameter generation reduces inference at time t to a simple problem solved by Gibbs sampling:

P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}^t, \alpha^t, \beta^t) \propto \frac{C^{KV,t}_{j w_i} + \beta^t_{j w_i}}{\sum_{v=1}^{V} \left( C^{KV,t}_{j v} + \beta^t_{j v} \right)} \cdot \frac{C^{KD,t}_{j d_i} + \alpha^t_j}{\sum_{k=1}^{K} \left( C^{KD,t}_{k d_i} + \alpha^t_k \right)}

where C^{KV,t} and C^{KD,t} are the topic-word and topic-document count matrices of the current stream (excluding the current assignment of w_i), while the priors α^t and β^t carry the historic observations.
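The sampling update above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation; the function and variable names are hypothetical, and the denominator of the document factor is dropped because it does not depend on the sampled topic j.

```python
import numpy as np

def gibbs_step(docs, z, C_kv, C_kd, n_k, alpha, beta):
    """One sweep of collapsed Gibbs sampling over a stream (sketch).

    docs : list of word-id lists; z : matching list of topic assignments.
    C_kv : K x V topic-word counts; C_kd : K x D topic-document counts.
    n_k  : length-K row totals of C_kv; alpha : length-K prior;
    beta : K x V prior (in OLDA, beta carries the historic observations).
    """
    beta_sum = beta.sum(axis=1)  # per-topic prior mass
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            j = z[d][i]
            # remove the current assignment from all counts (the "-i" part)
            C_kv[j, w] -= 1; C_kd[j, d] -= 1; n_k[j] -= 1
            # unnormalized P(z_i = j | z_-i, w) per the update rule above
            p = ((C_kv[:, w] + beta[:, w]) / (n_k + beta_sum)
                 * (C_kd[:, d] + alpha))
            j = np.random.choice(len(p), p=p / p.sum())
            z[d][i] = j
            C_kv[j, w] += 1; C_kd[j, d] += 1; n_k[j] += 1
    return z
```

In practice several such sweeps are run per stream before reading off the topic-word distributions from the counts.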

Page 7:

Topic Evolution Tracking

Topic alignment over time handles changes in the lexicon and topic drift. For example (values are P(topic) and P(word | topic), with topics aligned from t to t+1):

Time t:
- Topic 1 (0.65): bank (0.44), money (0.35), loan (0.21)
- Topic 2 (0.35): factory (0.53), production (0.34), labor (0.13)

Time t+1:
- Topic 1 (0.43): bank (0.5), credit (0.32), money (0.18)
- Topic 2 (0.57): factory (0.48), cost (0.32), manufacturing (0.2)

Page 8:

Sliding History Window

Consider all topic-word distributions within a “sliding history window” of size δ. Alternatives for keeping track of history at time t:
- full memory: δ = t
- short memory: δ = 1
- intermediate memory: δ = c

[Figure: the evolution matrix built from the dictionary and the topic distributions over time.]
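The three memory settings differ only in how many past topic-word matrices are retained. A minimal sketch, with hypothetical names and stand-in data:

```python
from collections import deque
import numpy as np

# Sliding history window: keep at most the last delta topic-word matrices.
delta = 3                      # intermediate memory; delta = 1 is short memory
history = deque(maxlen=delta)  # full memory would use an unbounded list instead

for t in range(5):
    # stand-in for the K x V distribution phi^t inferred from stream t
    phi_t = np.random.dirichlet(np.ones(4), size=2)
    history.append(phi_t)      # the oldest snapshot is dropped automatically

print(len(history))  # → 3: only the last delta snapshots are retained
```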

Page 9:

Contribution Control

Evolution tuning parameters ω:
- Individual weights of the models:
  - decaying history: ω1 < ω2 < … < ωδ
  - equal contributions: ω1 = ω2 = … = ωδ
- Total weight of history (vs. the weight of new observations):
  - balanced weights (sum = 1)
  - biased toward the past (sum > 1)
  - biased toward the future (sum < 1)
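Both tuning knobs can be captured in one small helper. This is an illustrative sketch, not the paper's code: the function name is hypothetical, the decay is assumed linear, and index δ is assumed to be the most recent model.

```python
import numpy as np

def history_weights(delta, scheme="equal", total=1.0):
    """Build a weight vector omega for the last delta models (sketch).

    scheme "equal" gives omega_1 = ... = omega_delta; "decay" gives
    omega_1 < ... < omega_delta, so older models contribute less.
    total is sum(omega): 1 is balanced, > 1 biases toward the past,
    < 1 biases toward the future (relative to the new observations).
    """
    if scheme == "equal":
        w = np.ones(delta)
    elif scheme == "decay":
        w = np.arange(1, delta + 1, dtype=float)  # linear growth toward
    else:                                         # the most recent model
        raise ValueError(f"unknown scheme: {scheme}")
    return total * w / w.sum()
```

For example, `history_weights(3, "decay")` gives (1/6, 2/6, 3/6): a balanced, decaying-history setting.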

Page 10:

Parameter Generation

Priors of the topic distribution over words at time t+1: for each topic k, the evolution matrix B_k^t holds as its columns the topic-word distributions inferred in the last δ streams, \phi_k^{(t-\delta+1)}, \ldots, \phi_k^{(t)}, and is combined with the weight vector ω:

\beta_k^{t+1} = B_k^t \, \omega

Generate the topic distribution at time t+1 from the resulting prior:

\phi_k^{t+1} \sim \mathrm{Dirichlet}(\beta_k^{t+1})
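The matrix-vector product above is a one-liner in NumPy. The sketch below uses random stand-in distributions and hypothetical shapes; note that each β_k^{t+1} row sums to sum(ω), since it is a weighted combination of probability distributions.

```python
import numpy as np

# Hypothetical sizes: K topics, V words, a window of delta past models.
K, V, delta = 2, 4, 3
rng = np.random.default_rng(0)

# Stand-ins for the inferred distributions phi_k^{t-delta+1}, ..., phi_k^t;
# B[k] is the V x delta evolution matrix of topic k (one column per stream).
phis = rng.dirichlet(np.ones(V), size=(delta, K))   # delta x K x V
B = phis.transpose(1, 2, 0)                         # K x V x delta
omega = np.full(delta, 1.0 / delta)                 # equal, balanced weights

beta_next = B @ omega                               # beta_k^{t+1} = B_k^t omega
phi_next = np.array([rng.dirichlet(b) for b in beta_next])  # phi ~ Dir(beta)
```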

Page 11:

Experimental Design

Implementation: the “Matlab Topic Modeling Toolbox” by Mark Steyvers and Tom Griffiths.

Datasets:
- NIPS proceedings, 1988-2000: 1,740 papers; 13,649 unique words; 2,301,375 word tokens; 13 streams of 90 to 250 documents each.
- Reuters-21578: news from 26-FEB-1987 to 19-OCT-1987; 10,337 documents; 12,112 unique words; 793,936 word tokens; 30 streams (29 of 340 documents, 1 of 517).

Baselines:
- OLDAfixed: no memory
- OLDA (ω(1)): short memory

Performance evaluation measure: perplexity, computed on the documents of the next year or the next stream.
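Perplexity on a held-out stream can be computed in the standard way; lower values mean better prediction. A minimal sketch with hypothetical names:

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Perplexity of held-out documents under an LDA model (sketch).

    docs  : list of word-id lists (the next stream's documents)
    theta : D x K per-document topic proportions
    phi   : K x V per-topic word distributions
    """
    log_lik, n_tokens = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            # p(w | d) = sum_k theta_dk * phi_kw
            log_lik += np.log(theta[d] @ phi[:, w])
            n_tokens += 1
    return np.exp(-log_lik / n_tokens)
```

As a sanity check, a single topic that is uniform over a 4-word vocabulary gives perplexity 4.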

Page 12:

Reuters: OLDA with fixed β vs. OLDA with semantic β

[Figure: perplexity per stream; the no-memory (fixed-β) baseline vs. the semantic-β models.]

Page 13:

Reuters: OLDA with different window sizes and weights

- Increasing the window size enhanced prediction.
- Incremental history information (δ > 1, sum > 1) did not improve topic estimation at all.

[Figure: perplexity per stream for short memory, equal contribution, increasing window sizes, and incremental history information.]

Page 14:

NIPS: OLDA with different window sizes

- Increasing the window size enhanced prediction with respect to short memory.
- Window sizes greater than 3 enhanced prediction.
- The total weight of the history also has an effect.

[Figure: perplexity per stream for no memory, short memory, and larger window sizes.]

Page 15:

NIPS: OLDA with different total weights

Models with a lower total weight resulted in better prediction.

[Figure: perplexity per stream for no memory, sum of weights = 1, and decreasing sums of weights.]

Page 16:

NIPS & Reuters: OLDA with different total weights

Variable sum(ω), with δ = 2.

[Figure: perplexity per stream as the total sum of weights is decreased and increased.]

Page 17:

NIPS: OLDA with equal vs. decaying history contributions

[Figure: perplexity per stream under equal and decaying contribution weights.]

Page 18:

Conclusions

- Studied the effect of embedding semantic information in LDA topic modeling of text streams.
- Parameters are generated from topical structures inferred in the past.
- Semantic embedding enhances OLDA prediction.
- Examined the effects of the total influence of history, the history window size, and equal vs. decaying contributions.

Future work:
- Use of prior knowledge.
- Effect of embedded historic semantics on detecting emerging and/or periodic topics.