Nonparametric Bayesian Storyline Detection from Microtexts · About time Prior approaches to modeling time I Maximum temporal gap between items on same storyline I Look for attention

Nonparametric Bayesian StorylineDetection from Microtexts

Vinodh Krishnan and Jacob EisensteinGeorgia Institute of Technology

Clustering microtexts into storylines

z = 1

Strong start for Barcelona

Oct 1, 1:15pmz = 2

Dog tuxedo bought with countycredit card

Oct 1, 1:23pm

z = 1

Messi scores! Barcelona up 1-0

Oct 1, 1:39pm

. . .

z = 3

Yellow card for Messi

Oct 8, 10:15am

Storyline detection is a multimodal clusteringproblem, involving content and time.

Vinodh Krishnan and Jacob Eisenstein: Nonparametric Bayesian Storyline Detection from Microtexts


z = 1 Strong start for Barcelona

Oct 1, 1:15pm

z = 2 Dog tuxedo bought with countycredit card

Oct 1, 1:23pm

z = 1 Messi scores! Barcelona up 1-0

Oct 1, 1:39pm

. . .z = 1 Yellow card for Messi

Oct 8, 10:15am




z = 1 Strong start for Barcelona Oct 1, 1:15pmz = 2 Dog tuxedo bought with county

credit cardOct 1, 1:23pm

z = 1 Messi scores! Barcelona up 1-0 Oct 1, 1:39pm. . .

z = 3 Yellow card for Messi Oct 8, 10:15am




z = 1 Strong start for Barcelona Oct 1, 1:15pmz = 2 Dog tuxedo bought with county

credit cardOct 1, 1:23pm

z = 1 Messi scores! Barcelona up 1-0 Oct 1, 1:39pm. . .

z = 3 Yellow card for Messi Oct 8, 10:15am



About timePrior approaches to modeling time

I Maximum temporal gap between items onsame storyline

I Look for attention peaks (Marcus et al., 2011)I Model temporal distribution per storyline (Ihler

et al., 2006; Wang & McCallum, 2006)

Problems with these approaches:I Storylines can have vastly different timescales,

might be periodic, etc.I Methods for determining number of storylines

are typically ad hoc.


About timePrior approaches to modeling time

I Maximum temporal gap between items onsame storyline

I Look for attention peaks (Marcus et al., 2011)I Model temporal distribution per storyline (Ihler

et al., 2006; Wang & McCallum, 2006)Problems with these approaches:

I Storylines can have vastly different timescales,might be periodic, etc.

I Methods for determining number of storylinesare typically ad hoc.


This work

A non-parametric Bayesian framework for storylinesI The number of storylines is a latent variable.I No parametric assumptions about the temporal structure

of storyline popularity.I Text is modeled as a bag-of-words, but the modular

framework admits arbitrary (centroid-based) models.I Linear-time inference via streaming sampling


Modeling framework

Prior probability of storyline assignments,conditioned on timestamps

P(w , z | t) = P(z | t)K∏

k=1P({w i :zi=k})

Likelihood of text, computed per storyline


The prior over storyline assignments

We want a prior distribution P(z | t) that is:I nonparametric over the number of storylines;I nonparametric over the storyline temporal

distributions.How to do it?

The distance-dependentChinese restaurant process (Blei & Frazier, 2011)


The prior over storyline assignments

We want a prior distribution P(z | t) that is:I nonparametric over the number of storylines;I nonparametric over the storyline temporal

distributions.How to do it? The distance-dependentChinese restaurant process (Blei & Frazier, 2011)


From graphs to clusterings

c1

c2

c3

c4

Key idea of dd-CRP: “follower”graphs define clusterings.

I Z = ((1, 3, 4), (2))

I Z = ((1, 3), (2, 4))I Z = ((1, 3, 4), (2))I Z = ((1, 3), (2), (4))



c1

c2

c3

c4


I Z = ((1, 3, 4), (2))I Z = ((1, 3), (2, 4))

I Z = ((1, 3, 4), (2))I Z = ((1, 3), (2), (4))



c1

c2

c3

c4


I Z = ((1, 3, 4), (2))I Z = ((1, 3), (2, 4))I Z = ((1, 3, 4), (2))

I Z = ((1, 3), (2), (4))



c1

c2

c3

c4


I Z = ((1, 3, 4), (2))I Z = ((1, 3), (2, 4))I Z = ((1, 3, 4), (2))I Z = ((1, 3), (2), (4))


Prior distributionWe reformulate the prior over follower graphs:

P(z | t) = P(c | t) =N∏

i=1P(ci | ti , tci )

P(ci | ti , tci ) =

e−|ti−tci |/a, ci 6= iα, ci = i

I Probability of two documents being linkeddecreases exponentially with time gap ti − tj .

I The likelihood of a document linking to itself(starting a new cluster) is proportional to α.


Modeling framework


P(w , z | t) = P(z | t)K∏

k=1P({w i :zi=k})



Likelihood

Cluster likelihoods are computed using the DirichletCompound Multinomial (Doyle & Elkan, 2009).

P(w) =K∏

k=1P({w i}zi=k)

=K∏

k=1

∫θPMN({w i}zi=k | θk)PDir(θk ; η)dθk

=K∏

k=1PDCM({w i}zi=k ; η),

where η is a concentration hyperparameter.


The Dirichlet Compound MultinomialThe DCM is a distribution over vectors of counts,which rewards compact word distributions.

Mes

si

card

Barc

elon

a

yello

w

cred

it

tuxe

do

goal

word

0

1

2

3

4

5w1w2

10-5 10-4 10-3 10-2 10-1 100 101 102

η

70

60

50

40

30

20

10

logP

(w)

w1

w2

We set the hyperparameter η using a heuristic fromMinka (2012).


Modeling framework


P(w , z | t) = P(z | t)K∏

k=1P({w i :zi=k})



Inference: Gibbs samplingc1

c2

c3

c4

I We iteratively cut and resampleeach link.

I Each link is sampled from thejoint probability,

Prsample

(ci = j | c−i ,w) ∝ Pr(ci = j)× P(w | c)

∝ Pr(ci = j)×P({wk}zk =zi∨zk =zj )

P({wk}zk =zi )× P({wk}zk =zj )

I Online inference: Gibbssampling restricted to a movingwindow (linear-time)



c2

c3

c4



Prsample

(ci = j | c−i ,w) ∝ Pr(ci = j)× P(w | c)






c2

c3

c4



Prsample

(ci = j | c−i ,w) ∝ Pr(ci = j)× P(w | c)

∝ e−t4−t1

a × P({w1,w3,w4)}P({w4})× P({w1,w3})




c2

c3

c4



Prsample

(ci = j | c−i ,w) ∝ Pr(ci = j)× P(w | c)

∝ e−t4−t2

a × P({w2,w4)}P({w4})× P({w2})




c2

c3

c4



Prsample

(ci = j | c−i ,w) ∝ Pr(ci = j)× P(w | c)

∝ e−t4−t3

a × P({w1,w3,w4)}P({w4})× P({w1,w3})




c2

c3

c4



Prsample

(ci = j | c−i ,w) ∝ Pr(ci = j)× P(w | c)

∝ α× P({w4)}P({w4})




c2

c3

c4



Prsample

(ci = j | c−i ,w) ∝ Pr(ci = j)× P(w | c)





TREC 2014 TTG ResultsModel F1 F w

1

dd-CRP clustering models1. baseline 0.20 0.302. offline 0.29 0.343. online 0.29 0.35

Top systems from Trec-2014 TTG4. TTGPKUICST2 (Lv et al., 2014) 0.35 0.465. EM50 (Magdy et al., 2014) 0.25 0.386. hltcoeTTG1 (Xu et al., 2014) 0.28 0.37

I Online inference as accurate as offline GibbsI 2nd of 14 TREC systems on F1, 4th/14 on F w

1I We use the baseline retrieval model, 0.31 MAP

vs 0.5-0.6 MAP for best systems.



1





vs 0.5-0.6 MAP for best systems.



1





vs 0.5-0.6 MAP for best systems.Vinodh Krishnan and Jacob Eisenstein: Nonparametric Bayesian Storyline Detection from Microtexts

Summary

I Nonparametric Bayesian storyline detectionincorporating content and time.

Content Centroid-based likelihood(Dirichlet Compound Multinomial)

Time Distance-based prior (ddCRP)Fancier likelihoods and distance functions canbe incorporated in future work!

I Our nonparametric model is competitive withTREC TTG systems, despite using a muchweaker retrieval model.


Acknowledgments

I National Institutes for Health(R01GM112697-01)

I A Focused Research Award for computationaljournalism from Google

I CNewsStory 2016 reviewersI Patrick Violette and Irfan Essa


References I

Blei, D. M. & Frazier, P. I. (2011). Distance dependent chinese restaurant processes. Journal of Machine LearningResearch, 12(Aug), 2461–2488.

Doyle, G. & Elkan, C. (2009). Accounting for burstiness in topic models. In Proceedings of the 26th AnnualInternational Conference on Machine Learning, (pp. 281–288). ACM.

Ihler, A., Hutchins, J., & Smyth, P. (2006). Adaptive event detection with time-varying poisson processes. InKDD, (pp. 207–216). ACM.

Lv, C., Fan, F., Qiang, R., Fei, Y., & Yang, J. (2014). PKUICST at TREC 2014 Microblog Track: featureextraction for effective microblog search and adaptive clustering algorithms for TTG. Technical report, DTICDocument.

Magdy, W., Gao, W., Elganainy, T., & Wei, Z. (2014). Qcri at trec 2014: applying the kiss principle for the ttgtask in the microblog track. Technical report, DTIC Document.

Marcus, A., Bernstein, M. S., Badar, O., Karger, D. R., Madden, S., & Miller, R. C. (2011). Twitinfo: aggregatingand visualizing microblogs for event exploration. In chi, (pp. 227–236). ACM.

Minka, T. (2012). Estimating a dirichlet distribution.http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf.

Wang, X. & McCallum, A. (2006). Topics over time: a non-markov continuous-time model of topical trends. InProceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining,(pp. 424–433). ACM.

Xu, T., McNamee, P., & Oard, D. W. (2014). Hltcoe at trec 2014: Microblog and clinical decision support.


http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf

Documents

Nonparametric Bayesian Storyline Detection from Microtexts · About time Prior approaches to modeling time I Maximum temporal gap between items on same storyline I Look for attention