Topic Discovery in Unstructured Data: The Next Generation

DESCRIPTION

Looks at text clustering with probabilistic methods (LDA) and at correlating the resulting topics with structured data, in particular geolocation.

Page 1: Topic Discovery in Unstructured Data: The Next Generation

Web Science & Technologies

University of Koblenz ▪ Landau, Germany

Topic Discovery in Unstructured Data:

The Next Generation

Christoph Kling, Sergej Sizov, Steffen Staab

Page 2: Topic Discovery in Unstructured Data: The Next Generation

Steffen Staab, Topic Detection - TNG

WeST

Understanding Social Media: Example Yahoo News Comments

• Many comments

• More opinions

• Commenting on different (sub)topics

10.09.12

Page 3: Topic Discovery in Unstructured Data: The Next Generation

Discovering topics using LDA
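
To make the LDA step concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the implementation used in the talk: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and their default values are assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    docs is a list of token lists. Returns one {word: probability}
    dict per topic. All names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts ...
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # ... and resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed word distribution of every topic.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
```

Calling `lda_gibbs(docs, num_topics=2)` on a handful of tokenised comments returns two word distributions; sorting each by probability gives the kind of ranked word list shown on the visualisation slides.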

Page 4: Topic Discovery in Unstructured Data: The Next Generation

Browse by topic

Page 5: Topic Discovery in Unstructured Data: The Next Generation

We have: Topic-Document – All Fine?

How do we understand the topics?

Are all topics of the same value?

Is there structured data to correlate?

• Space

• Time

• Network information

We work on:

• Opinions about topics

• Diversity of opinions

• Localisation of topics

• Time-varying topic models (Blei, Lafferty)

• ....

• Geo-varying topic models

Page 6: Topic Discovery in Unstructured Data: The Next Generation

Geo-located social media content

Figure: map with geo-located content labelled by car brand (Chevrolet, Pontiac, BMW, Audi, Mercedes, Fiat, Citroen, Peugeot, Renault).

Page 7: Topic Discovery in Unstructured Data: The Next Generation

Geo-located social media content

Figure: map with geo-located content labelled by car brand, grouped into three topics:

• citroen, renault, peugeot, bmw

• bmw, audi, mercedes, fiat, citroen

• chevrolet, pontiac, bmw, mercedes, audi

Page 8: Topic Discovery in Unstructured Data: The Next Generation

Related work

Figure: the car-brand map with its three topics, as on the previous slide.

LGTA, Yin et al. 2011

Page 9: Topic Discovery in Unstructured Data: The Next Generation

Problem

Geographical distribution of topics

Figure: two example maps, one of language areas and one of dominating religion.

Page 10: Topic Discovery in Unstructured Data: The Next Generation

Our approach

Figure: the car-brand map with its three topics, as on the previous slides.

Page 11: Topic Discovery in Unstructured Data: The Next Generation

Our approach

Figure: two copies of the car-brand map with its three topics.

Page 12: Topic Discovery in Unstructured Data: The Next Generation

Geographical network construction

Figure: three panels showing data points, spatial region centroids, and the geographical network.
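
The slide names three stages (data points, spatial region centroids, geographical network) without stating the algorithms behind them. The following is a minimal sketch under loud assumptions: k-means stands in for the step that finds region centroids, and a nearest-neighbour rule stands in for the step that links them. Neither choice, nor the function names `kmeans` and `neighbour_graph`, nor treating coordinates as planar, is taken from the talk.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns k centroids.

    Assumption: stands in for the (unstated) region-finding step.
    Coordinates are treated as planar, which is only a rough
    approximation for longitude/latitude.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return centroids


def neighbour_graph(centroids, num_neighbours=2):
    """Link every centroid to its nearest centroids.

    Assumption: a k-nearest-neighbour rule stands in for the (unstated)
    network construction. Returns undirected edges as (i, j) with i < j.
    """
    edges = set()
    for i, c in enumerate(centroids):
        others = sorted(
            (j for j in range(len(centroids)) if j != i),
            key=lambda j: math.dist(c, centroids[j]),
        )
        for j in others[:num_neighbours]:
            edges.add((min(i, j), max(i, j)))
    return edges
```

The resulting edge set is what a later step could use to decide which regions count as adjacent.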

Page 13: Topic Discovery in Unstructured Data: The Next Generation

Topic detection

Topic assignments

Page 18: Topic Discovery in Unstructured Data: The Next Generation

Topic detection

Topic exchange between adjacent clusters:

Figure: adjacent clusters labelled with car brands (Chevrolet, Pontiac, BMW).

Page 19: Topic Discovery in Unstructured Data: The Next Generation

Topic detection

Topic exchange between adjacent clusters:

Figure: document 1 and spatial regions A, B, C and D, labelled with car brands (Chevrolet, Pontiac, BMW).

Page 21: Topic Discovery in Unstructured Data: The Next Generation

Topic detection

is drawn from with equal probability

Figure: document 1 and spatial regions A, B, C and D, labelled with car brands (Chevrolet, Pontiac, BMW).
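
The slide speaks of drawing "with equal probability" in the context of topic exchange between adjacent clusters. The following is a minimal sketch, assuming this means that a document picks, uniformly at random, one region among its own and the adjacent ones as the source of a topic. That reading, the function name `pick_source_region` and the example adjacency are assumptions, not the authors' stated model.

```python
import random


def pick_source_region(region, adjacency, rng=random):
    """Pick the region a document takes a topic from.

    Assumption: the document's own region and each adjacent region
    are equally likely. adjacency maps a region to its neighbours.
    """
    candidates = [region] + sorted(adjacency.get(region, ()))
    return rng.choice(candidates)


# Hypothetical adjacency between four spatial regions.
EXAMPLE_ADJACENCY = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"C"},
}
```

With this example adjacency, a document in region A would take its topic from A, B or C, each with probability one third.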

Page 22: Topic Discovery in Unstructured Data: The Next Generation

Visualisation

chevrolet 0.35, bmw 0.18, cadillac 0.16, pontiac 0.09, gmc 0.07, buick 0.06, audi 0.05

bmw 0.29, audi 0.18, fiat 0.10, citroen 0.09, renault 0.09, peugeot 0.08, mercedesbenz 0.06, chevrolet 0.05

Page 23: Topic Discovery in Unstructured Data: The Next Generation

Visualisation

fiat 0.66, bmw 0.10, citroen 0.09, renault 0.05

pontiac 0.92

bmw 0.63, mercedesbenz 0.17, audi 0.13

renault 0.28, citroen 0.22, peugeot 0.15, bmw 0.10, audi 0.09, fiat 0.07

Page 24: Topic Discovery in Unstructured Data: The Next Generation

Topic Detection: The next generation

GeoMTD

• Better understandability: "nicer regions"

• Improved quality

• Better explanation of the data

• Measured in terms of reduced perplexity

• Perplexity about half that of related work
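
Perplexity, the quality measure named above, is the exponential of the negative mean log-likelihood per held-out token; lower is better. The following is a minimal sketch, assuming the model's probability for each held-out token is already available. The function name `perplexity` and its input format are illustrative assumptions.

```python
import math


def perplexity(token_probabilities):
    """Perplexity of a model on held-out tokens.

    token_probabilities holds the probability the model assigned to
    each observed token (assumed input format). The result is
    exp(-mean log-probability), i.e. the inverse geometric mean.
    """
    log_likelihood = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_likelihood / len(token_probabilities))
```

A model that gives every token probability 0.25 has perplexity 4 (up to floating-point rounding), the same uncertainty as a uniform choice among four words. Halving the perplexity corresponds to doubling the geometric mean of the per-token probabilities.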

Page 25: Topic Discovery in Unstructured Data: The Next Generation

Topic Detection: The next generation

Other next-generation mechanisms for understanding social media:

• Opinions: adding vocabularies with meaning (LIWC, POMS, ...)

• Diversity: maximizing the spread of topics and opinions

• Author-topic-time...

Need to balance between complexity of model and sparsity of data!

Page 26: Topic Discovery in Unstructured Data: The Next Generation

Web Science & Technologies

University of Koblenz ▪ Landau, Germany

Thank you for your attention!

Page 27: Topic Discovery in Unstructured Data: The Next Generation

References

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei: Hierarchical Dirichlet processes. In: Journal of the American Statistical Association, Vol. 101 (2006), p. 1566-1581.

Sergej Sizov: GeoFolk: latent spatial semantics in web 2.0 social media. In: WSDM, ACM (2010), p. 281-290.

Zhijun Yin, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas S. Huang: Geographical topic discovery and comparison. In: WWW, ACM (2011), p. 247-256.

Kevin Robert Canini and Thomas L. Griffiths: A Nonparametric Bayesian Model of Multi-Level Category Learning. In: AAAI, AAAI Press (2011).

Nasir Naveed, Thomas Gottron, Sergej Sizov, and Steffen Staab: FREuD: Feature-Centric Sentiment Diversification of Online Discussions. In: WebSci'12: Proceedings of the 4th International Conference on Web Science, ACM (2012).

Nasir Naveed, Sergej Sizov, and Steffen Staab: ATTention: Understanding Authors and Topics in Context of Temporal Evolution. In: European Conference on Information Retrieval 2011, Springer (2011), p. 733-737.

Further papers about our work are currently in preparation. Contact us if interested.