1
Clustering short status messages: A topic model based approach
Masters Thesis Defense
Anand Karandikar
Advisor: Dr. Tim Finin
Date: 26th July 2010
Time: 9:00 am
Place: ITE 325B
http://www.binterest.com/
2
Thesis Contributions
• Determine a topic model that is "optimal" for clustering tweets, by identifying good parameters for building the model: dataset type, dataset size, and number of topics.
• Cluster tweets based on topic similarity.
• Cluster Twitter users using topic models.
3
Outline
• Introduction
• Motivation
• Related work
• Approach
• Experiments and results
• Conclusion
• Future work
4
Rise of online social media
• Ability to rapidly disseminate information; a medium of communication and information sharing.
• Twitter, Facebook, Flickr and YouTube facilitate information sharing via text, hyperlinks, photos, video, etc.
• Status updates, or tweets (for Twitter), can contain text, emoticons, links, or a combination of these.
5
Basics…
• Topic models are generative models.
• The basic idea is to describe a document as a mixture of different topics.
• A topic is simply a collection of words that frequently occur with each other.

Properties of interest
Bag of words model, unsupervised learning, identification of latent relationships in the data, documents represented as numerical vectors
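As a minimal illustration of the bag-of-words property above, a document reduces to a vector of word counts over a fixed vocabulary, with word order ignored (the vocabulary and tweet here are made-up examples):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Represent a document as a word-count vector over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["earthquake", "haiti", "storm", "rain"]
tweet = "Haiti earthquake relief: another earthquake aftershock reported"
print(bow_vector(tweet, vocab))  # -> [2, 1, 0, 0]: only counts survive, not order
```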
6
Motivation
• Content-oriented analysis applying NLP techniques is difficult:
a. Short message length, about 140 characters
b. Lack of grammar rules; use of abbreviations and slang
c. Implied references to entities
• Topic models can address the above difficulties.
• Clustering will help the research community categorize tweets based on their content without the need for labeled data.
• Such clustering will further help users discover other users who post about topics of interest to them.
7
Related Work
• Discovering topics covered by papers in PNAS; these were used to identify relationships between various science disciplines and to find the latest trends.
• Author-topic models, used to discover topic trends and find the authors most likely to write on certain topics.
• Detecting topics in biomedical text, performing topic-based clustering with unsupervised hierarchical clustering algorithms.
8
Related Work
• Smarter BlogRoll augments a blogroll with information about the current topics of the blogs in that blogroll.
• Map content in Twitter feed into dimensions that correspond roughly to substance, style, status and social characteristics of posts.
• Identify latent patterns like informational and emotional messages in Earthquake and Tsunami data sets collected from Twitter.
9
Problem 1
• Topic models can be trained using different datasets, varying size of training data and varying number of topics.
Problem Definition:
Given topic models with varying parameters, determine which topic model configuration is "optimal" for clustering tweets.
10
Problem 2
Problem Definition:
Given a set of Twitter users and their tweets, cluster the users based on similarity in the content they tweet about.
11
Twitterdb dataset
The total collection is about 150 million tweets from 1.5 million users, collected over a period of 20 months (during 2007–2008)
Language Percentage
English 32.4 %
Scots 12.5 %
Japanese 7.4 %
Catalan 5.2 %
German 3.9 %
Danish 3.1 %
Approx. 48 million English tweets that can be used
TAC KBP Corpus
• This was the 2009 TAC KBP corpus, with approximately 377K newswire articles from Agence France-Presse (AFP)
• About half the articles were from 2007 and half from 2008, with a few (less than 1%) from 1994–2006
12
Disaster Events dataset
Event Name Source
DC snow Twitter API
NE thunderstorm Twitter API
Haiti earthquake Twitter API
Afghanistan war Twitter API
China mine blasts Twitter API
Gulf oil spills Twitter API
California fires Twitterdb
Gustav hurricane Twitterdb
1500 tweets per event
Hence a total of 12k tweets
13
Supplementary test dataset
Event Name Source # tweets
Hurricane Alex Twitter API 624
China earthquake Twitterdb 376
We manually scanned all 1000 tweets to make sure they are relevant to the respective event.
Sample Twitter API queries
Using words, hashtags and date ranges for querying
Haiti earthquake in Jan 2010: haiti earthquake #haiti since:2010-01-12 until:2010-01-16

Using words, date ranges and location
Washington DC snow blizzard in Feb 2010: snow since:2010-02-25 until:2010-02-28 near:"Washington DC" within:25mi
Eyeballing the results showed that approximately 97% of the tweets obtained this way were relevant to the corresponding event in our Disaster Events dataset.
Approach
14
[Approach diagram: training corpus → MALLET topic modeler (with topic model configuration parameters) → topic inference file; Disaster Events data (12,000 tweets) → 12,000 topic vectors → clustering output]
15
Topic modeler
Why MALLET?
a. Open source.
b. Extremely fast and highly scalable implementation of Gibbs sampling.
c. Tools to infer topics from new documents.
http://mallet.cs.umass.edu/
Steps involved in building a topic model
Input: prune the dataset; convert input data to MALLET's internal data format.
Training: 'train-topics' command; 200 to 400 topics for fine granularity.
Output: inference file; top 'k' words associated with each topic.
16
Topic to word association
17
Topic model configurations
Training corpus | Size of training corpus | # topics
Twitterdb | 5, 10, 15, 16, 17, 18, 19, 20, 40 million tweets | 200, 300, 400
TAC KBP | Approx. 377k documents | 200, 300, 400
Topic vectors
• Generated using the previously built inference file.
• The output is a topic vector for every document: a distribution over the topics.
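A minimal sketch of this inference step, turning unseen tweets into topic vectors, again with scikit-learn as an assumed stand-in for MALLET's inference file (toy data, illustrative parameters):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = ["storm rain flood coast", "rain flood winds",
         "fire smoke burn", "fire burn damage"]
vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(
    vec.fit_transform(train))

# Inference: each new tweet becomes a topic vector -- a probability
# distribution over the model's topics, so each row sums to 1.
new_tweets = ["flood warning rain", "smoke and fire damage"]
topic_vectors = lda.transform(vec.transform(new_tweets))
for tv in topic_vectors:
    print([round(p, 2) for p in tv])
```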
18
Clustering
[Clustering diagram: topic vectors (CSV format) → MDS and k-means clustering in the R analysis package → induced clusters 1–5]
MDS: a common way to visualize N-dimensional data by exploring similarities and dissimilarities in it. The cmdscale command in R takes a distance matrix indicating dissimilarities between the row vectors, and outputs a set of points such that the distances between them are proportional to those dissimilarities.

k-means: aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The kmeans command in R takes the MDS output and returns the data points with associated cluster IDs.
a. Widely used for statistical computing and visualizations of large datasets.
b. Built-in functions and rich data structures.
c. Open source.
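The MDS-then-k-means pipeline above can be sketched as follows, using Python/scikit-learn as a stand-in for R's cmdscale and kmeans (the toy "topic vectors" and all parameter choices are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# Toy "topic vectors": two well-separated groups of 5-D points.
rng = np.random.default_rng(0)
vectors = np.vstack([rng.normal(0.0, 0.05, (10, 5)),
                     rng.normal(1.0, 0.05, (10, 5))])

# MDS (analogue of R's cmdscale): embed a dissimilarity matrix in 2-D
# so that inter-point distances reflect the original dissimilarities.
dist = squareform(pdist(vectors))
points = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

# k-means (analogue of R's kmeans) on the MDS output: each point
# gets the cluster ID of the nearest centroid.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)
```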
19
Sample 2-D clustering output via R
Clustering with k = 8 on disaster events dataset using topic model trained on TAC KBP news wire corpus with # topics=200
20
Sample 3-D plot
Clustering with k = 8 on disaster events dataset using topic model trained on TAC KBP news wire corpus with # topics=200
21
Evaluation
[Evaluation diagram: the same 12k tweets pass through the previously trained topic model to give 12,000 topic vectors; MDS and k-means then produce induced clusters, which are compared against the 8 original clusters (1500 tweets per cluster).]
22
Evaluation Parameters
Clustering parameters
a. Residual Sum of Squares (RSS)
b. Cluster cardinality
c. Cluster centers and iterations for convergence
d. Cluster validations – cardinality and goodness
e. Clustering accuracy

Topic model parameters
f. Training corpus size
g. Training corpus type – newswire and twitterdb
h. Number of topics
23
Residual Sum of Squares (RSS)
RSS is the squared distance of each vector from its cluster centroid, summed over all vectors in the cluster:

RSS_k = Σ_{x ∈ ω_k} |x − μ(ω_k)|²

where μ(ω_k) is the centroid of cluster ω_k, given by

μ(ω_k) = (1/|ω_k|) Σ_{x ∈ ω_k} x

Hence, the RSS for a clustering output with K clusters is given by

RSS = Σ_{k=1}^{K} RSS_k
Smaller value of RSS indicates tighter clusters.
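A small worked example of the RSS definition above (the four 2-D points are made up for illustration):

```python
import numpy as np

def rss(vectors, labels, centroids):
    """Residual sum of squares: squared distance of each vector from its
    cluster centroid, summed over all vectors and all clusters."""
    return sum(np.sum((vectors[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
# Centroids are the per-cluster means: (0, 1) and (10, 1).
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(rss(X, labels, centroids))  # each point is 1 unit from its centroid: 4.0
```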
24
Cluster Cardinality
A heuristic to choose the number of clusters for the k-means algorithm, as described in [1]:
a. Perform clustering i times (we use i = 10) for a given value of k, recording the RSS each time.
b. Take the minimum RSS value, denoted RSSmin.
c. Compute RSSmin for increasing values of k.
d. Find the 'knee' in the curve, i.e. the point where the successive decrease in RSSmin is smallest. This value of k is the cluster cardinality.
[1] Manning, Christopher, D.; Raghavan, P.; and Schutze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
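The heuristic in steps a–d can be sketched as follows; scikit-learn's inertia_ is the within-cluster sum of squares, i.e. the RSS (the three-blob data and all parameters are illustrative assumptions, not thesis data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated 2-D blobs, so the knee should appear near k = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, (30, 2)) for c in (0.0, 5.0, 10.0)])

rss_min = {}
for k in range(2, 7):
    # Steps a/b: cluster i = 10 times for this k and keep the smallest RSS.
    runs = [KMeans(n_clusters=k, n_init=1, random_state=s).fit(X).inertia_
            for s in range(10)]
    rss_min[k] = min(runs)

# Steps c/d: inspect how RSSmin falls as k grows; the point where the
# successive decrease flattens out marks the cluster cardinality.
for k, v in sorted(rss_min.items()):
    print(k, round(v, 3))
```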
25
RSSmin versus k
RSSmin k
0.6903 3
0.4220 4
0.3662 5
0.2581 6
0.2391 7
0.2192 8
0.2098 9
0.1594 10
0.1469 11
0.1204 12
0.0999 13
RSSmin and k for the twitterdb-trained topic model with 200 topics
26
Cluster centers and iterations
• K-means in the R analysis package randomly chooses data rows as initial cluster centers.
• The default number of iterations performed until convergence is reached is 10.
• We built more than 27 different topic models and performed k-means clustering for each; barring just 3 cases, convergence was reached within 10 iterations.
• In those 3 cases, convergence was achieved by setting the number of iterations to 15.
27
Cluster validations
a. Cluster cardinality using RSSmin versus k
b. Goodness of clustering itself using Jaccard coefficient
Jaccard coefficient
The higher the Jaccard coefficient, the more similar an induced cluster is to an original cluster.
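The Jaccard coefficient is the standard set-overlap measure J(A, B) = |A ∩ B| / |A ∪ B|. A minimal sketch, using the Hurricane Alex cluster sizes reported later in the deck (572 induced, 624 original, 403 shared) as illustrative set sizes; the tweet-ID ranges are made up:

```python
def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B| -- overlap between an induced
    cluster and an original cluster, viewed as sets of tweet IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

induced = range(0, 572)     # 572 tweet IDs in an induced cluster
original = range(169, 793)  # 624 tweet IDs in the original cluster
print(round(jaccard(induced, original), 3))  # 403 shared of 793 total -> 0.508
```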
Effect of change in training data size on Jaccard coefficient
28
Case # | Training size (tweets in millions)
1 | 5
2 | 10
3 | 16
4 | 17
5 | 18
6 | 19
7 | 20
8 | 40
#topics = 200, twitterdb training data. Similar results were obtained for topic models with #topics = 300.
[Bar chart: Jaccard coefficient per event (DC Snow, California Fire, NE Thunderstorm, China Mine Blasts, Afghan War, Gulf Oil Spills, Gustav Hurricane, Haiti Earthquake) for Cases 1–8.]
29
Effect of change in training data type on Jaccard coefficient
#topics = 200; we compare the best model from the previous slide with the newswire-trained model.
30
Effect of change in # topics on Jaccard coefficient
[Bar chart: Jaccard coefficient per event for models with 200, 300 and 400 topics (T_200, T_300, T_400).]
• All models were trained with the same 16 million tweets from twitterdb
31
Selecting an optimal topic model
• #topics = 300
• The TAC KBP-trained model outperforms the twitterdb-trained models
Hence, the TAC KBP-trained topic model with 300 topics is the optimal one.
[Bar chart: Jaccard coefficient per event for TAC KBP-trained models with 200, 300 and 400 topics (AFP_200, AFP_300, AFP_400).]
32
Jaccard coefficient matrix (rows: induced clusters; columns: original clusters, in the order DC snow, California fire, NE thunderstorm, China mine blasts, Afghan war, Gulf Oil Spills, Gustav hurricane, Haiti Earthquake)
DC snow | 0.505 0.028 0.231 0.016 0.009 0.046 0.12 0.048
California fire | 0.024 0.483 0.042 0.127 0.139 0.08 0.046 0.061
NE thunderstorm | 0.141 0.012 0.498 0.004 0.016 0.111 0.213 0.009
China mine blasts | 0.008 0.092 0.016 0.546 0.21 0.024 0.003 0.101
Afghan war | 0.019 0.136 0.026 0.124 0.49 0.016 0.097 0.098
Gulf Oil Spills | 0.089 0.071 0.009 0.066 0.117 0.527 0.018 0.083
Gustav hurricane | 0.178 0.061 0.18 0.037 0.002 0.096 0.492 0.101
Haiti Earthquake | 0.051 0.134 0.003 0.108 0.09 0.101 0.014 0.499
33
Observations based on Jaccard coefficient matrix
Induced Cluster Top 5 most frequent words from event datasets
Afghan war war, fires, army, terrorist, kill
California fire fire, burn, smoke, damage, west
NE thunderstorm storm, winds, rain, warning, people
Gustav hurricane hurricane, storm, floods, heavy, weather
Topic keys generated by MALLET
fire, california, fires, damage, police, killed, shot, attack, died, injured, wounded
storm, people, hurricane, rain, rains, flood, flooding, coast, mexico, areas
34
Accuracy on test data
Cluster Name | Size of induced cluster (A) | Correctly clustered tweets (A ∩ B) | Original cluster size (B) | Jaccard coefficient | Accuracy
Hurricane Alex | 572 | 403 | 624 | 0.508 | 64.58%
China earthquake | 428 | 263 | 376 | 0.486 | 69.94%
Baseline for comparison
A framework to classify short and sparse text, by Phan, X. H.; Nguyen, L. M.; and Horiguchi, S. (2008): accuracy of around 67% using 22.5k training documents and 200 topics, with topic models built via Gibbs sampling.
35
Clustering Twitter users
• 21 well-known Twitter users across 7 different domains
• 100 tweets per user via the Twitter API
Domain Twitter users
Sports @ESPN, @Lakers, @NBA
Travel Reviews @Frommers, @TravBuddy, @mytravelguide
Finance @CBOE, @CNNMoney, @nysemoneysense
Movies @imdb, @peoplemag, @RottenTomatoes, @eonline
Technology News @Techcrunch, @digg_technews
Gaming @EASPORTS, @IGN, @NeedforSpeed
Breaking News @foxnews, @msnbc, @abcnews
Users were obtained via http://www.twellow.com/ (like yellow pages for Twitter).
36
Results for Twitter user clustering
Cluster # | Twitter users
1 @CBOE, @msnbc, @CNNMoney, @nysemoneysense
2 @IGN, @NeedforSpeed
3 @EASPORTS, @ESPN, @Lakers, @NBA
4 @Frommers, @TravBuddy, @mytravelguide
5 @imdb, @peoplemag, @RottenTomatoes, @eonline
6 @foxnews, @abcnews, @TechCrunch, @digg_technews
37
Conclusions
• We have empirically shown how to select a topic model by considering various topic model and clustering parameters, and supplied statistical evidence for the choice.
• We showed that a newswire-trained topic model performs better than a twitterdb-trained topic model for clustering tweets.
• We obtained approximately 65% accuracy for clustering tweets in the test dataset.
• We also showed the usefulness of topic models for clustering Twitter users.
38
Future Work
• Using a faster implementation for k-means
• How can we make the implementation scalable to cluster tweets at real time?
• Extending the work to cluster Facebook status messages.
39
References
[1] Java, A.; Song, X.; Finin, T.; and Tseng, B. 2007. Why we twitter: Understanding microblogging usage and communities. WebKDD/SNA-KDD 2007.
[2] Kireyev, K.; Palen, L.; and Anderson, A. 2009. Applications of topic models to analysis of disaster-related twitter data. NIPS Workshop 2009.
[3] Kuropka, D., and Becker, J. 2003. Topic-based vector space model.
[4] Lee, M.; Wang, W.; and Yu, H. Exploring supervised and unsupervised methods to detect topics in biomedical text.
[5] MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 281–297.
[6] Manning, C. D.; Raghavan, P.; and Schutze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
[7] McCallum, A.; Corrada-Emmanuel, A.; and Wang, X. Topic and role discovery in social networks.
[8] McCallum, A. K. 2002. MALLET: A machine learning for language toolkit.
[9] Murnane, W. 2010. Improving accuracy of named entity recognition on social media data. Master's thesis, University of Maryland, Baltimore County.
[10] Phan, X. H.; Nguyen, L. M.; and Horiguchi, S. 2008. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International World Wide Web Conference (WWW 2008), 91–100.
[11] R Development Core Team. 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
[12] Ramage, D.; Dumais, S.; and Liebling, D. Characterizing microblogs with topic models. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media.
[13] Starbird, K.; Palen, L.; Hughes, A.; and Vieweg, S. 2010. Chatter on the red: What hazards threat reveals about the social life of microblogged information. ACM CSCW 2010.
[14] Steyvers, M., and Griffiths, T. 2007. Probabilistic Topic Models. Lawrence Erlbaum Associates.
[15] Steyvers, M.; Griffiths, T. H.; and Smyth, P. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[16] Vieweg, S.; Hughes, A.; Starbird, K.; and Palen, L. 2010. Supporting situational awareness in emergencies using microblogged information. ACM Conference on Human Factors in Computing Systems 2010.
[17] Yardi, S.; Romero, D.; Schoenebeck, G.; and Boyd, D. 2010. Detecting spam in a twitter network. First Monday 15:1–4.
[18] Zhao, D., and Rosson, M. B. 2009. How and why people twitter: The role that microblogging plays in informal communication at work.
41
Questions?
Thank you!
Acknowledgements
Advisor, committee members and eBiquity members.