1
Clustering short status messages: A topic model based approach
Masters Thesis Defense
Anand Karandikar
Advisor: Dr. Tim Finin
Date: 26th July 2010
Time: 9:00 am
Place: ITE 325B
http://www.binterest.com/
2
Thesis Contributions
• Determine a topic model that is "optimal" for clustering tweets, by identifying good parameters for building the model: dataset type, dataset size, and number of topics.
• Cluster tweets based on topic similarity.
• Cluster Twitter users using topic models.
3
Outline
• Introduction
• Motivation
• Related work
• Approach
• Experiments and results
• Conclusion
• Future work
4
Rise of online social media
• Ability to rapidly disseminate information; a medium of communication and information sharing.
• Twitter, Facebook, Flickr and YouTube facilitate information sharing via text, hyperlinks, photos, video, etc.
• Status updates, or tweets (for Twitter), can contain text, emoticons, links, or a combination of these.
5
Basics…
• Topic models are generative models.
• The basic idea is to describe a document as a mixture of different topics.
• A topic is simply a collection of words that frequently occur with each other.

Properties of interest
Bag of words model, unsupervised learning, identification of latent relationships in the data, documents represented as numerical vectors
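As a minimal illustration of the bag-of-words property above, a document reduces to a vector of word counts over a fixed vocabulary, with word order ignored (the vocabulary and tweet here are made-up examples):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Represent a document as a word-count vector over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["earthquake", "haiti", "storm", "rain"]
tweet = "Haiti earthquake relief: another earthquake aftershock reported"
print(bow_vector(tweet, vocab))  # -> [2, 1, 0, 0]: only counts survive, not order
```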
6
Motivation
• Content-oriented analysis applying NLP techniques is difficult:
a. Short message length, about 140 characters
b. Lack of grammar rules; use of abbreviations and slang
c. Implied references to entities
• Topic models can address the above difficulties.
• Clustering will help the research community categorize tweets based on their content without the need for labeled data.
• Such clustering will further help users discover other users who post about topics of interest to them.
7
Related Work
• Discovering topics covered by papers in PNAS; these were used to identify relationships between various science disciplines and to find the latest trends.
• Author-topic models, used to discover topic trends and find the authors most likely to write on certain topics.
• Detecting topics in biomedical text, performing topic-based clustering with unsupervised hierarchical clustering algorithms.
8
Related Work
• Smarter BlogRoll augments a blogroll with information about the current topics of the blogs in that blogroll.
• Map content in Twitter feed into dimensions that correspond roughly to substance, style, status and social characteristics of posts.
• Identify latent patterns like informational and emotional messages in Earthquake and Tsunami data sets collected from Twitter.
9
Problem 1
• Topic models can be trained using different datasets, varying size of training data and varying number of topics.
Problem Definition:
Given topic models with varying parameters, determine which topic model configuration is "optimal" for clustering tweets.
10
Problem 2
Problem Definition:
Given a set of Twitter users and their tweets, cluster the users based on similarity in the content they tweet about.
11
Twitterdb dataset
The total collection is about 150 million tweets from 1.5 million users, collected over a period of 20 months (during 2007–2008)
Language Percentage
English 32.4 %
Scots 12.5 %
Japanese 7.4 %
Catalan 5.2 %
German 3.9 %
Danish 3.1 %
Approx. 48 million English tweets that can be used
TAC KBP Corpus
• This was the 2009 TAC KBP corpus, with approximately 377K newswire articles from Agence France-Presse (AFP)
• About half the articles were from 2007 and half from 2008, with a few (less than 1%) from 1994–2006
12
Disaster Events dataset
Event Name Source
DC snow Twitter API
NE thunderstorm Twitter API
Haiti earthquake Twitter API
Afghanistan war Twitter API
China mine blasts Twitter API
Gulf oil spills Twitter API
California fires Twitterdb
Gustav hurricane Twitterdb
1500 tweets per event
Hence a total of 12k tweets
13
Supplementary test dataset
Event Name Source # tweets
Hurricane Alex Twitter API 624
China earthquake Twitterdb 376
We manually scanned all 1000 tweets to make sure they are relevant to the respective event.
Sample Twitter API queries
Using words, hashtags and date ranges for querying
Haiti earthquake in Jan 2010: haiti earthquake #haiti since:2010-01-12 until:2010-01-16

Using words, date ranges and location
Washington DC snow blizzard in Feb 2010: snow since:2010-02-25 until:2010-02-28 near:"Washington DC" within:25mi
Eyeballing the results showed that approximately 97% of the tweets obtained this way were relevant to the corresponding event in our Disaster Events dataset.
Approach
14
[Approach diagram: training corpus → MALLET topic modeler (with topic model configuration parameters) → topic inference file; Disaster Events data (12,000 tweets) → 12,000 topic vectors → clustering output]
15
Topic modeler
Why MALLET?
a. Open source.
b. Extremely fast and highly scalable implementation of Gibbs sampling.
c. Tools to infer topics from new documents.
http://mallet.cs.umass.edu/
Steps involved in building a topic model
Input: prune the dataset; convert input data to MALLET's internal data format.
Training: 'train-topics' command; 200 to 400 topics for fine granularity.
Output: inference file; top 'k' words associated with each topic.
16
Topic to word association
17
Topic model configurations
Training corpus | Size of training corpus | # topics
Twitterdb | 5, 10, 15, 16, 17, 18, 19, 20, 40 million tweets | 200, 300, 400
TAC KBP | Approx. 377k documents | 200, 300, 400
Topic vectors
• Generated using the previously built inference file.
• The output is a topic vector for every document: a distribution over the topics.
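A minimal sketch of this inference step, turning unseen tweets into topic vectors, again with scikit-learn as an assumed stand-in for MALLET's inference file (toy data, illustrative parameters):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = ["storm rain flood coast", "rain flood winds",
         "fire smoke burn", "fire burn damage"]
vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(
    vec.fit_transform(train))

# Inference: each new tweet becomes a topic vector -- a probability
# distribution over the model's topics, so each row sums to 1.
new_tweets = ["flood warning rain", "smoke and fire damage"]
topic_vectors = lda.transform(vec.transform(new_tweets))
for tv in topic_vectors:
    print([round(p, 2) for p in tv])
```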
18
Clustering
[Clustering diagram: topic vectors (CSV format) → MDS and k-means clustering in the R analysis package → induced clusters 1–5]
MDS: a common way to visualize N-dimensional data by exploring similarities and dissimilarities in it. The cmdscale command in R takes a distance matrix indicating dissimilarities between the row vectors, and outputs a set of points such that the distances between them are proportional to those dissimilarities.

k-means: aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The kmeans command in R takes the MDS output and returns the data points with associated cluster IDs.
a. Widely used for statistical computing and visualizations of large datasets.
b. Built-in functions and rich data structures.
c. Open source.
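The MDS-then-k-means pipeline above can be sketched as follows, using Python/scikit-learn as a stand-in for R's cmdscale and kmeans (the toy "topic vectors" and all parameter choices are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# Toy "topic vectors": two well-separated groups of 5-D points.
rng = np.random.default_rng(0)
vectors = np.vstack([rng.normal(0.0, 0.05, (10, 5)),
                     rng.normal(1.0, 0.05, (10, 5))])

# MDS (analogue of R's cmdscale): embed a dissimilarity matrix in 2-D
# so that inter-point distances reflect the original dissimilarities.
dist = squareform(pdist(vectors))
points = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

# k-means (analogue of R's kmeans) on the MDS output: each point
# gets the cluster ID of the nearest centroid.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)
```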
19
Sample 2-D clustering output via R
Clustering with k = 8 on disaster events dataset using topic model trained on TAC KBP news wire corpus with # topics=200
20
Sample 3-D plot
Clustering with k = 8 on disaster events dataset using topic model trained on TAC KBP news wire corpus with # topics=200
21
Evaluation
[Evaluation diagram: the same 12k tweets pass through the previously trained topic model to give 12,000 topic vectors; MDS and k-means then produce induced clusters, which are compared against the 8 original clusters (1500 tweets per cluster).]
22
Evaluation Parameters
Clustering parameters
a. Residual Sum of Squares (RSS)
b. Cluster cardinality
c. Cluster centers and iterations for convergence
d. Cluster validations – cardinality and goodness
e. Clustering accuracy

Topic model parameters
f. Training corpus size
g. Training corpus type – newswire and twitterdb
h. Number of topics
23
Residual Sum of Squares (RSS)
RSS is the squared distance of each vector from its cluster centroid, summed over all vectors in the cluster:

RSS_k = Σ_{x ∈ ω_k} |x − μ(ω_k)|²

where μ(ω_k) is the centroid of cluster ω_k, given by

μ(ω_k) = (1/|ω_k|) Σ_{x ∈ ω_k} x

Hence, the RSS for a clustering output with K clusters is given by

RSS = Σ_{k=1}^{K} RSS_k
Smaller value of RSS indicates tighter clusters.
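A small worked example of the RSS definition above (the four 2-D points are made up for illustration):

```python
import numpy as np

def rss(vectors, labels, centroids):
    """Residual sum of squares: squared distance of each vector from its
    cluster centroid, summed over all vectors and all clusters."""
    return sum(np.sum((vectors[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
# Centroids are the per-cluster means: (0, 1) and (10, 1).
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(rss(X, labels, centroids))  # each point is 1 unit from its centroid: 4.0
```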
24
Cluster Cardinality
A heuristic to choose the number of clusters for the k-means algorithm, as described in [1]:
a. Perform clustering i times (we use i = 10) for a given value of k, recording the RSS each time.
b. Take the minimum RSS value, denoted RSSmin.
c. Compute RSSmin for increasing values of k.
d. Find the 'knee' in the curve, i.e. the point where the successive decrease in RSSmin is smallest. This value of k is the cluster cardinality.
[1] Manning, Christopher, D.; Raghavan, P.; and Schutze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
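The heuristic in steps a–d can be sketched as follows; scikit-learn's inertia_ is the within-cluster sum of squares, i.e. the RSS (the three-blob data and all parameters are illustrative assumptions, not thesis data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated 2-D blobs, so the knee should appear near k = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, (30, 2)) for c in (0.0, 5.0, 10.0)])

rss_min = {}
for k in range(2, 7):
    # Steps a/b: cluster i = 10 times for this k and keep the smallest RSS.
    runs = [KMeans(n_clusters=k, n_init=1, random_state=s).fit(X).inertia_
            for s in range(10)]
    rss_min[k] = min(runs)

# Steps c/d: inspect how RSSmin falls as k grows; the point where the
# successive decrease flattens out marks the cluster cardinality.
for k, v in sorted(rss_min.items()):
    print(k, round(v, 3))
```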
25
RSSmin versus k
RSSmin k
0.6903 3
0.4220 4
0.3662 5
0.2581 6
0.2391 7
0.2192 8
0.2098 9
0.1594 10
0.1469 11
0.1204 12
0.0999 13
RSSmin and k for the twitterdb-trained topic model with 200 topics
26
Cluster centers and iterations
• K-means in the R analysis package randomly chooses data rows as initial cluster centers.
• The default number of iterations performed until convergence is reached is 10.
• We built more than 27 different topic models and performed k-means clustering for each; barring just 3 cases, convergence was reached within 10 iterations.
• In those 3 cases, convergence was achieved by setting the number of iterations to 15.
27
Cluster validations
a. Cluster cardinality using RSSmin versus k
b. Goodness of clustering itself using Jaccard coefficient
Jaccard coefficient
The higher the Jaccard coefficient, the more similar an induced cluster is to an original cluster.
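The Jaccard coefficient is the standard set-overlap measure J(A, B) = |A ∩ B| / |A ∪ B|. A minimal sketch, using the Hurricane Alex cluster sizes reported later in the deck (572 induced, 624 original, 403 shared) as illustrative set sizes; the tweet-ID ranges are made up:

```python
def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B| -- overlap between an induced
    cluster and an original cluster, viewed as sets of tweet IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

induced = range(0, 572)     # 572 tweet IDs in an induced cluster
original = range(169, 793)  # 624 tweet IDs in the original cluster
print(round(jaccard(induced, original), 3))  # 403 shared of 793 total -> 0.508
```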
Effect of change in training data size on Jaccard coefficient
28
Case # | Training size (tweets in millions)
1 | 5
2 | 10
3 | 16
4 | 17
5 | 18
6 | 19
7 | 20
8 | 40
#topics = 200, twitterdb training data. Similar results were obtained for topic models with #topics = 300.
[Bar chart: Jaccard coefficient per event (DC Snow, California Fire, NE Thunderstorm, China Mine Blasts, Afghan War, Gulf Oil Spills, Gustav Hurricane, Haiti Earthquake) for Cases 1–8.]
29
Effect of change in training data type on Jaccard coefficient
#topics = 200; we compare the best model from the previous slide with the newswire-trained model.
30
Effect of change in # topics on Jaccard coefficient
[Bar chart: Jaccard coefficient per event for models with 200, 300 and 400 topics (T_200, T_300, T_400).]
• All models were trained with the same 16 million tweets from twitterdb
31
Selecting an optimal topic model
• #topics = 300
• The TAC KBP-trained model outperforms the twitterdb-trained models
Hence, the TAC KBP-trained topic model with 300 topics is the optimal one.
[Bar chart: Jaccard coefficient per event for TAC KBP-trained models with 200, 300 and 400 topics (AFP_200, AFP_300, AFP_400).]
32
Jaccard coefficient matrix (rows: induced clusters; columns: original clusters, in the order DC snow, California fire, NE thunderstorm, China mine blasts, Afghan war, Gulf Oil Spills, Gustav hurricane, Haiti Earthquake)
DC snow | 0.505 0.028 0.231 0.016 0.009 0.046 0.12 0.048
California fire | 0.024 0.483 0.042 0.127 0.139 0.08 0.046 0.061
NE thunderstorm | 0.141 0.012 0.498 0.004 0.016 0.111 0.213 0.009
China mine blasts | 0.008 0.092 0.016 0.546 0.21 0.024 0.003 0.101
Afghan war | 0.019 0.136 0.026 0.124 0.49 0.016 0.097 0.098
Gulf Oil Spills | 0.089 0.071 0.009 0.066 0.117 0.527 0.018 0.083
Gustav hurricane | 0.178 0.061 0.18 0.037 0.002 0.096 0.492 0.101
Haiti Earthquake | 0.051 0.134 0.003 0.108 0.09 0.101 0.014 0.499
33
Observations based on Jaccard coefficient matrix
Induced Cluster Top 5 most frequent words from event datasets
Afghan war war, fires, army, terrorist, kill
California fire fire, burn, smoke, damage, west
NE thunderstorm storm, winds, rain, warning, people
Gustav hurricane hurricane, storm, floods, heavy, weather
Topic keys generated by MALLET
fire, california, fires, damage, police, killed, shot, attack, died, injured, wounded
storm, people, hurricane, rain, rains, flood, flooding, coast, mexico, areas
34
Accuracy on test data
Cluster Name | Size of induced cluster (A) | Correctly clustered tweets (A ∩ B) | Original cluster size (B) | Jaccard coefficient | Accuracy
Hurricane Alex | 572 | 403 | 624 | 0.508 | 64.58%
China earthquake | 428 | 263 | 376 | 0.486 | 69.94%
Baseline for comparison
A framework to classify short and sparse text, by Phan, X. H.; Nguyen, L. M.; and Horiguchi, S. (2008): accuracy of around 67% using 22.5k training documents and 200 topics, with topic models built via Gibbs sampling.
35
Clustering Twitter users
• 21 well-known Twitter users across 7 different domains
• 100 tweets per user via the Twitter API
Domain Twitter users
Sports @ESPN, @Lakers, @NBA
Travel Reviews @Frommers, @TravBuddy, @mytravelguide
Finance @CBOE, @CNNMoney, @nysemoneysense
Movies @imdb, @peoplemag, @RottenTomatoes, @eonline
Technology News @Techcrunch, @digg_technews
Gaming @EASPORTS, @IGN, @NeedforSpeed
Breaking News @foxnews, @msnbc, @abcnews
Users were obtained via http://www.twellow.com/ (like yellow pages for Twitter).
36
Results for Twitter user clustering
Cluster # | Twitter users
1 @CBOE, @msnbc, @CNNMoney, @nysemoneysense
2 @IGN, @NeedforSpeed
3 @EASPORTS, @ESPN, @Lakers, @NBA
4 @Frommers, @TravBuddy, @mytravelguide
5 @imdb, @peoplemag, @RottenTomatoes, @eonline
6 @foxnews, @abcnews, @TechCrunch, @digg_technews
37
Conclusions
• We have empirically shown how to select a topic model by considering various topic model and clustering parameters, and supplied statistical evidence for the choice.
• We showed that a newswire-trained topic model performs better than a twitterdb-trained topic model for clustering tweets.
• We obtained approximately 65% accuracy for clustering tweets in the test dataset.
• We also showed the usefulness of topic models for clustering Twitter users.
38
Future Work
• Using a faster implementation for k-means
• How can we make the implementation scalable to cluster tweets at real time?
• Extending the work to cluster Facebook status messages.
39
References
[1] Java, A.; Song, X.; Finin, T.; and Tseng, B. 2007. Why we twitter: Understanding microblogging usage and communities. WebKDD/SNA-KDD 2007.
[2] Kireyev, K.; Palen, L.; and Anderson, A. 2009. Applications of topic models to analysis of disaster-related twitter data. NIPS Workshop 2009.
[3] Kuropka, D., and Becker, J. 2003. Topic-based vector space model.
[4] Lee, M.; Wang, W.; and Yu, H. Exploring supervised and unsupervised methods to detect topics in biomedical text.
[5] MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 281–297.
[6] Manning, C. D.; Raghavan, P.; and Schutze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
[7] McCallum, A.; Corrada-Emmanuel, A.; and Wang, X. Topic and role discovery in social networks.
[8] McCallum, A. K. 2002. MALLET: A machine learning for language toolkit.
[9] Murnane, W. 2010. Improving accuracy of named entity recognition on social media data. Master's thesis, University of Maryland, Baltimore County.
[10] Phan, X. H.; Nguyen, L. M.; and Horiguchi, S. 2008. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International World Wide Web Conference (WWW 2008), 91–100.
[11] R Development Core Team. 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
[12] Ramage, D.; Dumais, S.; and Liebling, D. Characterizing microblogs with topic models. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media.
[13] Starbird, K.; Palen, L.; Hughes, A.; and Vieweg, S. 2010. Chatter on the red: What hazards threat reveals about the social life of microblogged information. ACM CSCW 2010.
[14] Steyvers, M., and Griffiths, T. 2007. Probabilistic Topic Models. Lawrence Erlbaum Associates.
[15] Steyvers, M.; Griffiths, T. H.; and Smyth, P. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[16] Vieweg, S.; Hughes, A.; Starbird, K.; and Palen, L. 2010. Supporting situational awareness in emergencies using microblogged information. ACM Conference on Human Factors in Computing Systems 2010.
[17] Yardi, S.; Romero, D.; Schoenebeck, G.; and Boyd, D. 2010. Detecting spam in a twitter network. First Monday 15:1–4.
[18] Zhao, D., and Rosson, M. B. 2009. How and why people twitter: The role that microblogging plays in informal communication at work.
41
Questions?
Thank you!
Acknowledgements
Advisor, committee members and eBiquity members.