Identification and Characterization of Events in Social Media
Hila Becker, Thesis Defense
Social Media is Changing the World
Lady Gaga, Justin Bieber, and Britney Spears have more Twitter followers than the entire populations of some countries (e.g., Israel, Greece)
YouTube is the second largest search engine in the world
Every minute, 24 hours of video are uploaded to YouTube
Over the past five years people uploaded 6,000,000,000 images to Flickr
Source: http://www.searchenginejournal.com/the-growth-of-social-media-an-infographic/
Event Content in Social Media
MIKE CLARKE/AFP/Getty Images
Source: Tweets from Tahrir, edited by Nadia Idle and Alex Nunns
Event Identification, Characterization, and Content Selection
Identify events and their associated social media documents
  In a timely manner
  Across different social media sites
Characterize events along different dimensions
Select high-quality, relevant, useful event documents
Event Content in Social Media
Challenges:
  Wide variety of topics, not all related to events (e.g., personal status updates, everyday mundane conversations)
  Unconventional text: abbreviations, typos
  Large-scale, rapidly produced content
Opportunities:
  Content generated in real time, as events happen
  Rich context features (e.g., time, location)
  Users’ perspective
Event Content in Social Media
[Figure: prior work arranged along two axes, Timeliness (Real-time, Retrospective) and Content Discovery (Known, Unknown)]
Earthquake prediction using Twitter [Sakaki et al. WWW’10]
Twitter new event detection [Petrović et al. NAACL’10]
Event detection on Flickr [Chen and Roy CIKM’09]
Organization of YouTube concert videos [Kennedy and Naaman WWW’09]
Event Content in Social Media
[Figure: Content Discovery axis (Known, Unknown)]
Trending Event is a real-world occurrence described by:
  One or more terms and a time period
  Volume of messages posted for the terms in the time period exceeds some expected level of activity
Planned Event is a real-world occurrence with corresponding published event record consisting of:
  Title, describing the subject of the event
  The time at which the event is planned to occur
Contributions
Trend (and trending event) study, for characterizing and differentiating between different types of trends
Online clustering framework with an event classification step for identifying trending events and their associated documents in social media
Social media document similarity metric learning approaches
Query formulation strategies for identifying social media documents for planned events
Selection techniques for identifying high quality, relevant, and useful event content
[Slide annotations: Unknown/Known, Known, Unknown]
Identification and Characterization of Events in Social Media
Characterization of trending events
Identification of trending events
Similarity metric learning for trending events
Identification of content for planned events
Selection of event content
What Types of Trends Exist in Social Media?
Taxonomy of trends
  Trending Events
  Non-Event Trends
Characterization of each trend
  Manually assigned categories
  Automatically computed features
Analysis of differences between trend types according to each characteristic
Trends
Trend:
  One or more terms and a time period
  Volume of messages posted for the terms in the time period exceeds some expected level of activity
  May or may not reflect a real-world occurrence
A trending event is a type of trend
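One way to make "exceeds some expected level of activity" concrete is to compare the message count in the current period against the mean and standard deviation of earlier periods. The sketch below is illustrative only: the function name `is_trending`, the hourly counts, and the cutoff of three standard deviations are assumptions, not the burst-detection method used in the study.

```python
from statistics import mean, stdev

def is_trending(history, current, num_std=3.0):
    """Return True if `current` exceeds the volume expected from `history`.

    history: message counts for the term(s) in earlier time periods
    current: message count in the period being tested
    num_std: hypothetical cutoff, in standard deviations above the mean
    """
    if len(history) < 2:
        return False  # not enough past data to estimate expected activity
    expected = mean(history)
    spread = stdev(history)
    return current > expected + num_std * spread

hourly_counts = [12, 15, 11, 14, 13, 12]   # made-up past volume
print(is_trending(hourly_counts, 90))       # a burst well above the baseline
print(is_trending(hourly_counts, 14))       # ordinary activity
```

A trend flagged this way may still be a non-event trend; deciding whether it reflects a real-world occurrence is a separate step.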
Twitter Content
Streams of textual messages
  Brief content (140 characters)
  Communicated to network of followers
  Provide timely reflection of thoughts and interests
Characterizing Trends on Twitter
Collect a set of Twitter trends
  Burst detection
  Twitter’s “trending topics”
Qualitative analysis: trend taxonomy
Quantitative analysis
  Automatically compute features of each trend and corresponding messages
  Manually label each trend according to categories introduced by the taxonomy
  Identify differences between trend categories according to automatically computed features
Affinity Diagram Method
Endogenous vs. Exogenous Trends
Endogenous Trends: Twitter-centric activities that do not correspond to external events (e.g., a popular post by a celebrity)
Exogenous Trends: trending events that originated outside of the Twitter system (e.g., an earthquake)
Do exogenous and endogenous trends exhibit different characteristics?
Characterization of Trends and Trending Events
Automatically computed features
  Content Features
  Interaction Features
  Time-based Features
  Participation Features
  Social Network Features
Compared differences between categories
  Hypotheses guided by differences in categories according to feature types
  Performed t-tests for significance analysis
Contributions of the Study
Trends fall into two main categories: exogenous (i.e., trending event) and endogenous (i.e., platform-centric trend)
There are significant differences between exogenous and endogenous trends
  Proportion of messages with URLs
  Unique hashtag in top 10% of messages
  Proportion of retweets
  Reciprocity
Identifying Trending Events
[Figure: Documents → Document Clusters → Event Clusters]
Identifying Trending Events in Real-Time
Order documents by post time
Use tf-idf vector representation of textual content
  Stop word elimination
  Stemming
  idf computed over past data
Separate tweets by location
  Focus on tweets from NYC
  Different locations can be processed in parallel
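The tf-idf representation with idf taken from past data can be sketched as follows. This is a minimal illustration under stated assumptions: the stop word list is a tiny placeholder, stemming is left out, and the add-one idf smoothing and unit-length normalization are choices made here, not details given on the slide.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "and"}  # placeholder list

def tokenize(text):
    """Lowercase, split into word-like tokens, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9#@]+", text.lower()) if t not in STOP_WORDS]

def compute_idf(past_documents):
    """Estimate idf from past data only, so new documents can be scored online."""
    n = len(past_documents)
    df = Counter()
    for doc in past_documents:
        df.update(set(tokenize(doc)))

    def idf(term):
        # Add-one smoothing keeps idf finite for terms unseen in past data.
        return math.log((n + 1) / (df[term] + 1)) + 1.0

    return idf

def tfidf_vector(text, idf):
    """Sparse tf-idf vector as a dict, normalized to unit length."""
    tf = Counter(tokenize(text))
    vec = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}

past = ["traffic in the city", "coffee in the morning", "city coffee shop"]
idf = compute_idf(past)
vec = tfidf_vector("Earthquake in the city", idf)
print(vec)  # "earthquake" is unseen in past data, so it outweighs "city"
```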
Clustering Algorithm
Many alternatives possible! [Berkhin 2002]
Single-pass incremental clustering algorithm
  Scalable, online solution
  Using centroid representation
Used effectively for
  Event identification in textual news [Allan et al. 1998]
  News event detection on Twitter [Sankaranarayanan et al. 2009]
Does not require a priori knowledge of number of clusters
Parameters:
  Similarity Function σ
  Threshold μ
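A single-pass incremental clustering loop with a similarity function σ and threshold μ might look like the sketch below. It assumes sparse vectors stored as dicts, cosine similarity as σ, and a new cluster whenever the best similarity falls below μ; the function names and the toy documents are illustrative, not taken from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass_cluster(documents, sigma=cosine, mu=0.5):
    """Assign each document, in arrival order, to its most similar cluster.

    A new cluster is created when no centroid has similarity >= mu.
    Returns a list of cluster ids, one per document.
    """
    centroids, sizes, assignments = [], [], []
    for doc in documents:
        best, best_sim = None, -1.0
        for cid, centroid in enumerate(centroids):
            sim = sigma(doc, centroid)
            if sim > best_sim:
                best, best_sim = cid, sim
        if best is None or best_sim < mu:
            centroids.append(dict(doc))
            sizes.append(1)
            assignments.append(len(centroids) - 1)
            continue
        # Update the centroid as the running average of its members.
        n = sizes[best]
        terms = set(centroids[best]) | set(doc)
        centroids[best] = {
            t: (centroids[best].get(t, 0.0) * n + doc.get(t, 0.0)) / (n + 1)
            for t in terms
        }
        sizes[best] = n + 1
        assignments.append(best)
    return assignments

docs = [
    {"earthquake": 1.0, "japan": 1.0},
    {"earthquake": 1.0, "tokyo": 0.5},
    {"fashion": 1.0, "show": 1.0},
    {"japan": 1.0, "earthquake": 0.8},
]
print(single_pass_cluster(docs, mu=0.5))
```

Because each document is compared only against current centroids, the number of clusters does not have to be known in advance.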
Overview of Cluster-based Approach
Group similar documents via online clustering
Compute statistics of cluster content
  Top terms (e.g., [earthquake, japan])
  Number of documents per hour
  …
Use cluster-level features to identify trending event clusters
  Single feature with threshold (e.g., increase in volume over time-window [Petrović et al. 2010])
  Trained classification model
Event Classification on Twitter
Cluster-level features
  Social interaction
  Topic coherence
  Trending behavior
  Platform-centric
Event classifier
  Human-annotated training data
  SVM model (selected during training phase)
Experimental Setup
Classification accuracy
  Baseline: Naïve Bayes text classification (NB-Text) [Sankaranarayanan et al. 2009]
  10-fold cross validation
  Blind test set of randomly chosen tweets
Event surfacing: select top event clusters per hour
  Baselines
    Fastest-growing clusters per hour (Fastest) [Petrović et al. 2010]
    Randomly selected clusters per hour (Random)
  5 hours, top-20 clusters per hour
Identified Events
Description                          Keywords
Senator Evan Bayh's Retirement       bayh, evan, senate, congress, retire
Westminster Dog Show                 westminster, dog, show, club, kennel
Obama’s Meeting with the Dalai Lama  lama, dalai, meet, obama, china
NYC Toy Fair                         toyfairny, starwars, hasbro, lego, toy
Marc Jacobs Fashion Show             jacobs, marc, nyfw, show, fashion
A sample of events identified by our classifiers on the test set
Classification Performance (F-measure)
The RW-Event classifier is more effective at discriminating between real-world events and the rest of Twitter data
Classifier  Validation  Test
NB-Text     0.785       0.702
RW-Event    0.849       0.837
NDCG@K Evaluation
[Figure: NDCG (y-axis, 0 to 1) versus Number of Clusters K (x-axis: 5, 10, 15, 20) for RW-Event, Fastest, and Random]
Performance of event classifier and baselines for event surfacing task.
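For reference, NDCG@K over a ranked list of clusters can be computed as in the sketch below. The gain and discount used here (raw relevance divided by log2 of the rank plus one) are one common variant and an assumption; the slide does not say which formulation was used, and the relevance judgments in the example are made up.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# 1 = cluster judged to be an event, 0 = not an event (hypothetical judgments).
ranked_clusters = [1, 0, 1, 1, 0]
print(round(ndcg_at_k(ranked_clusters, 5), 3))
```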
Social Media Document Representation
Title
Description
Tags
Date/Time
Location
All-Text
Social Media Document Similarity
Text: cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop word elimination?)
Time: proximity in minutes
Location: geo-coordinate proximity
[Figure: document features (Title, Description, Tags, Location, All-Text, Date/Time) and two groups of documents, A and B, placed on a time axis]
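The time and location similarities could be scored on a 0 to 1 scale as sketched below. The linear decay, the one-year normalization window, and the use of haversine distance scaled by half the Earth's circumference are assumptions made for illustration; the slide only states "proximity in minutes" and "geo-coordinate proximity".

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60      # assumed normalization window
EARTH_RADIUS_KM = 6371.0              # mean Earth radius
MAX_DISTANCE_KM = math.pi * EARTH_RADIUS_KM  # half the circumference

def time_similarity(t1_minutes, t2_minutes, window=MINUTES_PER_YEAR):
    """1 for identical timestamps, decreasing linearly to 0 at `window` minutes apart."""
    gap = abs(t1_minutes - t2_minutes)
    return 0.0 if gap > window else 1.0 - gap / window

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    # Clamp against floating-point drift before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def location_similarity(p, q):
    """1 for the same point, decreasing to 0 for antipodal points."""
    return max(0.0, 1.0 - haversine_km(p[0], p[1], q[0], q[1]) / MAX_DISTANCE_KM)

nyc = (40.7128, -74.0060)
la = (34.0522, -118.2437)
print(time_similarity(0, 60))        # one hour apart: very close to 1
print(location_similarity(nyc, la))  # roughly 3,900 km apart
```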
Cluster Representation and Parameter Tuning
Centroid cluster representation
  Average tf-idf scores
  Average time
  Geographic mid-point
Parameter tuning in supervised training phase
Clustering quality metrics to optimize:
  Normalized Mutual Information (NMI) [Strehl et al. 2002]
  B-Cubed [Amigó et al. 2008]
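Both clustering quality metrics can be computed from a list of predicted cluster ids and a list of true event labels, as in the sketch below. The NMI normalization by the average of the two entropies is one common variant and an assumption here; B-Cubed is averaged per document. The toy inputs are made up.

```python
import math
from collections import Counter

def nmi(clusters, labels):
    """Normalized mutual information: 2 * I(C; L) / (H(C) + H(L))."""
    n = len(labels)
    pc, pl = Counter(clusters), Counter(labels)
    joint = Counter(zip(clusters, labels))
    mi = sum(
        (nij / n) * math.log(n * nij / (pc[c] * pl[l]))
        for (c, l), nij in joint.items()
    )

    def entropy(counts):
        return -sum((v / n) * math.log(v / n) for v in counts.values())

    denom = entropy(pc) + entropy(pl)
    # Both partitions trivial (a single group each): treat as perfect agreement.
    return 2 * mi / denom if denom > 0 else 1.0

def b_cubed(clusters, labels):
    """Return B-Cubed (precision, recall, F1), each averaged over documents."""
    n = len(labels)
    pc, pl = Counter(clusters), Counter(labels)
    joint = Counter(zip(clusters, labels))
    # For each document: share of its cluster (or its event) that matches it.
    precision = sum(joint[(c, l)] / pc[c] for c, l in zip(clusters, labels)) / n
    recall = sum(joint[(c, l)] / pl[l] for c, l in zip(clusters, labels)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))      # perfect clustering
print(b_cubed([0, 0, 0, 0], ["a", "a", "b", "b"]))  # everything merged into one cluster
```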
Learning a Similarity Metric for Clustering
Ensemble-based similarity
  Training a cluster ensemble
  Computing a similarity score by:
    Combining individual partitions
    Combining individual similarities
Classification-based similarity
  Training data sampling strategies
  Modeling strategies
Overview of a Cluster Ensemble Algorithm
[Figure: clusterers Ctitle, Ctags, Ctime with weights Wtitle, Wtags, Wtime (learned in a training step) feed a consensus function f(C,W), which combines ensemble similarities into an ensemble clustering solution]
Overview of a Cluster Ensemble Algorithm: Combining Partitions
[Figure: partitions from clusterers Ctitle, Ctags, Ctime are combined by f(C,W) using weights Wtitle, Wtags, Wtime]
Overview of a Cluster Ensemble Algorithm: Combining Similarities
[Figure: for each document di and cluster cj, the tests σCtitle(di,cj)>μCtitle, σCtags(di,cj)>μCtags, and σCtime(di,cj)>μCtime are combined by f(C,W) using weights Wtitle, Wtags, Wtime]
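The "combining similarities" idea can be sketched as a weighted vote: each per-feature clusterer votes for a document-cluster pair when its similarity exceeds its own threshold, and the votes are weighted by the learned weights. The exact form of the consensus function f(C,W) is not given on the slide, so the normalized weighted vote below, along with all numbers, is an assumption.

```python
def ensemble_similarity(similarities, thresholds, weights):
    """Weighted vote of per-feature clusterers for one (document, cluster) pair.

    similarities: feature name -> sigma_C(d_i, c_j)
    thresholds:   feature name -> mu_C
    weights:      feature name -> W_C (assumed non-negative)
    """
    total = sum(weights.values())
    votes = sum(
        weights[name]
        for name, sim in similarities.items()
        if sim > thresholds[name]
    )
    return votes / total if total else 0.0

# Hypothetical values for a single document-cluster pair.
sims = {"title": 0.7, "tags": 0.2, "time": 0.9}
mus = {"title": 0.5, "tags": 0.4, "time": 0.8}
ws = {"title": 0.5, "tags": 0.3, "time": 0.2}
print(ensemble_similarity(sims, mus, ws))  # title and time vote yes, tags votes no
```

The resulting score can then be compared against a single threshold inside an incremental clustering loop, in place of a single-feature similarity.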
Classification-based Similarity Metrics
Classify pairs of documents as similar/dissimilar
Feature vector
  Pairwise similarity scores
  One feature per similarity metric (e.g., time-proximity, location-proximity, …)
Modeling strategies
  Document pairs
  Document-centroid pairs
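A classification-based similarity can be sketched with a small logistic regression trained on pairwise similarity features, where the predicted probability serves as the similarity score. The plain gradient-descent trainer, the learning rate, and the four training pairs below are all illustrative assumptions; they stand in for whatever learner and sampling strategy the experiments actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on log loss."""
    dim = len(features[0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(features, labels):
            err = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(features) for wi, g in zip(w, grad)]
    return w

def predict_similarity(w, x):
    """Probability that the pair described by x belongs to the same event."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# Each row: [text similarity, time proximity, location proximity] for one pair.
pairs = [[0.9, 0.95, 0.9], [0.8, 0.9, 0.7], [0.1, 0.2, 0.3], [0.2, 0.1, 0.1]]
same_event = [1, 1, 0, 0]
w = train_logistic(pairs, same_event)
print(predict_similarity(w, [0.85, 0.9, 0.8]) > 0.5)
```

The same feature layout works for document-centroid pairs: the second item in each pair is simply a cluster centroid instead of a document.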
Experiments: Alternative Similarity Metrics
Ensemble-based techniques
  Combining individual partitions (ENS-PART)
  Combining individual similarities (ENS-SIM)
Classification-based techniques
  Modeling: document-document vs. document-centroid pairs
  Logistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM)
Baselines
  Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity
Experimental Setup
Datasets:
  Upcoming
    >270K Flickr photos
    Event labels from the “upcoming” event database (upcoming:event=12345)
    Split into 3 parts for training/validation/testing
  LastFM
    >594K Flickr photos
    Event labels from last.fm music catalog (lastfm:event=6789)
    Used as an additional test set
Clustering Accuracy over Upcoming Test Set
All similarity learning techniques outperform the baselines
Classification-based techniques perform better than ensemble-based techniques
Algorithm   NMI     B-Cubed
All-Text    0.9240  0.7697
Tags        0.9229  0.7676
ENS-PART    0.9296  0.7819
ENS-SIM     0.9322  0.7861
CLASS-SVM   0.9425  0.8095
CLASS-LR    0.9444  0.8155
NMI: Clustering Accuracy over Both Test Sets
[Figure: NMI (y-axis) of each technique on the Upcoming and LastFM test sets]
Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data
Identifying Content for Planned Events
Identify planned event documents given known event information
  User-contributed planned event records
    LastFM Events
    EventBrite
    Facebook Events
  Structured features (e.g., title, time, location)
Challenging identification scenario
  Known event information is often inaccurate or incomplete
  Social media documents are brief and noisy
Planned Event Record
Title
Description
Date/Time
Venue
City
Approach for Known Identification Scenario
Two-step query formulation strategy
  Precision-oriented queries using known event features
  Recall-oriented queries using retrieved content from precision-oriented queries
Leverage cross-site content
  Identify event documents on each site individually
  Use event documents on one site to retrieve additional event documents on a different site
Query Formulation Strategies
Precision-oriented Queries:
  Combined event record features
  Phrase, bag-of-words, stop word elimination
  Examples: [“title”+“venue”], [title-no-stopwords+“city”]
Recall-oriented Queries
  Frequency Analysis
    Frequent terms in the event’s retrieved content
    Infrequently found in Web documents
  Term Extraction
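The two example precision-oriented queries can be built from an event record as sketched below. The record fields, the stop word list, and the query string syntax (double quotes for phrases) are assumptions for illustration; the event shown is invented.

```python
STOP_WORDS = {"the", "a", "an", "of", "at", "in", "and", "with"}  # placeholder list

def remove_stop_words(text):
    """Drop stop words while keeping the remaining words in order."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def precision_queries(event):
    """Build query strings from a planned event record (dict with title, venue, city)."""
    title, venue, city = event["title"], event["venue"], event["city"]
    return [
        f'"{title}" "{venue}"',                  # ["title"+"venue"]: both as phrases
        f'{remove_stop_words(title)} "{city}"',  # [title-no-stopwords+"city"]
    ]

# Hypothetical event record.
event = {"title": "The Celebration of Jazz", "venue": "Blue Room", "city": "New York"}
for query in precision_queries(event):
    print(query)
```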
Leveraging Cross-Site Content
Build precision-oriented queries using planned event features
Use precision-oriented queries to retrieve data from:
  Twitter
  Flickr
  YouTube
Build recall-oriented queries using data from:
  Each site individually
  All sites collectively
[Figure: queries such as [title+city], [title+venue], … retrieve tweet1 … tweetn, photo1 … photon, and video1 … videon]
Experimental Settings
60 planned events from EventBrite, LastFM, LinkedIn, and Facebook
Corresponding social media documents
  Retrieved from Twitter, Flickr, and YouTube
  Ranked according to similarity to event record
Techniques
  Precision: only precision-oriented queries
  MS: precision- and recall-oriented queries selected using Microsoft n-gram probability score
  RTR: precision- and recall-oriented queries selected using ratio of document frequency around the time of the event to document frequency in larger time window
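The RTR idea, scoring a candidate term by how much more often it appears around the event than in a larger time window, can be sketched as follows. Normalizing each document frequency by the number of documents in its window, and whitespace tokenization, are assumptions; the documents are invented.

```python
def rtr_score(term, docs_near_event, docs_larger_window):
    """Ratio of a term's document frequency near the event to that in a larger window."""
    def doc_freq(docs):
        # Fraction of documents in the window that contain the term.
        return sum(1 for d in docs if term in d.lower().split()) / len(docs) if docs else 0.0

    near, wide = doc_freq(docs_near_event), doc_freq(docs_larger_window)
    return near / wide if wide else 0.0

# Hypothetical documents; the larger window includes the near-event ones.
near = ["jazz night tonight", "great jazz set", "dinner plans"]
wide = near + ["monday again", "coffee time", "dinner with friends"]
print(rtr_score("jazz", near, wide))    # concentrated around the event
print(rtr_score("dinner", near, wide))  # equally common in both windows
```

Terms with a high ratio are better candidates for recall-oriented queries, since they are tied to the event's time span rather than to everyday chatter.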
NDCG Performance on Twitter
NDCG scores for top-k Twitter documents retrieved by Precision-oriented queries (Precision), and query strategies using Twitter data (Twitter-RTR, Twitter-MS).
Cross-Site NDCG Performance
NDCG scores for top-k YouTube documents retrieved by Precision-oriented queries (Precision), and query strategies using data from Twitter (Twitter-MS) and YouTube (YouTube-MS).
Event Content Selection
Tiger Woods Apology
  Tiger Woods to make a public apology Friday and talk about his future in golf.
  Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx
  Tiger woods y'all,tiger woods y'all,ah tiger woods y'all
  Tiger Woods Hugs: http://tinyurl.com/yhf4uzw
  Wedge wars upstage Watson v Woods: BBC Sport (blog)
Event Content Selection
Challenges:
  Document clusters contain noise
  Relevant documents might have poor quality text
  Relevant, high quality documents might not be interesting
For each document and a given event evaluate
  Quality
  Relevance
  Usefulness
Centrality Based Document Selection
Centroid
  Cosine similarity of each document to cluster centroid
Degree
  Documents are nodes
  Documents are connected if their similarity is above a threshold
  Compute degree centrality of each node
LexRank [Erkan and Radev 2004]
  Same graph structure as Degree method
  Central documents are similar to other central documents
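The Centroid and Degree methods can be sketched over sparse term vectors as shown below; LexRank is left out to keep the example short. The term weights, the similarity threshold of 0.3, and the four toy documents are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid_scores(docs):
    """Score each document by cosine similarity to the cluster centroid."""
    terms = {t for d in docs for t in d}
    centroid = {t: sum(d.get(t, 0.0) for d in docs) / len(docs) for t in terms}
    return [cosine(d, centroid) for d in docs]

def degree_scores(docs, threshold=0.3):
    """Score each document by how many other documents it is connected to."""
    return [
        sum(1 for j, other in enumerate(docs) if j != i and cosine(d, other) > threshold)
        for i, d in enumerate(docs)
    ]

cluster = [
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0},
    {"tiger": 1.0, "woods": 1.0, "golf": 1.0},
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0, "friday": 1.0},
    {"wedge": 1.0, "wars": 1.0, "woods": 1.0},
]
scores = centroid_scores(cluster)
print(scores.index(max(scores)))  # index of the document closest to the centroid
print(degree_scores(cluster))
```

Selecting the top-scoring documents under either method gives a short list of candidate messages to show for the event.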
Experimental Methodology: Content Selection
50 event clusters
  Randomly selected
5 top tweets per event for each: Centroid, Degree, LexRank
Labeled on a 1-4 scale
  Quality: excellent (4) to poor (1)
  Relevance: clearly relevant (4) to not relevant (1)
  Usefulness: clearly useful (4) to not useful (1)
Content Selection Results
Average scores over all events (out of 4)
High quality and relevance (>3) for both Degree and Centroid
Centroid is the only method with high usefulness
Method    Quality  Relevance  Usefulness
LexRank   3.44     2.98       2.61
Degree    3.54     3.16       2.80
Centroid  3.64     3.69       3.47
Conclusions
Techniques for identifying, characterizing, and selecting social media content for events
  There are significant differences between types of trends in social media, specifically trending events and non-event trends
  Trending events and their associated social media documents can be effectively identified using online clustering with:
    A classification step to separate event and non-event content
    Social media document similarity metrics for documents with rich context features
  A two-step query formulation technique is useful for identifying planned events across different social media sites
  Centrality-based techniques can be used to select high quality, relevant, and useful social media event content
Future Work
Clustering framework optimization
  Blocking techniques
  Topic models
Identify unknown events with learned similarity metrics across sites
Improve breadth of event content
Rank events for search and presentation
  Extension of content selection techniques
  Learned ranking models
Publications
Hila Becker, Dan Iter, Mor Naaman, Luis Gravano, “Identifying Content for Planned Events Across Social Media Sites,” under submission.
Hila Becker, Mor Naaman, Luis Gravano, “Beyond Trending Topics: Real-World Event Identification on Twitter,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper.
Hila Becker, Mor Naaman, Luis Gravano, “Selecting Quality Twitter Content for Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper.
Hila Becker, Feiyang Chen, Dan Iter, Mor Naaman, Luis Gravano, “Automatic Identification and Presentation of Twitter Content for Planned Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), demo paper.
Mor Naaman, Hila Becker, Luis Gravano, “Hip and Trendy: Characterizing Emerging Trends on Twitter,” in Journal of the American Society for Information Science and Technology.
Hila Becker, Mor Naaman, Luis Gravano, “Learning Similarity Metrics for Event Identification in Social Media,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), 291-300.
Hila Becker, Bai Xiao, Mor Naaman and Luis Gravano, “Exploiting Social Links for Event Identification in Social Media,” in Proceedings of the 3rd Annual Workshop on Search in Social Media (SSM '10), poster paper.
Hila Becker, Mor Naaman, Luis Gravano, “Event Identification in Social Media,” in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009.
Thank You!