Identification and Characterization of Events in Social Media Hila Becker, Thesis Defense


Page 1: Identification and Characterization of Events in Social Media

Identification and Characterization of Events in Social Media

Hila Becker, Thesis Defense

Page 2: Identification and Characterization of Events in Social Media


Social Media is Changing the World

Lady Gaga, Justin Bieber, and Britney Spears have more Twitter followers than the entire populations of some countries (e.g., Israel, Greece)

YouTube is the second largest search engine in the world

Every minute, 24 hours of video are uploaded to YouTube

Over the past five years people uploaded 6,000,000,000 images to Flickr

Page 3: Identification and Characterization of Events in Social Media

Source: http://www.searchenginejournal.com/the-growth-of-social-media-an-infographic/

Page 4: Identification and Characterization of Events in Social Media

Event Content in Social Media


Page 5: Identification and Characterization of Events in Social Media


MIKE CLARKE/AFP/Getty Images

Page 6: Identification and Characterization of Events in Social Media


Source: Tweets from Tahrir, edited by Nadia Idle and Alex Nunns

Page 7: Identification and Characterization of Events in Social Media


Page 8: Identification and Characterization of Events in Social Media

Event Identification, Characterization, and Content Selection

- Identify events and their associated social media documents
  - In a timely manner
  - Across different social media sites
- Characterize events along different dimensions
- Select high-quality, relevant, useful event documents

Page 9: Identification and Characterization of Events in Social Media

Event Content in Social Media

Challenges:
- Wide variety of topics, not all related to events (e.g., personal status updates, everyday mundane conversations)
- Unconventional text: abbreviations, typos
- Large-scale, rapidly produced content

Opportunities:
- Content generated in real time, as events happen
- Rich context features (e.g., time, location)
- Users’ perspective

Page 10: Identification and Characterization of Events in Social Media

Event Content in Social Media

[Figure: related work arranged on two axes, Timeliness (Real-time vs. Retrospective) and Content Discovery (Known vs. Unknown)]

- Earthquake prediction using Twitter [Sakaki et al. WWW’10]
- Twitter new event detection [Petrović et al. NAACL’10]
- Event detection on Flickr [Chen and Roy CIKM’09]
- Organization of YouTube concert videos [Kennedy and Naaman WWW’09]

Page 11: Identification and Characterization of Events in Social Media

Event Content in Social Media

[Figure axis: Content Discovery (Known vs. Unknown)]

A Trending Event is a real-world occurrence described by:
- One or more terms and a time period
- A volume of messages posted for the terms in the time period that exceeds some expected level of activity

A Planned Event is a real-world occurrence with a corresponding published event record consisting of:
- A title, describing the subject of the event
- The time at which the event is planned to occur

Page 12: Identification and Characterization of Events in Social Media

Contributions

- Trend (and trending event) study, for characterizing and differentiating between different types of trends
- Online clustering framework with an event classification step for identifying trending events and their associated documents in social media
- Social media document similarity metric learning approaches
- Query formulation strategies for identifying social media documents for planned events
- Selection techniques for identifying high quality, relevant, and useful event content

Slide labels: Unknown/Known, Known, Unknown

Page 16: Identification and Characterization of Events in Social Media


Identification and Characterization of Events in Social Media

Characterization of trending events

Identification of trending events

Similarity metric learning for trending events

Identification of content for planned events

Selection of event content

Page 17: Identification and Characterization of Events in Social Media

What Types of Trends Exist in Social Media?

- Taxonomy of trends
- Characterization of each trend
  - Manually assigned categories
  - Automatically computed features
- Analysis of differences between trend types according to each characteristic

Trend types: Trending Events, Non-Event Trends

Page 18: Identification and Characterization of Events in Social Media

Trends

- Trend:
  - One or more terms and a time period
  - Volume of messages posted for the terms in the time period exceeds some expected level of activity
- May or may not reflect a real-world occurrence
- A trending event is a type of trend
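The trend definition above can be sketched in a few lines. This is a minimal illustration, not the detector used in the study: the slides only say "some expected level of activity", so taking the expected level to be the mean plus `k` standard deviations of earlier counts is an assumption, and `is_trending`, `k`, and the sample counts are hypothetical.

```python
from statistics import mean, stdev


def is_trending(history, current, k=3.0):
    """Return True if `current` exceeds the expected activity level.

    history: message counts for the same terms in earlier time periods.
    The expected level is assumed to be mean + k * stdev of the history;
    the slides do not specify how the expected level is computed.
    """
    if len(history) < 2:
        return False  # not enough past data to estimate an expected level
    expected = mean(history) + k * stdev(history)
    return current > expected


# Hypothetical hourly message counts for one set of terms.
hourly_counts = [12, 9, 15, 11, 10, 13]
print(is_trending(hourly_counts, 14))  # ordinary hour
print(is_trending(hourly_counts, 80))  # burst of messages
```

A higher `k` makes the check stricter, at the cost of missing smaller bursts.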

Page 19: Identification and Characterization of Events in Social Media

Twitter Content

- Streams of textual messages
- Brief content (140 characters)
- Communicated to network of followers
- Provide timely reflection of thoughts and interests

Page 20: Identification and Characterization of Events in Social Media

Characterizing Trends on Twitter

- Collect a set of Twitter trends
  - Burst detection
  - Twitter’s “trending topics”
- Qualitative analysis: trend taxonomy
- Quantitative analysis
  - Automatically compute features of each trend and corresponding messages
  - Manually label each trend according to categories introduced by the taxonomy
  - Identify differences between trend categories according to automatically computed features

Page 21: Identification and Characterization of Events in Social Media


Affinity Diagram Method

Page 22: Identification and Characterization of Events in Social Media


Endogenous vs. Exogenous Trends

Endogenous Trends: Twitter-centric activities that do not correspond to external events (e.g., a popular post by a celebrity)

Exogenous Trends: trending events that originated outside of the Twitter system (e.g., an earthquake)

Do exogenous and endogenous trends exhibit different characteristics?

Page 23: Identification and Characterization of Events in Social Media

Characterization of Trends and Trending Events

- Automatically computed features
  - Content Features
  - Interaction Features
  - Time-based Features
  - Participation Features
  - Social Network Features
- Compared differences between categories
  - Hypotheses guided by differences in categories according to feature types
  - Performed t-tests for significance analysis
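As a rough illustration of the significance analysis, the sketch below computes a two-sample t statistic for one feature across two trend categories. The slides do not say which t-test variant was used; Welch's unequal-variance form is an assumption here, and the feature values are made up for the example. A p-value would come from the t distribution with the returned degrees of freedom, typically via a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for samples a and b.

    Assumes each sample has at least two values and that the samples are not
    both constant (otherwise the denominator is zero).
    """
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-trend values of one feature (e.g., proportion of messages with URLs).
exogenous = [0.42, 0.38, 0.51, 0.47, 0.40]
endogenous = [0.18, 0.25, 0.21, 0.30, 0.16]
t_stat, dof = welch_t(exogenous, endogenous)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```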

Page 24: Identification and Characterization of Events in Social Media

Contributions of the Study

- Trends fall into two main categories: exogenous (i.e., trending event) and endogenous (i.e., platform-centric trend)
- There are significant differences between exogenous and endogenous trends
  - Proportion of messages with URLs
  - Unique hashtag in top 10% of messages
  - Proportion of retweets
  - Reciprocity

Page 25: Identification and Characterization of Events in Social Media


Identification and Characterization of Events in Social Media

Characterization of trending events

Identification of trending events

Similarity metric learning for trending events

Identification of content for planned events

Selection of event content

Page 26: Identification and Characterization of Events in Social Media

Identifying Trending Events

[Figure: Documents → Document Clusters → Event Clusters]

Page 27: Identification and Characterization of Events in Social Media

Identifying Trending Events in Real-Time

- Order documents by post time
- Use tf-idf vector representation of textual content
  - Stop word elimination
  - Stemming
  - idf computed over past data
- Separate tweets by location
  - Focus on tweets from NYC
  - Different locations can be processed in parallel
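The tf-idf representation described above might look roughly like this. It is a simplified sketch under several assumptions: the stop word list is a tiny illustrative one, stemming is omitted (a real pipeline would apply a stemmer), the raw-count tf and `log(N/df)` idf are just one common tf-idf variant, and the handling of terms unseen in past data (`unseen_idf`) is a hypothetical choice.

```python
import math
from collections import Counter

# Tiny illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def idf_from_past(past_docs):
    """Compute idf over previously seen documents only."""
    n = len(past_docs)
    doc_freq = Counter(term for doc in past_docs for term in set(tokenize(doc)))
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf, unseen_idf):
    """Sparse tf-idf vector; terms unseen in past data get `unseen_idf`."""
    counts = Counter(tokenize(text))
    return {term: tf * idf.get(term, unseen_idf) for term, tf in counts.items()}


past = ["earthquake in japan", "dog show in nyc", "japan toy fair"]
idf = idf_from_past(past)
vec = tfidf_vector("the earthquake japan japan", idf, unseen_idf=math.log(len(past)))
print(vec)
```

Computing idf from past data only keeps the representation usable in an online setting, where future documents are not yet available.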

Page 28: Identification and Characterization of Events in Social Media

Clustering Algorithm

- Many alternatives possible! [Berkhin 2002]
- Single-pass incremental clustering algorithm
  - Scalable, online solution
  - Using centroid representation
  - Used effectively for
    - Event identification in textual news [Allan et al. 1998]
    - News event detection on Twitter [Sankaranarayanan et al. 2009]
  - Does not require a priori knowledge of number of clusters
- Parameters:
  - Similarity Function σ
  - Threshold μ
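A single-pass incremental clustering loop with a centroid representation can be sketched as follows. This is an illustrative outline rather than the thesis implementation: the similarity function σ is assumed to be cosine similarity over sparse term vectors, the default threshold μ of 0.5 is arbitrary, and all class and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


class Cluster:
    def __init__(self, vec):
        self.size = 1
        self.centroid = dict(vec)

    def add(self, vec):
        """Update the centroid as a running average of member vectors."""
        self.size += 1
        for t in set(self.centroid) | set(vec):
            old = self.centroid.get(t, 0.0)
            self.centroid[t] = old + (vec.get(t, 0.0) - old) / self.size


def single_pass_cluster(vectors, sigma=cosine, mu=0.5):
    """Assign each vector, in arrival order, to its most similar cluster
    if sigma(vector, centroid) > mu; otherwise start a new cluster."""
    clusters, assignment = [], []
    for vec in vectors:
        best, best_sim = None, -1.0
        for i, c in enumerate(clusters):
            sim = sigma(vec, c.centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim > mu:
            clusters[best].add(vec)
        else:
            clusters.append(Cluster(vec))
            best = len(clusters) - 1
        assignment.append(best)
    return clusters, assignment


docs = [{"a": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}]
clusters, assignment = single_pass_cluster(docs)
print(assignment)
```

Each document is compared only against existing centroids, so the number of clusters never has to be fixed in advance.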

Page 29: Identification and Characterization of Events in Social Media

Overview of Cluster-based Approach

- Group similar documents via online clustering
- Compute statistics of cluster content
  - Top terms (e.g., [earthquake, japan])
  - Number of documents per hour
  - …
- Use cluster-level features to identify trending event clusters
  - Single feature with threshold (e.g., increase in volume over time-window [Petrović et al. 2010])
  - Trained classification model
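Two of the cluster statistics named above, top terms and number of documents per hour, could be computed along these lines. The input layout (a list of `(hour, tokens)` pairs per cluster) and the function name are assumptions made for the sketch.

```python
from collections import Counter


def cluster_stats(docs, top_n=2):
    """Return (top terms, documents per hour) for one cluster.

    docs: list of (hour, tokens) pairs for the documents in the cluster.
    """
    term_counts = Counter(tok for _, tokens in docs for tok in tokens)
    per_hour = Counter(hour for hour, _ in docs)
    top_terms = [term for term, _ in term_counts.most_common(top_n)]
    return top_terms, dict(per_hour)


# Hypothetical cluster content: posting hour and tokens of each document.
cluster_docs = [
    (10, ["earthquake", "japan"]),
    (10, ["earthquake", "tsunami"]),
    (11, ["japan", "earthquake"]),
]
top_terms, docs_per_hour = cluster_stats(cluster_docs)
print(top_terms, docs_per_hour)
```

Statistics like these can then serve as cluster-level features, either thresholded individually or fed to a trained classifier.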

Page 30: Identification and Characterization of Events in Social Media

Event Classification on Twitter

- Cluster-level features
  - Social interaction
  - Topic coherence
  - Trending behavior
  - Platform-centric
- Event classifier
  - Human-annotated training data
  - SVM model (selected during training phase)

Page 31: Identification and Characterization of Events in Social Media

Experimental Setup

- Classification accuracy
  - Baseline: Naïve Bayes text classification (NB-Text) [Sankaranarayanan et al. 2009]
  - 10-fold cross validation
  - Blind test set of randomly chosen tweets
- Event surfacing: select top event clusters per hour
  - Baselines
    - Fastest-growing clusters per hour (Fastest) [Petrović et al. 2010]
    - Randomly selected clusters per hour (Random)
  - 5 hours, top-20 clusters per hour

Page 32: Identification and Characterization of Events in Social Media

Identified Events

Description                         | Keywords
Senator Evan Bayh's Retirement      | bayh, evan, senate, congress, retire
Westminster Dog Show                | westminster, dog, show, club, kennel
Obama’s Meeting with the Dalai Lama | lama, dalai, meet, obama, china
NYC Toy Fair                        | toyfairny, starwars, hasbro, lego, toy
Marc Jacobs Fashion Show            | jacobs, marc, nyfw, show, fashion

A sample of events identified by our classifiers on the test set

Page 33: Identification and Characterization of Events in Social Media

Classification Performance (F-measure)

The RW-Event classifier is more effective at discriminating between real-world events and the rest of Twitter data

Classifier | Validation | Test
NB-Text    | 0.785      | 0.702
RW-Event   | 0.849      | 0.837

Page 34: Identification and Characterization of Events in Social Media

NDCG@K Evaluation

[Figure: NDCG (y-axis, 0 to 1) against Number of Clusters K (x-axis: 5, 10, 15, 20) for RW-Event, Fastest, and Random]

Performance of event classifier and baselines for event surfacing task.
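For readers unfamiliar with the metric, NDCG@K can be computed as in the sketch below. Several NDCG variants exist; this one uses a gain of `rel / log2(rank + 1)` and takes the ideal ordering from the same labeled list, both of which are assumptions, since the slides do not state the exact formulation.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain of the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))


def ndcg_at_k(rels, k):
    """DCG@k normalized by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for a ranked list of clusters: 1 = event, 0 = non-event.
ranked_labels = [1, 0, 1, 1, 0]
for k in (1, 3, 5):
    print(k, round(ndcg_at_k(ranked_labels, k), 3))
```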

Page 35: Identification and Characterization of Events in Social Media


Identification and Characterization of Events in Social Media

Characterization of trending events

Identification of trending events

Similarity metric learning for trending events

Identification of content for planned events

Selection of event content

Page 36: Identification and Characterization of Events in Social Media

Social Media Document Representation

- Title
- Description
- Tags
- Date/Time
- Location
- All-Text

Page 37: Identification and Characterization of Events in Social Media

Social Media Document Similarity

- Text: cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop word elimination?)
- Time: proximity in minutes
- Location: geo-coordinate proximity

Document features compared: Title, Description, Tags, Location, All-Text, Date/Time
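The time and location similarities could be sketched as below. The slides only name "proximity in minutes" and "geo-coordinate proximity", so the specific formulas are assumptions: time similarity is taken as one minus the time difference divided by a one-year window, and location similarity as one minus the haversine distance divided by the maximum great-circle distance. All names and constants here are illustrative.

```python
import math

MINUTES_PER_YEAR = 60 * 24 * 365              # assumed normalization window
EARTH_RADIUS_KM = 6371.0
MAX_DISTANCE_KM = math.pi * EARTH_RADIUS_KM   # half the Earth's circumference


def time_similarity(t1_min, t2_min, window_min=MINUTES_PER_YEAR):
    """1.0 for identical timestamps, falling linearly to 0.0 at `window_min` apart."""
    return max(0.0, 1.0 - abs(t1_min - t2_min) / window_min)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def location_similarity(p1, p2):
    """1.0 for the same point, approaching 0.0 for antipodal points."""
    return max(0.0, 1.0 - haversine_km(*p1, *p2) / MAX_DISTANCE_KM)


nyc, la = (40.7, -74.0), (34.05, -118.24)
print(time_similarity(0, 90))
print(location_similarity(nyc, la))
```

Scaling both measures to the range [0, 1] makes them comparable with cosine similarity when the scores are later combined.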

Page 39: Identification and Characterization of Events in Social Media

Cluster Representation and Parameter Tuning

- Centroid cluster representation
  - Average tf-idf scores
  - Average time
  - Geographic mid-point
- Parameter tuning in supervised training phase
- Clustering quality metrics to optimize:
  - Normalized Mutual Information (NMI) [Strehl et al. 2002]
  - B-Cubed [Amigó et al. 2008]
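Both clustering quality metrics can be computed from predicted cluster ids and gold event labels, as in this sketch. NMI has several normalizations; dividing by the average of the two entropies is an assumption here, as is reporting B-Cubed as the harmonic mean of its precision and recall.

```python
import math
from collections import Counter


def _entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts.values())


def nmi(pred, gold):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(pred)
    cp, cg, joint = Counter(pred), Counter(gold), Counter(zip(pred, gold))
    mi = sum(c / n * math.log(c * n / (cp[p] * cg[g])) for (p, g), c in joint.items())
    denom = (_entropy(cp, n) + _entropy(cg, n)) / 2
    return mi / denom if denom > 0 else 1.0


def b_cubed(pred, gold):
    """Harmonic mean of B-Cubed precision and recall, averaged over documents."""
    n = len(pred)
    cp, cg, joint = Counter(pred), Counter(gold), Counter(zip(pred, gold))
    precision = sum(c * c / cp[p] for (p, g), c in joint.items()) / n
    recall = sum(c * c / cg[g] for (p, g), c in joint.items()) / n
    return 2 * precision * recall / (precision + recall)


gold = ["a", "a", "b", "b"]
print(nmi([0, 0, 1, 1], gold), b_cubed([0, 0, 1, 1], gold))  # perfect clustering
print(nmi([0, 0, 0, 0], gold), b_cubed([0, 0, 0, 0], gold))  # everything merged
```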

Page 40: Identification and Characterization of Events in Social Media

Learning a Similarity Metric for Clustering

- Ensemble-based similarity
  - Training a cluster ensemble
  - Computing a similarity score by:
    - Combining individual partitions
    - Combining individual similarities
- Classification-based similarity
  - Training data sampling strategies
  - Modeling strategies

Page 41: Identification and Characterization of Events in Social Media

Overview of a Cluster Ensemble Algorithm

[Figure: clusterers Ctitle, Ctags, and Ctime and weights Wtitle, Wtags, and Wtime (learned in a training step) feed a consensus function f(C,W), which combines ensemble similarities to produce the ensemble clustering solution]

Page 42: Identification and Characterization of Events in Social Media

Overview of a Cluster Ensemble Algorithm: Combining Partitions

[Figure: clusterers Ctitle, Ctags, and Ctime with weights Wtitle, Wtags, and Wtime, combined by f(C,W)]

Page 43: Identification and Characterization of Events in Social Media

Overview of a Cluster Ensemble Algorithm: Combining Similarities

[Figure: weights Wtitle, Wtags, and Wtime feed the consensus function f(C,W)]

For each document d_i and cluster c_j:
- σ_Ctitle(d_i, c_j) > μ_Ctitle
- σ_Ctags(d_i, c_j) > μ_Ctags
- σ_Ctime(d_i, c_j) > μ_Ctime
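One way to picture a consensus function f(C,W) is the sketch below, which shows two hypothetical variants for a single document-cluster pair: a weighted average of the per-clusterer similarity scores, and a weighted vote over the per-clusterer threshold tests. The slides do not give the exact form of f(C,W), so both functions, the weights, and the thresholds are assumptions for illustration only.

```python
def combine_similarities(sims, weights):
    """Weighted average of per-clusterer similarity scores."""
    total = sum(weights.values())
    return sum(weights[name] * sims[name] for name in weights) / total


def combine_votes(sims, thresholds, weights):
    """Weighted fraction of clusterers whose similarity exceeds its own threshold."""
    total = sum(weights.values())
    return sum(weights[name] for name in weights if sims[name] > thresholds[name]) / total


# Hypothetical scores for one (document, cluster) pair.
sims = {"title": 0.8, "tags": 0.4, "time": 1.0}
weights = {"title": 2.0, "tags": 1.0, "time": 1.0}      # would be learned in training
thresholds = {"title": 0.5, "tags": 0.5, "time": 0.5}   # would be learned in training

print(combine_similarities(sims, weights))
print(combine_votes(sims, thresholds, weights))
```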

Page 45: Identification and Characterization of Events in Social Media

Classification-based Similarity Metrics

- Classify pairs of documents as similar/dissimilar
- Feature vector
  - Pairwise similarity scores
  - One feature per similarity metric (e.g., time-proximity, location-proximity, …)
- Modeling strategies
  - Document pairs
  - Document-centroid pairs
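The pairwise feature vector and a logistic scoring step might be sketched as follows. The two example metrics, the document layout, and the weights are all hypothetical; in the approach described above the model would be trained on labeled pairs rather than given hand-set weights.

```python
import math


def tag_jaccard(a, b):
    """Jaccard overlap of the two documents' tag sets."""
    union = a["tags"] | b["tags"]
    return len(a["tags"] & b["tags"]) / len(union) if union else 0.0


def time_proximity(a, b, window_min=1440):
    """1.0 for equal timestamps, 0.0 once they are `window_min` or more apart."""
    return 1.0 - min(abs(a["time"] - b["time"]), window_min) / window_min


def pair_features(a, b, metrics):
    """One feature per similarity metric for the pair (a, b)."""
    return [metric(a, b) for metric in metrics]


def predict_similarity(features, weights, bias):
    """Logistic model: probability that the pair belongs to the same event."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


doc_a = {"tags": {"nyfw", "fashion"}, "time": 100}
doc_b = {"tags": {"nyfw", "runway"}, "time": 160}
feats = pair_features(doc_a, doc_b, [tag_jaccard, time_proximity])
print(feats, predict_similarity(feats, weights=[2.0, 2.0], bias=-2.0))
```

The same feature construction applies to document-centroid pairs if the centroid is represented with the same fields as a document.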

Page 46: Identification and Characterization of Events in Social Media

Experiments: Alternative Similarity Metrics

- Ensemble-based techniques
  - Combining individual partitions (ENS-PART)
  - Combining individual similarities (ENS-SIM)
- Classification-based techniques
  - Modeling: document-document vs. document-centroid pairs
  - Logistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM)
- Baselines
  - Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity

Page 47: Identification and Characterization of Events in Social Media

Experimental Setup

Datasets:
- Upcoming
  - >270K Flickr photos
  - Event labels from the “upcoming” event database (upcoming:event=12345)
  - Split into 3 parts for training/validation/testing
- LastFM
  - >594K Flickr photos
  - Event labels from last.fm music catalog (lastfm:event=6789)
  - Used as an additional test set

Page 48: Identification and Characterization of Events in Social Media

Clustering Accuracy over Upcoming Test Set

- All similarity learning techniques outperform the baselines
- Classification-based techniques perform better than ensemble-based techniques

Algorithm | NMI    | B-Cubed
All-Text  | 0.9240 | 0.7697
Tags      | 0.9229 | 0.7676
ENS-PART  | 0.9296 | 0.7819
ENS-SIM   | 0.9322 | 0.7861
CLASS-SVM | 0.9425 | 0.8095
CLASS-LR  | 0.9444 | 0.8155

Page 49: Identification and Characterization of Events in Social Media

NMI: Clustering Accuracy over Both Test Sets

[Figure: NMI (y-axis) on the Upcoming and LastFM test sets]

Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data

Page 50: Identification and Characterization of Events in Social Media


Identification and Characterization of Events in Social Media

Characterization of trending events

Identification of trending events

Similarity metric learning for trending events

Identification of content for planned events

Selection of event content

Page 51: Identification and Characterization of Events in Social Media

Identifying Content for Planned Events

- Identify planned event documents given known event information
  - User-contributed planned event records: LastFM Events, EventBrite, Facebook Events
  - Structured features (e.g., title, time, location)
- Challenging identification scenario
  - Known event information is often inaccurate or incomplete
  - Social media documents are brief and noisy

Page 52: Identification and Characterization of Events in Social Media

Planned Event Record

- Title
- Description
- Date/Time
- Venue
- City

Page 53: Identification and Characterization of Events in Social Media

Approach for Known Identification Scenario

- Two-step query formulation strategy
  - Precision-oriented queries using known event features
  - Recall-oriented queries using retrieved content from precision-oriented queries
- Leverage cross-site content
  - Identify event documents on each site individually
  - Use event documents on one site to retrieve additional event documents on a different site

Page 54: Identification and Characterization of Events in Social Media

Query Formulation Strategies

- Precision-oriented Queries:
  - Combined event record features
  - Phrase, bag-of-words, stop word elimination
  - Examples: [“title”+“venue”], [title-no-stopwords+“city”]
- Recall-oriented Queries
  - Frequency Analysis
    - Frequent terms in the event’s retrieved content
    - Infrequently found in Web documents
  - Term Extraction
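The two example precision-oriented queries can be built from an event record roughly as follows. The record layout, the stop word list, and the sample event are hypothetical; quoted strings stand for phrase queries and unquoted terms for bag-of-words matching.

```python
# Tiny illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to"}


def precision_queries(event):
    """Build ["title"+"venue"] and [title-no-stopwords+"city"] style queries."""
    title, venue, city = event["title"], event["venue"], event["city"]
    title_no_stop = " ".join(w for w in title.split() if w.lower() not in STOP_WORDS)
    return [
        f'"{title}" "{venue}"',        # phrase title + phrase venue
        f'{title_no_stop} "{city}"',   # bag-of-words title without stop words + phrase city
    ]


# Hypothetical planned event record.
event = {"title": "Celebrate the Brooklyn Marathon", "venue": "Prospect Park", "city": "New York"}
for query in precision_queries(event):
    print(query)
```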

Page 55: Identification and Characterization of Events in Social Media

Leveraging Cross-Site Content

- Build precision-oriented queries using planned event features
- Use precision-oriented queries to retrieve data from:
  - Twitter
  - Flickr
  - YouTube
- Build recall-oriented queries using data from:
  - Each site individually
  - All sites collectively

[Figure: queries such as [title+city] and [title+venue] retrieve tweets (tweet_1, tweet_2, …, tweet_n), photos (photo_1, photo_2, …, photo_n), and videos (video_1, video_2, …, video_n)]

Page 56: Identification and Characterization of Events in Social Media

Experimental Settings

- 60 planned events from EventBrite, LastFM, LinkedIn, and Facebook
- Corresponding social media documents
  - Retrieved from Twitter, Flickr, and YouTube
  - Ranked according to similarity to event record
- Techniques
  - Precision: only precision-oriented queries
  - MS: precision- and recall-oriented queries selected using Microsoft n-gram probability score
  - RTR: precision- and recall-oriented queries selected using ratio of document frequency around the time of the event to document frequency in larger time window
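The RTR selection criterion can be illustrated with a small ranking sketch. The function names, the input layout (a pair of document frequencies per candidate query), and the counts are all hypothetical; only the ratio itself comes from the description above.

```python
def rtr_score(df_event_window, df_larger_window):
    """Ratio of document frequency near the event time to that in a larger window."""
    return df_event_window / df_larger_window if df_larger_window else 0.0


def rank_candidates(candidates):
    """Sort candidate queries by descending RTR score.

    candidates: dict mapping query -> (df around event time, df in larger window).
    """
    return sorted(candidates, key=lambda q: rtr_score(*candidates[q]), reverse=True)


# Hypothetical document frequencies for three candidate recall-oriented queries.
candidates = {"#nyfw": (40, 50), "fashion": (60, 600), "show": (30, 900)}
print(rank_candidates(candidates))
```

A term that appears mostly around the event time scores close to 1, while a term that is common at all times scores close to 0.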

Page 57: Identification and Characterization of Events in Social Media


NDCG Performance on Twitter

NDCG scores for top-k Twitter documents retrieved by Precision-oriented queries (Precision), and query strategies using Twitter data (Twitter-RTR, Twitter-MS).

Page 58: Identification and Characterization of Events in Social Media

Cross-Site NDCG Performance

NDCG scores for top-k YouTube documents retrieved by Precision-oriented queries (Precision), and query strategies using data from Twitter (Twitter-MS) and YouTube (YouTube-MS).

Page 59: Identification and Characterization of Events in Social Media


Identification and Characterization of Events in Social Media

Characterization of trending events

Identification of trending events

Similarity metric learning for trending events

Identification of content for planned events

Selection of event content

Page 60: Identification and Characterization of Events in Social Media

Event Content Selection

Event: Tiger Woods Apology

Example tweets:
- Tiger Woods to make a public apology Friday and talk about his future in golf.
- Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx
- Tiger woods y'all,tiger woods y'all,ah tiger woods y'all
- Tiger Woods Hugs: http://tinyurl.com/yhf4uzw
- Wedge wars upstage Watson v Woods: BBC Sport (blog)

Page 61: Identification and Characterization of Events in Social Media

Event Content Selection

- Challenges:
  - Document clusters contain noise
  - Relevant documents might have poor quality text
  - Relevant, high quality documents might not be interesting
- For each document and a given event, evaluate:
  - Quality
  - Relevance
  - Usefulness

Page 62: Identification and Characterization of Events in Social Media

Centrality Based Document Selection

- Centroid
  - Cosine similarity of each document to cluster centroid
- Degree
  - Documents are nodes
  - Documents are connected if their similarity is above a threshold
  - Compute degree centrality of each node
- LexRank [Erkan and Radev 2004]
  - Same graph structure as Degree method
  - Central documents are similar to other central documents
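The Centroid and Degree methods can be sketched over sparse term vectors as shown below. This is a simplified illustration: the vectors use hypothetical unit weights instead of tf-idf scores, the similarity threshold is arbitrary, and LexRank is not shown (it would run a PageRank-style iteration over the same similarity graph).

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid_scores(docs):
    """Score each document by its cosine similarity to the cluster centroid."""
    centroid = {}
    for doc in docs:
        for term, weight in doc.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(docs)
    return [cosine(doc, centroid) for doc in docs]


def degree_scores(docs, threshold=0.5):
    """Score each document by how many other documents it is connected to."""
    n = len(docs)
    return [
        sum(1 for j in range(n) if j != i and cosine(docs[i], docs[j]) > threshold)
        for i in range(n)
    ]


# Hypothetical cluster of four short documents with unit term weights.
docs = [
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0},
    {"tiger": 1.0, "woods": 1.0, "golf": 1.0},
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0, "golf": 1.0},
    {"bbc": 1.0, "sport": 1.0},
]
c_scores = centroid_scores(docs)
print(c_scores.index(max(c_scores)), degree_scores(docs))
```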

Page 63: Identification and Characterization of Events in Social Media

Experimental Methodology: Content Selection

- 50 event clusters
  - Randomly selected
  - 5 top tweets per event for each: Centroid, Degree, LexRank
- Labeled on a 1-4 scale
  - Quality: excellent (4) to poor (1)
  - Relevance: clearly relevant (4) to not relevant (1)
  - Usefulness: clearly useful (4) to not useful (1)

Page 64: Identification and Characterization of Events in Social Media

Content Selection Results

- Average scores over all events (out of 4)
- High quality and relevance (>3) for both Degree and Centroid
- Centroid is the only method with high usefulness

Method   | Quality | Relevance | Usefulness
LexRank  | 3.44    | 2.98      | 2.61
Degree   | 3.54    | 3.16      | 2.80
Centroid | 3.64    | 3.69      | 3.47

Page 65: Identification and Characterization of Events in Social Media

Conclusions

- Techniques for identifying, characterizing, and selecting social media content for events
  - There are significant differences between types of trends in social media, specifically trending events and non-event trends
  - Trending events and their associated social media documents can be effectively identified using online clustering with:
    - A classification step to separate event and non-event content
    - Social media document similarity metrics for documents with rich context features
  - A two-step query formulation technique is useful for identifying planned events across different social media sites
  - Centrality-based techniques can be used to select high quality, relevant, and useful social media event content

Page 66: Identification and Characterization of Events in Social Media

Future Work

- Clustering framework optimization
  - Blocking techniques
  - Topic models
- Identify unknown events with learned similarity metrics across sites
- Improve breadth of event content
- Rank events for search and presentation
  - Extension of content selection techniques
  - Learned ranking models

Page 67: Identification and Characterization of Events in Social Media

Publications

- Hila Becker, Dan Iter, Mor Naaman, Luis Gravano, “Identifying Content for Planned Events Across Social Media Sites,” under submission.
- Hila Becker, Mor Naaman, Luis Gravano, “Beyond Trending Topics: Real-World Event Identification on Twitter,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper.
- Hila Becker, Mor Naaman, Luis Gravano, “Selecting Quality Twitter Content for Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), short paper.
- Hila Becker, Feiyang Chen, Dan Iter, Mor Naaman, Luis Gravano, “Automatic Identification and Presentation of Twitter Content for Planned Events,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM’11), demo paper.
- Mor Naaman, Hila Becker, Luis Gravano, “Hip and Trendy: Characterizing Emerging Trends on Twitter,” in Journal of the American Society for Information Science and Technology.
- Hila Becker, Mor Naaman, Luis Gravano, “Learning Similarity Metrics for Event Identification in Social Media,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), 291-300.
- Hila Becker, Bai Xiao, Mor Naaman and Luis Gravano, “Exploiting Social Links for Event Identification in Social Media,” in Proceedings of the 3rd Annual Workshop on Search in Social Media (SSM '10), poster paper.
- Hila Becker, Mor Naaman, Luis Gravano, “Event Identification in Social Media,” in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009.

Page 68: Identification and Characterization of Events in Social Media


Thank You!

Page 69: Identification and Characterization of Events in Social Media
