Twinder: A Search Engine for Twitter Streams

Page 1: Twinder: A Search Engine for Twitter Streams

Twinder: A Search Engine for Twitter Streams

#ICWE2012 Berlin, Germany July 25th, 2012

Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben Web Information Systems, TU Delft

Page 2: Twinder: A Search Engine for Twitter Streams

Get information from Twitter

• Twitter is more like a news medium.

• How do people search Twitter?

•  Search on Twitter (Teevan et al.)

How do people use Twitter as a source of information?

                        Web Search   Twitter Search
Query length (chars)    18.80        12.00
Query length (words)    3.08         1.64
Is a celebrity name     3.11%        15.22%

Page 3: Twinder: A Search Engine for Twitter Streams

Research Questions

1. Given a topic, can we identify the relevant tweets based on the characteristics of the tweets?

2. Are semantics meaningful for determining the tweets’ relevance for a topic?

3. How can we design an architecture that is scalable?

What are the challenges we are facing?

Page 4: Twinder: A Search Engine for Twitter Streams

Search on Twitter

• Twitter search interface

• Ordered by time

• Keyword-based match

What can you do on Twitter today?

Page 5: Twinder: A Search Engine for Twitter Streams

Twinder = TWeet + fINDER

Our solution - Architecture

[Figure: Twinder architecture diagram. Labels recovered from the figure: users; Twinder Search Engine; Search User Interface; Relevance Estimation; Feature Extraction; Feature Extraction Task Broker; Index; Cloud Computing Infrastructure; Social Web Streams. Topic-sensitive features: Keyword-based Relevance, Semantics-based Relevance. Topic-insensitive features: Syntactical Features, Semantic Features, Contextual Features. Arrow labels: query, results, feedback, messages, feature extraction tasks.]

Page 6: Twinder: A Search Engine for Twitter Streams

Core Components of Twinder

Feature Extraction

• Receives Twitter messages from Social Web Streams

• Features of two categories:

•  (1) Topic-sensitive features

•  (2) Topic-insensitive features

• Different extraction strategies are designed for different features

Page 7: Twinder: A Search Engine for Twitter Streams

Core Components of Twinder

Feature Extraction Task Broker

• Twinder makes use of MapReduce and cloud computing infrastructures to allow for high scalability and frequent updates of its multifaceted index.

• The Feature Extraction Task Broker dispatches feature extraction tasks and indexing tasks to the cloud computing infrastructure.
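The map/reduce split behind such indexing tasks can be sketched as a map step that emits terms per tweet and a reduce step that merges them into posting lists. This is a minimal single-process sketch in plain Python, not Twinder's implementation: the names `map_tweet`, `reduce_postings` and `build_index`, the `(tweet_id, text)` record and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Tweet = Tuple[str, str]  # (tweet_id, text); a hypothetical minimal record


def map_tweet(tweet: Tweet) -> Iterator[Tuple[str, str]]:
    """Map step: emit one (term, tweet_id) pair per distinct term."""
    tweet_id, text = tweet
    for term in sorted(set(text.lower().split())):
        yield term, tweet_id


def reduce_postings(term: str, tweet_ids: Iterable[str]) -> Tuple[str, List[str]]:
    """Reduce step: merge all tweet ids seen for one term into a posting list."""
    return term, sorted(set(tweet_ids))


def build_index(tweets: Iterable[Tweet]) -> Dict[str, List[str]]:
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for tweet in tweets:
        for term, tweet_id in map_tweet(tweet):
            grouped[term].append(tweet_id)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())


index = build_index([("t1", "twinder search engine"), ("t2", "twitter search")])
print(index["search"])  # posting list for the term "search"
```

In a real MapReduce deployment the grouping inside `build_index` would be done by the framework's shuffle phase, and the map and reduce functions would run on separate workers.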

Page 8: Twinder: A Search Engine for Twitter Streams

Core Components of Twinder

Relevance Estimation

• Accepts search queries from the front end and passes them to the Feature Extraction component.

• Tweets are classified as relevant or non-relevant by the Relevance Estimation component and are then delivered to the front end for rendering.

• Twinder can learn the classification model, initially from a training dataset and later from usage data.

Page 9: Twinder: A Search Engine for Twitter Streams

Efficiency of Indexing

How well does Twinder make use of cloud computing infrastructure?

Corpus size      Mainstream server   EMR (10 instances)
100k (13 MB)     0.4 min             5 min
1m (122 MB)      5 min               8 min
10m (1.3 GB)     48 min              19 min
32m (3.9 GB)     283 min             47 min
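The indexing times above can be turned into speedup ratios with a few lines of arithmetic. The sketch below only reuses the numbers from the table; the helper name `speedup` and the dictionary layout are assumptions for illustration.

```python
# Indexing times in minutes, copied from the table above:
# corpus size -> (mainstream server, EMR with 10 instances).
timings = {
    "100k": (0.4, 5),
    "1m": (5, 8),
    "10m": (48, 19),
    "32m": (283, 47),
}


def speedup(single_server_min: float, emr_min: float) -> float:
    """Ratio above 1 means the EMR cluster finished faster."""
    return single_server_min / emr_min


for corpus, (server, emr) in timings.items():
    print(f"{corpus}: speedup {speedup(server, emr):.2f}x")

# First corpus size at which the cluster beats the single server.
crossover = next(c for c, (s, e) in timings.items() if speedup(s, e) > 1)
```

On these numbers the cluster only pays off from the 10m corpus upward, presumably because fixed job startup cost dominates on the small corpora.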

Page 10: Twinder: A Search Engine for Twitter Streams

Features of Microposts

Topic-sensitive            Topic-insensitive
Keyword-based relevance    ?

We already have keyword-based relevance, and…?

Hypothesis H1: The greater the keyword-based relevance score, the more relevant and interesting the tweet is to the topic.
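The slides do not say which retrieval function produces the keyword-based relevance score. As a minimal sketch, assuming a plain TF-IDF sum over matching query terms (the function name `keyword_relevance`, the smoothing and the toy corpus are all assumptions, not Twinder's scoring):

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def keyword_relevance(query: str, tweet: str, corpus: List[str]) -> float:
    """Sum of TF-IDF weights of the query terms that occur in the tweet."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(tokenize(doc)))
    term_freq = Counter(tokenize(tweet))
    score = 0.0
    for term in set(tokenize(query)):
        if term_freq[term]:
            # Smoothed IDF so that terms present in every document still count.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            score += term_freq[term] * idf
    return score


# Invented example tweets for illustration.
corpus = [
    "hu jintao visits the united states",
    "great coffee this morning",
    "united states welcomes hu jintao",
]
scores = [keyword_relevance("hu jintao visit", t, corpus) for t in corpus]
```

Under H1, a tweet with a higher score would be expected to be more relevant; here the coffee tweet scores zero because it shares no term with the query.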

Page 11: Twinder: A Search Engine for Twitter Streams

Semantic-based relevance

Expand the queries to match more tweets

dbp:Hu_Jintao

dbp:United_States

The reformulated query is expected to yield a more accurate retrieval score.

Hypothesis H2: The greater the semantic-based relevance score, the more relevant and interesting the tweet is.
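One way to picture the query expansion is to add the labels of entities recognized in the query to its term list. The sketch below is an illustration under stated assumptions: the lookup tables `ENTITY_LINKS` and `ENTITY_LABELS`, the alternative labels in them and the substring matching are hypothetical stand-ins for a real entity linking service.

```python
from typing import Dict, List

# Hypothetical lookup from surface forms to DBpedia entities.
ENTITY_LINKS: Dict[str, str] = {
    "hu jintao": "dbp:Hu_Jintao",
    "united states": "dbp:United_States",
}
# Hypothetical alternative labels per entity (invented for this sketch).
ENTITY_LABELS: Dict[str, List[str]] = {
    "dbp:Hu_Jintao": ["hu jintao", "president hu"],
    "dbp:United_States": ["united states", "usa", "us"],
}


def expand_query(query: str) -> List[str]:
    """Return the original query terms plus label words of entities found in it."""
    text = query.lower()
    terms = text.split()
    for surface, entity in ENTITY_LINKS.items():
        if surface in text:
            for label in ENTITY_LABELS[entity]:
                for word in label.split():
                    if word not in terms:
                        terms.append(word)
    return terms


expanded = expand_query("Hu Jintao visit to the United States")
```

The expanded term list can then be scored with the same retrieval function as the original query, so tweets that say "USA" instead of "United States" can still match.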

Page 12: Twinder: A Search Engine for Twitter Streams

Semantic-based relatedness

Is there a semantic overlap between the query and the tweet?

dbp:Hu_Jintao

dbp:the_United_States

Hypothesis H3: If a tweet is considered to be semantically related to the query then it is also relevant and interesting for the user.
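Semantic overlap can be read as a set intersection between the entities of the query and those of the tweet. A minimal sketch, assuming a binary feature (the function name and the entity `dbp:China` in the example are assumptions):

```python
from typing import Set


def semantic_relatedness(query_entities: Set[str], tweet_entities: Set[str]) -> int:
    """Return 1 if the query and the tweet share at least one entity, else 0."""
    return int(bool(query_entities & tweet_entities))


query_entities = {"dbp:Hu_Jintao", "dbp:the_United_States"}
related = semantic_relatedness(query_entities, {"dbp:Hu_Jintao", "dbp:China"})
unrelated = semantic_relatedness(query_entities, {"dbp:Lyon"})
```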

Page 13: Twinder: A Search Engine for Twitter Streams

Overview of features

Topic-sensitive     Topic-insensitive
Keyword-based       ?
Semantic-based

What do we have now?

Page 14: Twinder: A Search Engine for Twitter Streams

Syntactical feature: Hashtag

Is a tweet more relevant if it contains a #hashtag?

Hypothesis H4: Tweets that contain hashtags are more likely to be relevant than tweets that do not contain hashtags.

Page 15: Twinder: A Search Engine for Twitter Streams

Syntactical feature: hasURL

Is a tweet that contains a URL more relevant?

Hypothesis H5: Tweets that contain a URL are more likely to be relevant than tweets that do not contain a URL.

Page 16: Twinder: A Search Engine for Twitter Streams

Syntactical feature: isReply

Is a tweet that is a reply to @somebody more relevant?

Hypothesis H6: Tweets that are formulated as a reply to another tweet are less likely to be relevant than other tweets.

Page 17: Twinder: A Search Engine for Twitter Streams

Syntactical feature: length

Does the length of a tweet influence its relevance for a topic?

Hypothesis H7: The longer a tweet, the more likely it is to be relevant and interesting.
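The four syntactical features (hashtag, URL, reply, length) can all be read off the tweet text. A minimal sketch, assuming simple regular expressions and assuming that a reply is a tweet starting with an @mention; the function name `syntactic_features` and the example tweets are illustrative only.

```python
import re
from typing import Dict, Union

HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")


def syntactic_features(text: str) -> Dict[str, Union[bool, int]]:
    """Extract the four syntactical features named on the slides."""
    return {
        "hasHashtag": bool(HASHTAG.search(text)),
        "hasURL": bool(URL.search(text)),
        # Assumption: a reply is a tweet that starts with an @mention.
        "isReply": text.lstrip().startswith("@"),
        # Length measured in characters.
        "length": len(text),
    }


features = syntactic_features("@somebody slides are online http://ktao.nl/ #ICWE2012")
plain = syntactic_features("plain text")
```

Because these features depend only on the tweet, they can be computed once at indexing time rather than per query.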

Page 18: Twinder: A Search Engine for Twitter Streams

Overview of features

Topic-sensitive     Topic-insensitive
Keyword-based       Syntactical features
Semantic-based      ?

Short summary

Are there further features that allow for estimating the relevance?

Page 19: Twinder: A Search Engine for Twitter Streams

Semantic features

Find semantics in a tweet to estimate the relevance

dbp:Tim_Berners-Lee dbp:World_Wide_Web

dbp:France

dbp:Lyon

dbp:International_World_Wide_Web_Conference

Page 20: Twinder: A Search Engine for Twitter Streams

Semantic features: #entity

Is a tweet with more entities more interesting?

• 5 entities extracted.

Hypothesis H8: The more entities a tweet mentions, the more likely it is to be relevant and interesting.

Page 21: Twinder: A Search Engine for Twitter Streams

Semantic features: diversity

How many types are there in the entities?

• 4 types of entities

Hypothesis H9: The greater the diversity of concepts mentioned in a tweet, the more likely it is to be interesting and relevant.

Page 22: Twinder: A Search Engine for Twitter Streams

Semantic features: sentiment

Was the author of the tweet happy or not?

• Sentiment: Neutral

Hypothesis H10: The likelihood of a tweet's relevance is influenced by its sentiment polarity.
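The three semantic features (#entities, diversity, sentiment) can be derived from a list of typed entity annotations plus a polarity value. In the sketch below the entities come from the example tweet on the slides, while the type labels, the integer sentiment encoding and the function name `semantic_features` are assumptions chosen so that the example matches the "5 entities, 4 types, neutral" values shown.

```python
from typing import Dict, List, Tuple

# (entity, type) pairs. The entities come from the example tweet on the
# slides; the type labels are assumptions made for this sketch.
annotations: List[Tuple[str, str]] = [
    ("dbp:Tim_Berners-Lee", "Person"),
    ("dbp:World_Wide_Web", "Technology"),
    ("dbp:France", "Location"),
    ("dbp:Lyon", "Location"),
    ("dbp:International_World_Wide_Web_Conference", "Event"),
]


def semantic_features(entities: List[Tuple[str, str]], sentiment: int) -> Dict[str, int]:
    """Count entities and distinct entity types; pass sentiment polarity through.

    sentiment is assumed to be -1 (negative), 0 (neutral) or 1 (positive).
    """
    return {
        "#entities": len({name for name, _ in entities}),
        "diversity": len({kind for _, kind in entities}),
        "sentiment": sentiment,
    }


features = semantic_features(annotations, sentiment=0)
```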

Page 23: Twinder: A Search Engine for Twitter Streams

Overview of features

Topic-sensitive     Topic-insensitive
Keyword-based       Syntactical
Semantic-based      Semantics

By now, we have 4 types of features.

Can we utilize the contextual information of tweets?

Page 24: Twinder: A Search Engine for Twitter Streams

Contextual features

Does the number of followers influence relevance?

Hypothesis H11: The higher the number of followers the creator of a message has, the more likely it is that her tweets are relevant.

Page 25: Twinder: A Search Engine for Twitter Streams

Contextual features

Or the number of lists in which the author appears?

Hypothesis H12: The higher the number of lists in which the creator of a message appears, the more likely it is that her tweets are relevant.

Page 26: Twinder: A Search Engine for Twitter Streams

Contextual features

How long has the author been on Twitter?

Hypothesis H13: The older the Twitter account of a user, the more likely it is that her tweets are relevant.

Signed up: July 2008

Post: June 2012
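The three contextual features describe the author rather than the tweet. A minimal sketch, assuming account age is measured in whole months at posting time; the `Author` record, the function name and the follower and list counts are hypothetical, while the two dates follow the slide example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict


@dataclass
class Author:
    followers: int   # number of followers
    listed: int      # number of lists the author appears in
    signed_up: date  # account creation date


def contextual_features(author: Author, posted: date) -> Dict[str, int]:
    """Follower count, list count and account age in whole months at posting time."""
    age_months = (posted.year - author.signed_up.year) * 12 + (
        posted.month - author.signed_up.month
    )
    return {"#followers": author.followers, "#lists": author.listed, "age": age_months}


# Dates follow the slide example (signed up July 2008, post in June 2012);
# the follower and list counts are made-up values.
author = Author(followers=250, listed=12, signed_up=date(2008, 7, 1))
features = contextual_features(author, posted=date(2012, 6, 1))
```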

Page 27: Twinder: A Search Engine for Twitter Streams

Summary of Features

Topic-sensitive     Topic-insensitive
Keyword-based       Syntactical
Semantic-based      Semantics
                    Contextual

The features

Page 28: Twinder: A Search Engine for Twitter Streams

Analysis

• Research Questions:

1.  Which features are more influential in predicting the relatedness of a tweet to a certain topic?

2.  Which types of features are more important? Are semantics meaningful?

3.  What performance can we achieve by utilizing these features?

• Twinder Setup

•  Consider the search problem as a classification task

•  Classification algorithm = Logistic Regression
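Treating search as classification means each (topic, tweet) pair becomes a feature vector with a relevant/non-relevant label. The following is a minimal sketch of logistic regression trained with batch gradient descent in plain Python; it is not the toolkit used for the experiments, and the two-feature toy data, learning rate and epoch count are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.5, epochs: int = 2000
) -> List[float]:
    """Fit weights (bias first) with batch gradient descent on the log loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, label in zip(X, y):
            row = [1.0, *features]
            error = sigmoid(sum(w * v for w, v in zip(weights, row))) - label
            for i, v in enumerate(row):
                grad[i] += error * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def is_relevant(weights: Sequence[float], features: Sequence[float]) -> bool:
    """Classify a (topic, tweet) feature vector as relevant if P(relevant) >= 0.5."""
    row = [1.0, *features]
    return sigmoid(sum(w * v for w, v in zip(weights, row))) >= 0.5


# Toy data: [keyword relevance score, hasURL]; the labels are invented.
X = [[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 0], [0.1, 1], [0.0, 0]]
y = [1, 1, 1, 0, 0, 0]
weights = train_logistic_regression(X, y)
```

The learned weights (after the bias term) play the role of per-feature weights: a positive weight pushes a tweet toward "relevant", a negative one away from it.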

Page 29: Twinder: A Search Engine for Twitter Streams

Dataset

• Twitter corpus

•  16 million tweets (Jan. 24th – Feb. 8th, 2011)

•  4,766,901 tweets classified as English

•  6.2 million entity extractions (140k distinct entities)

• Relevance judgments

•  49 topics

•  40,855 (topic, tweet) pairs

•  60.31 relevant tweets per topic (on average)

From TREC 2011 Microblog Track

Page 30: Twinder: A Search Engine for Twitter Streams

Results

Which type of features matters?

Features             Precision   Recall   F-measure
keyword relevance    0.3036      0.2851   0.2940
semantic relevance   0.3050      0.3294   0.3167
topic-sensitive      0.3135      0.3252   0.3192
topic-insensitive    0.1956      0.0064   0.0123
without semantics    0.3363      0.4618   0.3965
without sentiment    0.3701      0.3923   0.4048
without context      0.3827      0.4714   0.4225
all features         0.3674      0.4736   0.4138

Overall, we can achieve precision and recall of over 35% and 45%, respectively, by applying all the features.
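Assuming the F-measure reported here is the balanced F1, i.e. the harmonic mean of precision and recall, it can be recomputed from the other two columns. The sketch below does so for the "all features" row; the function name `f_measure` is an assumption.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the "all features" row in the table above.
all_features_f = f_measure(0.3674, 0.4736)
print(f"{all_features_f:.4f}")
```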

Page 31: Twinder: A Search Engine for Twitter Streams

Weights of features

Which feature matters?

[Figure: charts of feature weights, one panel per feature group, y-axis from -1 to 2. Syntactical: hasHashtag, hasURL, isReply, length. Keyword-based: keyword-based relevance. Semantic-based: relevance, relatedness. Semantics: #entities, diversity, sentiment. Contextual: #followers, #lists, age.]

Page 32: Twinder: A Search Engine for Twitter Streams

Topics of different categories

The impact on the performance and models

• 49 topics split into 2 groups along each of 3 dimensions:

•  Popularity

•  Global vs. local

•  Temporal persistence

• Popularity

•  Higher recall for popular topics

•  Less impact from sentiment features on unpopular topics

• Temporal persistence

•  Higher performance on shorter-term topics

•  Less impact from sentiment features on persistent topics

Page 33: Twinder: A Search Engine for Twitter Streams

Conclusions

What are our contributions?

1.  Twinder search engine proposed: it analyzes various features to determine the relevance and interestingness of Twitter messages for a given topic.

2.  Scalability demonstrated for the Twinder search engine.

3.  Extensive analysis of 13 features along two dimensions: topic-sensitive features and topic-insensitive features.

Page 34: Twinder: A Search Engine for Twitter Streams

Conclusions

The lessons learned

1.  Learned models that take advantage of semantics and topic-sensitive features outperform those that do not.

2.  Contextual features that characterize the users who are posting the messages have little impact on the relevance estimation.

3.  The importance of a feature differs depending on the topic characteristics; for example, the sentiment-based features are more important for popular than for unpopular topics.

Page 35: Twinder: A Search Engine for Twitter Streams

THANK YOU!

July 25th, 2012 [email protected] http://ktao.nl/

QUESTIONS?