People Search @ Study Group MSRA NLC


People Search, TwitterRank, and Finding Trendsetters in Twitter

Beijing, September, 2012

MSRA NLC Study Group

Yi Lu, Jie Liu

2

• Input: a query specifying an expertise topic, such as database systems or software engineering.

• Output: a list of people ranked by topic relevance.

Background: People Search

3

• Input:

• Output:

An Illustrative Example

4

• A student looks for a machine learning supervisor

• A patient looks for doctors who have many successful cases on his disease

• A historian looks for people who have expertise on Maya culture

• A CTO looks for engineers who have related skills

• …

Scenarios of People Search

5

• Identify opinion leaders and experts
• Advertisement
• Turn to somebody for help
• Select a team for a specific task
• Many challenges remain.

Motivation

6

• Wisdom of the Crowd
– Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
– Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi

• Tweets and Link Relation
– TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
– Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He

• #Hashtag and Link Relation
– Finding Trendsetters in Information Networks (SIGKDD 2012)
– Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

Outline

P83-7

Cognos: Crowdsourcing Search for Topic Experts in Microblogs

Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto,

Niloy Ganguly, Krishna P. Gummadi

8

• Twitter is now an important source of current news
– 500 million users post 400 million tweets daily

• The quality of tweets posted by different users varies widely
– News, pointless babble, conversational tweets, spam, …

• Challenge: find topic experts
– Sources of authoritative information on specific topics

Topic experts in Twitter

9

• Existing approaches
– Research studies: Pal [WSDM 11], Weng [WSDM 10]
– Application systems: Twitter Who-To-Follow, Wefollow, …

• Existing approaches primarily rely on information provided by the user herself
– Bio, contents of tweets, network features, e.g. #followers

• We rely on the “wisdom of the Twitter crowd”
– How do others describe a user?

Identifying topic experts in Twitter

10

• Challenges in designing a search system for topic experts in Twitter
– How to infer the topics of expertise of an individual Twitter user?
– How to rank the relative expertise of users identified as experts on a topic?

Challenges

11

HOW TO INFER TOPICS OF EXPERTISE OF TWITTER USERS?

Challenge #1

12

13

• A feature to organize tweets received from the people whom a user is following

• Create a List, add name & description, add Twitter users to the list

• Tweets from all listed users will be available as a separate List stream

Twitter Lists

14

• Collect Lists containing a given user U

• Identify U’s topics from List meta-data
– Basic NLP techniques
– Extract nouns and adjectives

• Extracted words are collected to obtain a topic document for the user

[movies tv hollywood stars entertainment celebrity hollywood …]

Mining Lists to infer expertise
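The List-mining step above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the authors extract nouns and adjectives with basic NLP, while this sketch only tokenizes List names and descriptions and drops short words and stopwords (the stopword list and sample Lists are invented):

```python
import re
from collections import Counter

# Tiny stand-in stopword list; a real system would use a full one.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "to", "on"}

def topic_document(list_metadata):
    """Build a topic document for a user from the names and
    descriptions of the Lists that include them. The paper extracts
    nouns and adjectives; this sketch simply tokenizes, lowercases,
    and filters stopwords and words shorter than 3 characters."""
    words = []
    for name, description in list_metadata:
        for token in re.findall(r"[a-z]+", (name + " " + description).lower()):
            if len(token) >= 3 and token not in STOPWORDS:
                words.append(token)
    return Counter(words)

lists = [
    ("Movies & TV", "hollywood stars and entertainment"),
    ("Celebrity news", "hollywood celebrity gossip"),
]
doc = topic_document(lists)
print(doc.most_common(3))
```

Repeated words across many Lists (here "hollywood", "celebrity") dominate the topic document, which is what makes the crowd signal robust.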

15

• Collected Lists of 55 million Twitter users who joined before or in 2009

• All analyses consider 1.3 million users who are included in 10 or more Lists

Dataset

16

linux, tech, open, software, libre, gnu, computer, developer, ubuntu, unix

politics, senator, congress, government, republicans, Iowa, gop, conservative

politics, senate, government, congress, democrats, Missouri, progressive, women

Topics extracted from Lists

17

Most common words from tweets:
love, daily, people, time, GUI, movie, video, life, happy, game, cool

Most common words from Lists:
celeb, actor, famous, movie, stars, comedy, music, Hollywood, pop culture

Profile bio

Lists vs. other features

18

Most common words from tweets:
Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono

Most common words from Lists:
celeb, funny, humor, music, movies, laugh, comics, television, entertainers

Profile bio

Lists vs. other features

19

• Top 20 WTF results for 200 queries: 3,495 users

• Do the results returned by Cognos cover the results returned by Twitter WTF?

• For 83.4% of the users, yes

• For the remaining 16.6%, manual inspection of a random sample shows two major cases

Cognos vs. Twitter Who-To-Follow

20

Case 1 – topics inferred from Lists include semantically similar words, but not the exact query-word

We can find Twitterer dineLA in Twitter if the query is “dining”
Topics from Lists – food, restaurant, recipes, Los Angeles

We can find space explorer HubbleHugger77 in Twitter if the query is “hubble”
Topics from Lists – science, tech, space, cosmology, NASA

More than one way to express an idea

21

Case 2 – results returned by Twitter are unrelated to the query

Twitter returns comedian jimmyfallon if the query is “astrophysicist”
Topics from Lists – celebs, comedy, humor, actor

22

• List-based method provides accurate & comprehensive inference of topics of expertise of Twitter users

• In many cases, more accurate than existing approaches that utilize profile information or tweets

Inferring expertise: Summary

23

HOW TO RANK EXPERTS ON A TOPIC?

Challenge #2

24

• Used a ranking scheme based solely on Lists

• Two components of ranking user U w.r.t. query Q
– Relevance of user to query – cover density ranking between the topic document TU of the user and Q
– Popularity of user – number of Lists including the user

Topic relevance(TU, Q) × log(#Lists including U)

Ranking experts
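The ranking formula above can be sketched as follows. This is a simplified stand-in: the paper uses cover density ranking for the relevance term, whereas this sketch uses a plain token-overlap count; the candidate users and their List counts are invented:

```python
import math

def rank_experts(query_tokens, candidates):
    """Rank candidates by relevance(topic_doc, query) * log(#Lists).
    `candidates` maps user -> (topic_document_tokens, n_lists).
    Relevance here is a token-overlap proxy, not the cover density
    ranking used in the paper."""
    scored = []
    for user, (topic_doc, n_lists) in candidates.items():
        relevance = sum(topic_doc.count(t) for t in query_tokens)
        score = relevance * math.log(n_lists) if n_lists > 1 else 0.0
        scored.append((score, user))
    return [user for score, user in sorted(scored, reverse=True)]

candidates = {
    "alice": (["politics", "senator", "congress", "government"], 500),
    "bob":   (["linux", "tech", "software", "developer"], 1200),
    "carol": (["politics", "government", "women"], 80),
}
print(rank_experts(["politics", "government"], candidates))
```

Note how the log dampens raw popularity: carol (80 Lists) still outranks bob (1,200 Lists) because her topic document matches the query.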

25

• Search system for topic experts in Twitter

• Given a query (topic)
– Identify experts on the topic using Lists
– Rank identified experts

Cognos

26

Cognos results for “politics”

27

Cognos results for “stem cell”

28

• System deployed and evaluated ‘in the wild’
• Evaluators were students & researchers from the three home institutes of the authors

Evaluation of Cognos

29

User-evaluation of Cognos

30

Sample queries for evaluation

31

• Overall 2,136 relevance judgments
– 1,680 said relevant (78.7%)

• Large amount of subjectivity in evaluations
– The same result for the same query received both relevant and non-relevant judgments
– E.g., for query “cloud computing”, Werner Vogels (chief technology officer and vice president of Amazon.com in Seattle) got 4 relevant judgments and 6 non-relevant judgments

Evaluation results

32

• Considered only the results evaluated at least twice

• Result said to be relevant if voted relevant in the majority of evaluations

• Mean Average Precision considering top 10 results: 93.9 %

Evaluation results
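The Mean Average Precision over top-10 results reported above can be computed as follows (binary relevance labels as produced by majority voting; the judgment lists below are invented for illustration):

```python
def average_precision(relevances, k=10):
    """Average precision over the top-k binary relevance judgments
    (1 = judged relevant by majority vote, 0 = not), normalized by
    the number of relevant results retrieved."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(all_relevances, k=10):
    """Mean of per-query average precisions."""
    return sum(average_precision(r, k) for r in all_relevances) / len(all_relevances)

# Two hypothetical queries with majority-voted top-10 judgments.
queries = [
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
print(round(mean_average_precision(queries), 3))
```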

33

Cognos vs. Twitter Who-To-Follow

34

• Considering 27 distinct queries asked at least twice
• Judgment by majority voting

• Cognos judged better on 12 queries
– Computer science, Linux, Mac, Apple, iPad, Internet, Windows Phone, photography, political journalist, …

• Twitter Who-To-Follow judged better on 11 queries
– Music, Sachin Tendulkar, Angelina Jolie, Harry Potter, Metallica, cloud computing, IIT Kharagpur, …

Cognos vs. Twitter Who-To-Follow

35

Results for query music

P83-36

Questions

37

• Wisdom of the Crowd
– Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
– Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi

• Tweets and Link Relation
– TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
– Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He

• #Hashtag and Link Relation
– Finding Trendsetters in Information Networks (SIGKDD 2012)
– Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

Outline

P83-38

TwitterRank: Finding Topic-sensitive Influential Twitterers

Jianshu Weng, Ee-Peng Lim, Jing Jiang
Singapore Management University

Qi He
Pennsylvania State University

39

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

40

• Given a set of twitterers, find the influential ones
– for different topics

• Challenges:
– Topics unknown

Introduction

41

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

42

• Crawled S = the set of Singapore-based twitterers on twitterholic.com with the highest number of followers

• For each twitterer in S, crawled its followers and friends to obtain the extended set S*

• For each twitterer in S*, crawled its published tweets; the set of all tweets is denoted T

Data preparation

43

|S|                              996
|S*|                             6,748 (4,050 with more than 10 tweets)
Total #tweets                    1,021,039
#following relationships         49,872
Min/Max/Avg #tweets per twitterer  1 / 3,200 / 179.57

Data preparation

Reciprocity in the Following Relationships

• Friend count = # twitterers being followed
• Follower count = # twitterers following
• Correlation between friend count and follower count

45

Reciprocity in the Following Relationships

• 72.4% of the users follow more than 80% of their followers

• 80.5% of the users have 80% of their friends follow them back

46

• Homophily
• Twitterers with “following” relationships are more similar than those without, according to the topics they are interested in

Explanations

47

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

48

• Apply LDA to distill topics automatically
• Find topics in the twitterer’s content to represent his interests
– Twitterer’s content = aggregated tweets

• Pre-processing
– Use only words without non-English characters
– Min word length = 3
– Remove @userid, URLs, all-digit words, stopwords

• Apply the analysis to twitterers with more than 10 tweets (#twitterers = 4,050)

Topic Distillation
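The pre-processing rules listed above can be sketched directly; the cleaned, per-twitterer aggregated text is then what gets fed to LDA. The stopword list here is a tiny stand-in:

```python
import re

# Stand-in stopword list; a real pipeline would use a full one.
STOPWORDS = {"the", "and", "for", "with", "this", "that"}

def preprocess(tweet):
    """Apply the slide's pre-processing rules: drop @userid mentions,
    URLs, and all-digit tokens; keep only ASCII-letter words of
    length >= 3 that are not stopwords."""
    tokens = []
    for raw in tweet.split():
        if raw.startswith("@") or raw.startswith("http"):
            continue
        word = raw.lower().strip(".,!?:;\"'()")
        if not re.fullmatch(r"[a-z]+", word):
            continue  # rejects all-digit words and non-English characters
        if len(word) >= 3 and word not in STOPWORDS:
            tokens.append(word)
    return tokens

print(preprocess("@bob check http://t.co/x LDA rocks, 2010 edition!!!"))
```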

49

• Three matrices:
– DT, a D × T matrix, where D is the number of twitterers and T is the number of topics. DT_dt contains the number of times a word in the tweets of twitterer d has been assigned to topic t
– WT, a W × T matrix, where W is the number of unique words used in the tweets and T is the number of topics. WT_wt captures the number of times unique word w has been assigned to topic t
– Z, a 1 × N vector, where N is the total number of words in the tweets. Z_n is the topic assignment for word n

Results of Topic Distillation

50

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

51

• A topic-specific random walk model is applied to calculate each twitterer’s influence score.

• The transition matrix for topic t is denoted P_t. The transition probability of the random surfer from follower s_i to friend s_j is:

  P_t(i, j) = ( |T_j| / Σ_{a: s_a ∈ F_i} |T_a| ) × sim_t(i, j)

– where F_i is the set of s_i’s friends and |T_j| is the number of tweets published by s_j
– sim_t(i, j) = 1 − |DT′_it − DT′_jt|

Topic-specific TwitterRank

DT′ is the row-normalized form of matrix DT

52

• This captures two notions:
– The more s_j publishes, the higher the portion of the tweets s_i reads comes from s_j. Generally, this leads to a higher influence of s_j on s_i

– s_j’s influence on s_i is also related to the topical similarity between the two, as suggested by the homophily phenomenon.

Topic-specific TwitterRank

53

• Topic-specific teleportation

• The influence scores of twitterers are calculated iteratively:

  TR_t = γ P_tᵀ × TR_t + (1 − γ) E_t

– E_t is the t-th column of matrix DT″, which is the column-normalized form of matrix DT

Topic-specific TwitterRank
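The iterative update above can be sketched as a small power iteration. The matrices below are toy values, not real data: P plays the role of the topic-specific transition matrix P_t (row i = follower s_i, column j = friend s_j) and E the topic-specific teleportation vector E_t:

```python
def twitterrank(P, E, gamma=0.85, iters=100):
    """Iterate TR <- gamma * P^T * TR + (1 - gamma) * E for one topic.
    P[i][j] is the transition probability of the random surfer from
    follower i to friend j; E is the teleportation vector (in the
    paper, the t-th column of the column-normalized DT matrix)."""
    n = len(E)
    tr = [1.0 / n] * n  # uniform initial scores
    for _ in range(iters):
        tr = [gamma * sum(P[i][j] * tr[i] for i in range(n))
              + (1 - gamma) * E[j]
              for j in range(n)]
    return tr

# 3 twitterers: 0 and 1 both follow 2; 2 follows nobody (pure teleport).
P = [[0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
E = [0.2, 0.3, 0.5]
scores = twitterrank(P, E)
print(max(range(3), key=lambda j: scores[j]))
```

Twitterer 2, who is followed by both others and has the largest topical presence, ends up with the highest topic-specific score.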

54

• General influence: the per-topic weights can be set as the probabilities of different topics’ presence

• Perceived general influence: they can also be set as the probabilities that a particular twitterer is interested in different topics.

Aggregation of Topic-specific TwitterRank

P83-55

Questions

56

Outline

• Wisdom of the Crowd
– Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
– Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi

• Tweets and Link Relation
– TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
– Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He

• #Hashtag and Link Relation
– Finding Trendsetters in Information Networks (SIGKDD 2012)
– Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

P83-57

Finding Trendsetters in InformationNetworks

P83-58

What is a Trendsetter?

P83-59

What is a Trendsetter?

Trendsetters are people who:

• adopt and spread new trends before these trends become popular

• propagate these trends over the network

P83-60

Finding trendsetters in a graph

P83-61

Who are the trendsetters?

P83-62

Key Point

P83-63

Time

P83-64

How to find Trendsetters?

P83-65

Weight edges and run PageRank

P83-66

Topics and Influence Model

P83-67

Topics

Topic: a collection of trends (URLs, memes, #hashtags, quotes, etc.)

For each node we store the timestamp when it adopts a trend

P83-68

Graph

• We denote G_k(N_k, E_k) as the induced graph of G(N, E) over the topic k.

• The set N_k is obtained by considering all nodes of N that used at least one trend of k

• The set E_k represents all edges (u, v) such that (u, v) ∈ E and u, v ∈ N_k

P83-69

Weight Edges

Let t_i(v) be the time when node v adopts trend i of topic k (t_i(v) = 0 if v does not adopt i).

We define two vectors, s1(v) (for all v ∈ N_k) and s2(u, v) (for all (u, v) ∈ E_k), each one with components given respectively by:

  s1(v)_i = 1 if t_i(v) > 0, and 0 otherwise

  s2(u, v)_i = e^(−Δ_i α) if t_i(v) > 0 and t_i(v) < t_i(u), and 0 otherwise

for i = 1, …, N_k, where Δ_i = t_i(u) − t_i(v) and α > 0.

P83-70

Weight Edges

Vector s1(v) indicates whether node v adopted (or not) each trend of k, while s2(u, v) shows whether u adopted these trends after v and weights the relation as a function of the period of time between t_i(v) and t_i(u).

For a fixed α, if Δ_i → 0+ then e^(−Δ_i α) → 1, and if Δ_i → +∞ then e^(−Δ_i α) → 0.

These limits mean that if node u adopts a trend just after v, then s2(u, v)_i is very close to 1.

P83-71

Weight Edges

Let G_k(N_k, E_k) be an induced graph of a network G(N, E) over a topic k with N_k trends. For each (u, v) ∈ E_k we define the influence of v over u by:

  I(u, v) = ( L(s2(u, v)) / N_k ) × ( s1(v) · s2(u, v) ) / ( ||s1(v)|| × ||s2(u, v)|| )

where the operator · refers to the scalar product, ||x|| to the Euclidean norm of any vector x, and L(s2(u, v)) to the number of components of s2(u, v) that are different from 0. If ||s2(u, v)|| = 0, we define I(u, v) = 0. It is important to notice that, by definition, ||s1(v)|| ≠ 0 for all v ∈ N_k.

P83-72

One important fact is that u can be influenced to adopt a trend of k by several nodes in N_k. So, we normalize I(u, v) as follows:

Definition
  Ĩ(u, v) = I(u, v) / Σ_{w : (u, w) ∈ E_k} I(u, w)

Normalize

P83-73

TS Ranking

Definition

The trendsetter (TS) rank of node v in a network G_k, denoted TS(v), is given by:

  TS(v) = d × Σ_u Ĩ(u, v) × TS(u) + (1 − d) × p(v)

where Ĩ is the normalized influence, 0 ≤ d ≤ 1 is the damping factor, and p is a probability distribution over all nodes of G_k. In this paper, we consider a uniform p(v) = 1/|N_k| for all v ∈ N_k, but this distribution could be topic dependent.
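The s1/s2 edge-weighting that feeds the TS rank can be sketched for a single edge. This is a hedged sketch of the time-decayed, cosine-style influence described on the preceding slides; the adoption times and α are invented:

```python
import math

def influence(t_u, t_v, alpha=0.5):
    """Influence of v over u for one topic, following the s1/s2
    construction: t_u[i], t_v[i] are the adoption times of trend i
    for nodes u and v (0 = never adopted); alpha controls the
    exponential time decay."""
    n = len(t_v)
    s1 = [1.0 if t_v[i] > 0 else 0.0 for i in range(n)]
    # s2 is nonzero only for trends v adopted strictly before u.
    s2 = [math.exp(-alpha * (t_u[i] - t_v[i]))
          if 0 < t_v[i] < t_u[i] else 0.0
          for i in range(n)]
    norm1 = math.sqrt(sum(x * x for x in s1))
    norm2 = math.sqrt(sum(x * x for x in s2))
    if norm2 == 0:
        return 0.0  # v influenced u on no trend
    nonzero = sum(1 for x in s2 if x != 0)
    cosine = sum(a * b for a, b in zip(s1, s2)) / (norm1 * norm2)
    return (nonzero / n) * cosine

# v adopts both trends at t=1; u follows both at t=2 ...
full = influence([2, 2], [1, 1])
# ... versus u adopting only the first trend after v.
partial = influence([2, 0], [1, 1])
print(round(full, 3), round(partial, 3))
```

Edges where u copies all of v's trends get full weight; missing or never-adopted trends shrink the weight, and these weights then feed the PageRank-style TS iteration.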

P83-74

Evaluation

P83-75

Baseline

In-degree ranking
PageRank

P83-76

Dataset

Twitter data until August 2009:

Over 50 million users with all their followers and followees; 1.6 billion tweets.

We use #hashtags as trends.

P83-77

Example:Iran Elections on Twitter

P83-78

Example

Iran Elections: {#iran, #iranelections,#tehran}

TS: @Lara (“Reporting from the Middle East”)
PR: @cnnbr (“CNN Breaking News”)

P83-79

Category    #Topics  Example hashtags              #Tweets
Celebrity   16       #michaeljackson, #niley       1,036,101
Games       13       #mafiawars, #ps3              2,556,437
Idioms      35       #musicmonday, #followfriday   7,882,209
Movies      29       #heroes, #tv                  1,769,945
Music       33       #lastfm, #musicmonday         2,785,522
None        153      #quotes, #sale                2,227,971
Political   39       #honduras, #iranelection      8,156,786
Sports      27       #soccer, #rugby               1,914,061
Technology  41       #twitter, #android            7,459,471

Baseline: We use the #hashtag classification made by Romero et al.

P83-80

Trendsetters: early adopters?

P83-81

Experiments I

[Bar chart: % of Top-100 users adopting before the peak, by category (IDIOMS, GAMES, POLITICAL, NONE, MOVIES, TECHNO., SPORTS, CELEBRITY, MUSIC), comparing InDegree, PageRank, and TrendSetters rankings; y-axis 0–100%]

P83-82

In-degree vs adoption time

P83-83

Experiments II

[Plot: node in-degree vs adoption time relative to the trend peak (time −100 to 20), comparing ID, PR, and TS rankings]

P83-84

Influenced Followers Ratio

P83-85

Influenced Followers Ratio

IF_k(v) is the fraction of followers of v that adopted at least one trend of the topic k after v.

Category    ID (%)  PR (%)  TS (%)
POLITICAL   0.013   0.084   0.174
CELEBRITY   0.015   0.089   0.148
MUSIC       0.013   0.096   0.160
GAMES       0.022   0.058   0.115
SPORTS      0.004   0.054   0.098
IDIOMS      0.001   0.034   0.088
NONE        0.011   0.001   0.085
TECHNOLOGY  0.006   0.054   0.078
MOVIES      0.006   0.043   0.067

P83-86

Ranking with Partial Information

[Plot: number of Top-100 users found vs ratio of users considered (sorted by time), for hashtags #musicmonday, #iranelection, #swineflu, #followfriday, #mw2, #fb, #f1, #michaeljackson, comparing TS and PR rankings]

P83-87

Final Remarks

Usually, follower hubs (celebrities) are late adopters.

Trendsetters have lower in-degree, but they spread new ideas.

P83-88

Questions
