People Search @ Study Group MSRA NLC


People Search, TwitterRank, and Finding Trendsetters in Twitter

Beijing, September, 2012

MSRA NLC Study Group

Yi Lu, Jie Liu

2

• Input: a query specifying an expertise topic, such as database systems or software engineering.

• Output: a list of people ranked by topic relevance.

Background: People Search

3

• Input:

• Output:

An Illustrative Example

4

• A student looks for a machine learning supervisor

• A patient looks for doctors who have many successful cases on his disease

• A historian looks for people who have expertise on Maya culture

• A CTO looks for engineers who have related skills

• …

Scenarios of People Search

5

• Identify opinion leaders and experts
• Advertisement
• Turn to somebody for help
• Select a team for a specific task
• Many challenges remain.

Motivation

6

• Wisdom of the Crowd
– Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
– Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi

• Tweets and Link Relation
– TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
– Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He

• #Hashtag and Link Relation
– Finding Trendsetters in Information Networks (SIGKDD 2012)
– Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

Outline

P83-7

Cognos: Crowdsourcing Search for Topic Experts in Microblogs

Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto,

Niloy Ganguly, Krishna P. Gummadi

8

• Twitter is now an important source of current news
– 500 million users post 400 million tweets daily

• The quality of tweets posted by different users varies widely
– News, pointless babble, conversational tweets, spam, …

• Challenge: find topic experts
– Sources of authoritative information on specific topics

Topic experts in Twitter

9

• Existing approaches
– Research studies: Pal [WSDM 11], Weng [WSDM 10]
– Application systems: Twitter Who-To-Follow, Wefollow, …

• Existing approaches primarily rely on information provided by the user herself
– Bio, contents of tweets, network features, e.g. #followers

• We rely on the “wisdom of the Twitter crowd”
– How do others describe a user?

Identifying topic experts in Twitter

10

• Challenges in designing a search system for topic experts in Twitter
– How to infer the topics of expertise of an individual Twitter user?
– How to rank the relative expertise of users identified as experts on a topic?

Challenges

11

HOW TO INFER TOPICS OF EXPERTISE OF TWITTER USERS?

Challenge #1

12

13

• A feature to organize tweets received from the people whom a user is following

• Create a List, add name & description, add Twitter users to the list

• Tweets from all listed users will be available as a separate List stream

Twitter Lists

14

• Collect Lists containing a given user U

• Identify U’s topics from List meta-data
– Basic NLP techniques
– Extract nouns and adjectives

• Extracted words are collected to obtain a topic document for the user

[movies tv hollywood stars entertainment celebrity hollywood …]

Mining Lists to infer expertise
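The List-mining step above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the authors extract nouns and adjectives with basic NLP, while this sketch only tokenizes List names and descriptions and drops short words and stopwords (the stopword list and sample Lists are invented):

```python
import re
from collections import Counter

# Tiny stand-in stopword list; a real system would use a full one.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "to", "on"}

def topic_document(list_metadata):
    """Build a topic document for a user from the names and
    descriptions of the Lists that include them. The paper extracts
    nouns and adjectives; this sketch simply tokenizes, lowercases,
    and filters stopwords and words shorter than 3 characters."""
    words = []
    for name, description in list_metadata:
        for token in re.findall(r"[a-z]+", (name + " " + description).lower()):
            if len(token) >= 3 and token not in STOPWORDS:
                words.append(token)
    return Counter(words)

lists = [
    ("Movies & TV", "hollywood stars and entertainment"),
    ("Celebrity news", "hollywood celebrity gossip"),
]
doc = topic_document(lists)
print(doc.most_common(3))
```

Repeated words across many Lists (here "hollywood", "celebrity") dominate the topic document, which is what makes the crowd signal robust.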

15

• Collected Lists of 55 million Twitter users who joined before or in 2009

• All analyses consider 1.3 million users who are included in 10 or more Lists

Dataset

16

linux, tech, open, software, libre, gnu, computer, developer, ubuntu, unix

politics, senator, congress, government, republicans, Iowa, gop, conservative

politics, senate, government, congress, democrats, Missouri, progressive, women

Topics extracted from Lists

17

Most common words from tweets:
love, daily, people, time, GUI, movie, video, life, happy, game, cool

Most common words from Lists:
celeb, actor, famous, movie, stars, comedy, music, Hollywood, pop culture

Profile bio

Lists vs. other features

18

Most common words from tweets:
Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono

Most common words from Lists:
celeb, funny, humor, music, movies, laugh, comics, television, entertainers

Profile bio

Lists vs. other features

19

• Top 20 WTF results for 200 queries: 3,495 users

• Do the results returned by Cognos cover the results returned by Twitter WTF?

• For 83.4% of the users, yes

• For the remaining 16.6%, manual inspection of a random sample shows two major cases

Cognos vs. Twitter Who-To-Follow

20

Case 1 – topics inferred from Lists include semantically similar words, but not the exact query-word

We can find Twitterer dineLA in Twitter if the query is “dining”
Topics from Lists – food, restaurant, recipes, Los Angeles

We can find space explorer HubbleHugger77 in Twitter if the query is “hubble”
Topics from Lists – science, tech, space, cosmology, NASA

More than one way to express an idea

21

Case 2 – results returned by Twitter are unrelated to the query

Twitter returns comedian jimmyfallon if the query is “astrophysicist”
Topics from Lists – celebs, comedy, humor, actor

22

• List-based method provides accurate & comprehensive inference of topics of expertise of Twitter users

• In many cases, more accurate than existing approaches that utilize profile information or tweets

Inferring expertise: Summary

23

HOW TO RANK EXPERTS ON A TOPIC?

Challenge #2

24

• Used a ranking scheme based solely on Lists

• Two components of ranking user U w.r.t. query Q
– Relevance of user to query – cover density ranking between the topic document TU of the user and Q
– Popularity of user – number of Lists including the user

Topic relevance(TU, Q) × log(#Lists including U)

Ranking experts
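The ranking formula above can be sketched as follows. This is a simplified stand-in: the paper uses cover density ranking for the relevance term, whereas this sketch uses a plain token-overlap count; the candidate users and their List counts are invented:

```python
import math

def rank_experts(query_tokens, candidates):
    """Rank candidates by relevance(topic_doc, query) * log(#Lists).
    `candidates` maps user -> (topic_document_tokens, n_lists).
    Relevance here is a token-overlap proxy, not the cover density
    ranking used in the paper."""
    scored = []
    for user, (topic_doc, n_lists) in candidates.items():
        relevance = sum(topic_doc.count(t) for t in query_tokens)
        score = relevance * math.log(n_lists) if n_lists > 1 else 0.0
        scored.append((score, user))
    return [user for score, user in sorted(scored, reverse=True)]

candidates = {
    "alice": (["politics", "senator", "congress", "government"], 500),
    "bob":   (["linux", "tech", "software", "developer"], 1200),
    "carol": (["politics", "government", "women"], 80),
}
print(rank_experts(["politics", "government"], candidates))
```

Note how the log dampens raw popularity: carol (80 Lists) still outranks bob (1,200 Lists) because her topic document matches the query.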

25

• Search system for topic experts in Twitter

• Given a query (topic)
– Identify experts on the topic using Lists
– Rank identified experts

Cognos

26

Cognos results for “politics”

27

Cognos results for “stem cell”

28

• System deployed and evaluated ‘in the wild’
• Evaluators were students & researchers from the three home institutes of the authors

Evaluation of Cognos

29

User-evaluation of Cognos

30

Sample queries for evaluation

31

• Overall 2,136 relevance judgments
– 1,680 said relevant (78.7%)

• Large amount of subjectivity in evaluations
– The same result for the same query received both relevant and non-relevant judgments
– E.g., for query “cloud computing”, Werner Vogels (chief technology officer and vice president of Amazon.com in Seattle) got 4 relevant judgments and 6 non-relevant judgments

Evaluation results

32

• Considered only the results evaluated at least twice

• Result said to be relevant if voted relevant in the majority of evaluations

• Mean Average Precision considering top 10 results: 93.9 %

Evaluation results
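The Mean Average Precision over top-10 results reported above can be computed as follows (binary relevance labels as produced by majority voting; the judgment lists below are invented for illustration):

```python
def average_precision(relevances, k=10):
    """Average precision over the top-k binary relevance judgments
    (1 = judged relevant by majority vote, 0 = not), normalized by
    the number of relevant results retrieved."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(all_relevances, k=10):
    """Mean of per-query average precisions."""
    return sum(average_precision(r, k) for r in all_relevances) / len(all_relevances)

# Two hypothetical queries with majority-voted top-10 judgments.
queries = [
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
print(round(mean_average_precision(queries), 3))
```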

33

Cognos vs. Twitter Who-To-Follow

34

• Considering 27 distinct queries asked at least twice
• Judgment by majority voting

• Cognos judged better on 12 queries
– Computer science, Linux, Mac, Apple, iPad, Internet, Windows Phone, photography, political journalist, …

• Twitter Who-To-Follow judged better on 11 queries
– Music, Sachin Tendulkar, Angelina Jolie, Harry Potter, Metallica, cloud computing, IIT Kharagpur, …

Cognos vs. Twitter Who-To-Follow

35

Results for query music

P83-36

Questions

37

• Wisdom of the Crowd
– Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
– Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi

• Tweets and Link Relation
– TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
– Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He

• #Hashtag and Link Relation
– Finding Trendsetters in Information Networks (SIGKDD 2012)
– Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

Outline

P83-38

TwitterRank: Finding Topic-sensitive Influential Twitterers

Jianshu Weng, Ee-Peng Lim, Jing Jiang
Singapore Management University

Qi He
Pennsylvania State University

39

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

40

• Given a set of twitterers, find the influential ones
– for different topics

• Challenges:
– Topics unknown

Introduction

41

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

42

• Crawled S = the set of Singapore-based twitterers on twitterholic.com with the highest number of followers

• For each twitterer in S, crawled its followers and friends to obtain the extended set S*

• For each twitterer in S*, crawled its published tweets; the set of all tweets is denoted T

Data preparation

43

|S|                              996
|S*|                             6,748 (4,050 with more than 10 tweets)
Total #tweets                    1,021,039
#following relationships         49,872
Min/Max/Avg #tweets per twitterer  1 / 3,200 / 179.57

Data preparation

Reciprocity in the Following Relationships

• Friend count = # twitterers being followed
• Follower count = # twitterers following
• Correlation between friend count and follower count

45

Reciprocity in the Following Relationships

• 72.4% of the users follow more than 80% of their followers

• 80.5% of the users have 80% of their friends follow them back

46

• Homophily
• Twitterers with “following” relationships are more similar than those without, according to the topics they are interested in

Explanations

47

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

48

• Apply LDA to distill topics automatically
• Find topics in the twitterer’s content to represent his interests
– Twitterer’s content = aggregated tweets

• Pre-processing
– Use only words without non-English characters
– Min word length = 3
– Remove @userid, URLs, all-digit words, stopwords

• Apply the analysis to twitterers with more than 10 tweets (#twitterers = 4,050)

Topic Distillation
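The pre-processing rules listed above can be sketched directly; the cleaned, per-twitterer aggregated text is then what gets fed to LDA. The stopword list here is a tiny stand-in:

```python
import re

# Stand-in stopword list; a real pipeline would use a full one.
STOPWORDS = {"the", "and", "for", "with", "this", "that"}

def preprocess(tweet):
    """Apply the slide's pre-processing rules: drop @userid mentions,
    URLs, and all-digit tokens; keep only ASCII-letter words of
    length >= 3 that are not stopwords."""
    tokens = []
    for raw in tweet.split():
        if raw.startswith("@") or raw.startswith("http"):
            continue
        word = raw.lower().strip(".,!?:;\"'()")
        if not re.fullmatch(r"[a-z]+", word):
            continue  # rejects all-digit words and non-English characters
        if len(word) >= 3 and word not in STOPWORDS:
            tokens.append(word)
    return tokens

print(preprocess("@bob check http://t.co/x LDA rocks, 2010 edition!!!"))
```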

49

• Three matrices:
– DT, a D × T matrix, where D is the number of twitterers and T is the number of topics. DT_dt contains the number of times a word in the tweets of twitterer d has been assigned to topic t
– WT, a W × T matrix, where W is the number of unique words used in the tweets and T is the number of topics. WT_wt captures the number of times unique word w has been assigned to topic t
– Z, a 1 × N vector, where N is the total number of words in the tweets. Z_n is the topic assignment for word n

Results of Topic Distillation

50

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

51

• A topic-specific random walk model is applied to calculate each twitterer’s influence score.

• The transition matrix for topic t is denoted P_t. The transition probability of the random surfer from follower s_i to friend s_j is:

  P_t(i, j) = ( |T_j| / Σ_{a: s_a ∈ F_i} |T_a| ) × sim_t(i, j)

– where F_i is the set of s_i’s friends and |T_j| is the number of tweets published by s_j
– sim_t(i, j) = 1 − |DT′_it − DT′_jt|

Topic-specific TwitterRank

DT′ is the row-normalized form of matrix DT

52

• This captures two notions:
– The more s_j publishes, the higher the portion of the tweets s_i reads comes from s_j. Generally, this leads to a higher influence of s_j on s_i

– s_j’s influence on s_i is also related to the topical similarity between the two, as suggested by the homophily phenomenon.

Topic-specific TwitterRank

53

• Topic-specific teleportation

• The influence scores of twitterers are calculated iteratively:

  TR_t = γ P_tᵀ × TR_t + (1 − γ) E_t

– E_t is the t-th column of matrix DT″, which is the column-normalized form of matrix DT

Topic-specific TwitterRank
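The iterative update above can be sketched as a small power iteration. The matrices below are toy values, not real data: P plays the role of the topic-specific transition matrix P_t (row i = follower s_i, column j = friend s_j) and E the topic-specific teleportation vector E_t:

```python
def twitterrank(P, E, gamma=0.85, iters=100):
    """Iterate TR <- gamma * P^T * TR + (1 - gamma) * E for one topic.
    P[i][j] is the transition probability of the random surfer from
    follower i to friend j; E is the teleportation vector (in the
    paper, the t-th column of the column-normalized DT matrix)."""
    n = len(E)
    tr = [1.0 / n] * n  # uniform initial scores
    for _ in range(iters):
        tr = [gamma * sum(P[i][j] * tr[i] for i in range(n))
              + (1 - gamma) * E[j]
              for j in range(n)]
    return tr

# 3 twitterers: 0 and 1 both follow 2; 2 follows nobody (pure teleport).
P = [[0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
E = [0.2, 0.3, 0.5]
scores = twitterrank(P, E)
print(max(range(3), key=lambda j: scores[j]))
```

Twitterer 2, who is followed by both others and has the largest topical presence, ends up with the highest topic-specific score.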

54

• General influence: the per-topic weights can be set as the probabilities of different topics’ presence

• Perceived general influence: they can also be set as the probabilities that a particular twitterer is interested in different topics.

Aggregation of Topic-specific TwitterRank

P83-55

Questions

56

Outline

• Wisdom of the Crowd
– Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
– Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi

• Tweets and Link Relation
– TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
– Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He

• #Hashtag and Link Relation
– Finding Trendsetters in Information Networks (SIGKDD 2012)
– Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

P83-57

Finding Trendsetters in InformationNetworks

P83-58

What is a Trendsetter?

P83-59

What is a Trendsetter?

Trendsetters are people who:

• adopt and spread new trends before these trends become popular

• propagate these trends over the network

P83-60

Finding trendsetters in a graph

P83-61

Who are the trendsetters?

P83-62

Key Point

P83-63

Time

P83-64

How to find Trendsetters?

P83-65

Weight edges and run PageRank

P83-66

Topics and Influence Model

P83-67

Topics

Topic: a collection of trends (URLs, memes, #hashtags, quotes, etc.)

For each node we store the timestamp when it adopts a trend

P83-68

Graph

• We denote G_k(N_k, E_k) as the induced graph of G(N, E) over the topic k.

• The set N_k is obtained by considering all nodes of N that used at least one trend of k

• The set E_k represents all edges (u, v) such that (u, v) ∈ E and u, v ∈ N_k

P83-69

Weight Edges

Let t_i(v) be the time when node v adopts trend i of topic k (t_i(v) = 0 if v does not adopt i).

We define two vectors, s1(v) (for all v ∈ N_k) and s2(u, v) (for all (u, v) ∈ E_k), each one with components given respectively by:

  s1(v)_i = 1 if t_i(v) > 0, and 0 otherwise

  s2(u, v)_i = e^(−Δ_i α) if t_i(v) > 0 and t_i(v) < t_i(u), and 0 otherwise

for i = 1, …, N_k, where Δ_i = t_i(u) − t_i(v) and α > 0.

P83-70

Weight Edges

Vector s1(v) indicates whether node v adopted (or not) each trend of k, while s2(u, v) shows whether u adopted these trends after v and weights the relation as a function of the period of time between t_i(v) and t_i(u).

For a fixed α, if Δ_i → 0+ then e^(−Δ_i α) → 1, and if Δ_i → +∞ then e^(−Δ_i α) → 0.

These limits mean that if node u adopts a trend just after v, then s2(u, v)_i is very close to 1.

P83-71

Weight Edges

Let G_k(N_k, E_k) be an induced graph of a network G(N, E) over a topic k with N_k trends. For each (u, v) ∈ E_k we define the influence of v over u by:

  I(u, v) = ( L(s2(u, v)) / N_k ) × ( s1(v) · s2(u, v) ) / ( ||s1(v)|| × ||s2(u, v)|| )

where the operator · refers to the scalar product, ||x|| to the Euclidean norm of any vector x, and L(s2(u, v)) to the number of components of s2(u, v) that are different from 0. If ||s2(u, v)|| = 0, we define I(u, v) = 0. It is important to notice that, by definition, ||s1(v)|| ≠ 0 for all v ∈ N_k.

P83-72

One important fact is that u can be influenced to adopt a trend of k by several nodes in N_k. So, we normalize I(u, v) as follows:

Definition
  Ĩ(u, v) = I(u, v) / Σ_{w : (u, w) ∈ E_k} I(u, w)

Normalize

P83-73

TS Ranking

Definition

The trendsetter (TS) rank of node v in a network G_k, denoted TS(v), is given by:

  TS(v) = d × Σ_u Ĩ(u, v) × TS(u) + (1 − d) × p(v)

where Ĩ is the normalized influence, 0 ≤ d ≤ 1 is the damping factor, and p is a probability distribution over all nodes of G_k. In this paper, we consider a uniform p(v) = 1/|N_k| for all v ∈ N_k, but this distribution could be topic dependent.
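The s1/s2 edge-weighting that feeds the TS rank can be sketched for a single edge. This is a hedged sketch of the time-decayed, cosine-style influence described on the preceding slides; the adoption times and α are invented:

```python
import math

def influence(t_u, t_v, alpha=0.5):
    """Influence of v over u for one topic, following the s1/s2
    construction: t_u[i], t_v[i] are the adoption times of trend i
    for nodes u and v (0 = never adopted); alpha controls the
    exponential time decay."""
    n = len(t_v)
    s1 = [1.0 if t_v[i] > 0 else 0.0 for i in range(n)]
    # s2 is nonzero only for trends v adopted strictly before u.
    s2 = [math.exp(-alpha * (t_u[i] - t_v[i]))
          if 0 < t_v[i] < t_u[i] else 0.0
          for i in range(n)]
    norm1 = math.sqrt(sum(x * x for x in s1))
    norm2 = math.sqrt(sum(x * x for x in s2))
    if norm2 == 0:
        return 0.0  # v influenced u on no trend
    nonzero = sum(1 for x in s2 if x != 0)
    cosine = sum(a * b for a, b in zip(s1, s2)) / (norm1 * norm2)
    return (nonzero / n) * cosine

# v adopts both trends at t=1; u follows both at t=2 ...
full = influence([2, 2], [1, 1])
# ... versus u adopting only the first trend after v.
partial = influence([2, 0], [1, 1])
print(round(full, 3), round(partial, 3))
```

Edges where u copies all of v's trends get full weight; missing or never-adopted trends shrink the weight, and these weights then feed the PageRank-style TS iteration.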

P83-74

Evaluation

P83-75

Baseline

In-degree ranking
PageRank

P83-76

Dataset

Twitter data until August 2009:

Over 50 million users with all their followers and followees; 1.6 billion tweets.

We use #hashtags as trends.

P83-77

Example:Iran Elections on Twitter

P83-78

Example

Iran Elections: {#iran, #iranelections,#tehran}

TS: @Lara (“Reporting from the Middle East”)
PR: @cnnbr (“CNN Breaking News”)

P83-79

Category    #Topics  Example hashtags              #Tweets
Celebrity   16       #michaeljackson, #niley       1,036,101
Games       13       #mafiawars, #ps3              2,556,437
Idioms      35       #musicmonday, #followfriday   7,882,209
Movies      29       #heroes, #tv                  1,769,945
Music       33       #lastfm, #musicmonday         2,785,522
None        153      #quotes, #sale                2,227,971
Political   39       #honduras, #iranelection      8,156,786
Sports      27       #soccer, #rugby               1,914,061
Technology  41       #twitter, #android            7,459,471

Baseline: We use the #hashtag classification made by Romero et al.

P83-80

Trendsetters: early adopters?

P83-81

Experiments I

[Bar chart: % of Top-100 users adopting before the peak, by category (IDIOMS, GAMES, POLITICAL, NONE, MOVIES, TECHNO., SPORTS, CELEBRITY, MUSIC), comparing InDegree, PageRank, and TrendSetters rankings; y-axis 0–100%]

P83-82

In-degree vs adoption time

P83-83

Experiments II

[Plot: node in-degree vs adoption time relative to the trend peak (time −100 to 20), comparing ID, PR, and TS rankings]

P83-84

Influenced Followers Ratio

P83-85

Influenced Followers Ratio

IF_k(v) is the fraction of followers of v that adopted at least one trend of the topic k after v.

Category    ID (%)  PR (%)  TS (%)
POLITICAL   0.013   0.084   0.174
CELEBRITY   0.015   0.089   0.148
MUSIC       0.013   0.096   0.160
GAMES       0.022   0.058   0.115
SPORTS      0.004   0.054   0.098
IDIOMS      0.001   0.034   0.088
NONE        0.011   0.001   0.085
TECHNOLOGY  0.006   0.054   0.078
MOVIES      0.006   0.043   0.067

P83-86

Ranking with Partial Information

[Plot: number of Top-100 users found vs ratio of users considered (sorted by time), for hashtags #musicmonday, #iranelection, #swineflu, #followfriday, #mw2, #fb, #f1, #michaeljackson, comparing TS and PR rankings]

P83-87

Final Remarks

Usually, follower hubs (celebrities) are late adopters.

Trendsetters have lower in-degree, but they spread new ideas.

P83-88

Questions
