
Page 1: Aggregating Content and Network Information to Curate ... · RSWeb-2012 Supporting User List Curation • Idea: Use data analysis to identify important users that form the “community”

Aggregating Content and Network Information to Curate Twitter User Lists

Derek Greene, Gavin Sheridan, Barry Smyth, Pádraig Cunningham

Clique: Graph & Network Analysis Cluster
School of Computer Science & Informatics
University College Dublin, Ireland

RecSys-12: Recommender Systems & The Social Web


RSWeb-2012

Task: User List Curation

• Twitter allows the grouping of users into topical user lists.

• Storyful maintains lists of users for news stories to monitor breaking news related to that story.


Supporting User List Curation

• Idea: Use data analysis to identify important users that form the “community” around a news story on Twitter.

• Goal: Recommend new users to expand an embryonic seed list, helping to find valuable content relating to the story.


Curation System Overview

[Figure: system diagram in two phases (Phase 1, Phase 2) with the components Seed List, Bootstrap, Candidate List, Recommender, Ranked List, Curator, Core List, and Updater, shown alongside an example seed set of Twitter accounts.]


Recommendation Overview

[Figure: pipeline components Candidate List, Recommender, Ranked List, Curator]

General Recommender

• Use embryonic seed list as initial training data.
• Compute training data centroid.
• Measure cosine similarity between centroid and vectors representing candidate non-seed users.
➡ Produce ranked list of top K non-seed users to present to a human curator.

Which criterion should we use for recommendation?

➡ Propose variety of vectorised approaches, using criteria based on Twitter network and content analysis.
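The general recommender steps on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the sparse dict representation, the function names, and the toy users and features are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean of a collection of sparse vectors."""
    total = {}
    for vec in vectors:
        for k, w in vec.items():
            total[k] = total.get(k, 0.0) + w
    return {k: w / len(vectors) for k, w in total.items()}

def recommend(profiles, seed_users, top_k):
    """Rank non-seed users by cosine similarity to the seed centroid."""
    c = centroid([profiles[u] for u in seed_users])
    candidates = [u for u in profiles if u not in seed_users]
    scored = [(cosine(profiles[u], c), u) for u in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [u for _, u in scored[:top_k]]

# Toy data: hypothetical users and features, not taken from the datasets.
profiles = {
    "seed_a": {"politics": 2, "election": 1},
    "seed_b": {"politics": 1, "primary": 1},
    "cand_x": {"politics": 1, "election": 1},
    "cand_y": {"football": 3},
    "cand_z": {"primary": 1, "football": 1},
}
ranked = recommend(profiles, {"seed_a", "seed_b"}, top_k=2)
```

The same `recommend` function works for any of the criteria, since each criterion only changes how the profile vectors are built.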


Content-Based Criteria

• Tweet profiles: Create a term vector for each user, containing the aggregation of their 50/100/200 most recent tweets.

User          president  health  care  reform  gender  costs  ...
@BarackObama  2          2       2     1       1       1      ...
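A tweet profile of this kind could be built roughly as below. The tokeniser (lowercase alphabetic runs, no stop-word removal), the function name `tweet_profile`, and the sample tweets are assumptions for illustration; the slides do not specify the preprocessing.

```python
import re
from collections import Counter

def tweet_profile(tweets, max_tweets=200):
    """Aggregate a user's most recent tweets into one term-count vector.

    Assumes `tweets` is ordered newest first, so slicing keeps the
    most recent `max_tweets` (50, 100 or 200 in the slides).
    """
    counts = Counter()
    for text in tweets[:max_tweets]:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

# Hypothetical tweets, newest first.
tweets = [
    "Health care reform now",
    "The president on health care",
    "President backs reform costs",
]
profile = tweet_profile(tweets)
latest_only = tweet_profile(tweets, max_tweets=1)
```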


Content-Based Criteria

• List names/descriptions: Create a term vector for each user, containing the aggregation of names and/or descriptions of user lists to which they have been assigned.

User            politics  news  useless  ...
@SarahPalinUSA  2         1     1        ...

User            elected  politics  purpose  ...
@SarahPalinUSA  1        1         1        ...


Network-Based Criteria

• Followed-by Profiles: Users are similar if they are "co-followed" by the same set of users. Represent each user as a binary follower profile vector.

• Mentioned-by Profiles: Users are similar if they are mentioned in tweets by the same users (i.e. "co-mentioned").

• Retweeted-by Profiles: Users are similar if their posts are retweeted by the same users (i.e. "co-retweeted").

User X is followed by...

User X          @TeamGB  @bradwiggins  @Mo_Farah  @britishswimming  ...
@MichaelPhelps  1        0             0          1
@chrishoy       1        1             1          0
@TomDaley1994   1        1             0          1
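For binary follower profiles, cosine similarity reduces to the size of the shared follower set divided by the geometric mean of the two follower counts. The sketch below uses the example accounts from this slide, assuming the three rows of values map to @MichaelPhelps, @chrishoy and @TomDaley1994 in the order listed; the function name is hypothetical.

```python
import math

# Follower sets read from the example profile table (row order is an assumption).
followers = {
    "@MichaelPhelps": {"@TeamGB", "@britishswimming"},
    "@chrishoy": {"@TeamGB", "@bradwiggins", "@Mo_Farah"},
    "@TomDaley1994": {"@TeamGB", "@bradwiggins", "@britishswimming"},
}

def co_followed_similarity(a, b):
    """Cosine similarity of two binary followed-by profile vectors."""
    fa, fb = followers[a], followers[b]
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / math.sqrt(len(fa) * len(fb))

phelps_daley = co_followed_similarity("@MichaelPhelps", "@TomDaley1994")
hoy_daley = co_followed_similarity("@chrishoy", "@TomDaley1994")
phelps_hoy = co_followed_similarity("@MichaelPhelps", "@chrishoy")
```

Under this reading, @MichaelPhelps and @TomDaley1994 are the most similar pair because they share two of their followers.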


Network-Based Criteria

• Co-Listed Information:
  • Other media outlets and individuals will also be simultaneously curating user lists on topics in the wider Twitosphere.
  ➡ Would like to “crowdsource” these efforts to support list curation.

[Figure: two example user lists] Users @BrianODriscoll and @KearneyRob are co-listed on both lists.


Experimental Evaluation

• Collected 10 datasets around "Super Tuesday" GOP nominations in March 2012, with annotated seed users from Storyful.

Dataset        Users  Core Users  Tweets  Friends  Followers  Listed
Alaska           948          41     185      208        269      89
Georgia          966          34     211      235        295     126
Idaho            743          20     186      264        273      47
Massachusetts    821          24     209      244        293     122
North Dakota     363          26     203      147        192      93
Ohio            1051          97     178      171        207     115
Oklahoma         693          32     205      178        211     109
Tennessee        979          48     199      170        204     112
Vermont          864          36     182      169        190      66
Virginia         877          46     200      160        199     115

Table 1: Summary of 10 Twitter datasets used in our evaluations, including the number of manually curated “core” users, together with the mean number of tweets, friends, followers, and list memberships per user.

Retweeted-by profiles. The retweet graph is a weighted directed graph, where an edge exists from one node to another if one user retweets another user. From this graph, for each user u_i we can construct a sparse real-valued retweeted-by profile vector, where an entry v_j indicates the number of times tweets posted by u_i were retweeted by another user u_j. A pair of users are deemed to be similar if their vectors have a high cosine similarity – i.e. their tweets are frequently “co-retweeted” by the same users.

Mentioned-by profiles. The mention graph is a weighted directed graph, where an edge exists from one node to another if one user mentions another user. From this graph, for each user u_i we can construct a sparse real-valued mentioned-by profile vector, where an entry v_j indicates the number of times the user u_i was mentioned in the tweets posted by another user u_j. A pair of users are deemed to be similar if their vectors have a high cosine similarity – i.e. they are “co-mentioned” by similar users.
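The retweeted-by and mentioned-by profiles are built the same way from a list of directed interactions. A minimal sketch for the mention case, assuming the input is a flat list of hypothetical (author, mentioned user) pairs, one per mention:

```python
import math
from collections import defaultdict

def mentioned_by_profiles(mentions):
    """Build weighted mentioned-by vectors from (author, mentioned_user) pairs."""
    profiles = defaultdict(lambda: defaultdict(int))
    for author, target in mentions:
        profiles[target][author] += 1   # entry v_j counts mentions of target by author
    return profiles

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mention events: u1 mentions a twice, u2 mentions a once, and so on.
mentions = [("u1", "a"), ("u1", "a"), ("u2", "a"),
            ("u1", "b"), ("u2", "b"), ("u3", "c")]
profiles = mentioned_by_profiles(mentions)
sim_ab = cosine(profiles["a"], profiles["b"])
sim_ac = cosine(profiles["a"], profiles["c"])
```

Swapping the input for (retweeter, original author) pairs gives retweeted-by profiles with no other change.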

Co-listed graph. Our primary motivation here is to identify user list members on a given topic. It may often be the case that other news organisations and private individuals will also be simultaneously curating user lists on the same topic in the wider Twitosphere. Ideally we would like to be able to “crowdsource” these efforts to support list curation. Based on existing Twitter user list memberships, we can construct a bipartite list-user graph, where an edge between a list and a user node indicates that the list contains the specified user. Using a list-user matrix representation, we can compute the cosine similarity of users with one another, indicating the similarity of their list membership profiles. Users who are more frequently assigned to the same lists, or co-listed, will be deemed to be more similar.
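The list-user matrix computation could look like the sketch below. The list names and the accounts `@user_c` and `@user_d` are hypothetical; only the two co-listed rugby accounts come from the slides.

```python
import math

# Hypothetical user lists; each maps a list name to its members.
lists = {
    "irish-rugby": ["@BrianODriscoll", "@KearneyRob", "@user_c"],
    "sports-stars": ["@BrianODriscoll", "@KearneyRob"],
    "news": ["@user_c", "@user_d"],
}
users = sorted({u for members in lists.values() for u in members})

# Binary list-user matrix: one row per list, one column per user.
matrix = [[1 if u in members else 0 for u in users] for members in lists.values()]

def column(j):
    """List-membership profile of the j-th user."""
    return [row[j] for row in matrix]

def co_listed_similarity(a, b):
    """Cosine similarity between the membership profiles of two users."""
    va, vb = column(users.index(a)), column(users.index(b))
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(va) * sum(vb))  # for 0/1 vectors, sum of squares equals sum
    return dot / norm if norm else 0.0

same_lists = co_listed_similarity("@BrianODriscoll", "@KearneyRob")
one_shared = co_listed_similarity("@BrianODriscoll", "@user_c")
none_shared = co_listed_similarity("@BrianODriscoll", "@user_d")
```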

4. EVALUATION

4.1 Datasets

For evaluation purposes, we constructed a collection of Twitter datasets focused on news surrounding the Republican nomination for the United States presidential election of 2012. Specifically, these datasets focus on ten states where the nomination was decided on “Super Tuesday” (March 6, 2012). For each state, our partners at Storyful manually curated a user list containing between 20 and 97 “core” users.

For each core user, we gathered a maximum of approximately 300 tweets, friends, followers, and user list memberships using the Twitter API. For any user list that we encountered, we also retrieved its associated name and description, if available. These limits reflect a realistic amount of data that might be retrieved when monitoring multiple news stories in real-time and taking into account the comparatively strict query limits imposed by the Twitter API.

We then retrieved the same data for a set of “non-core” users, constructed from up to 1,000 of the most frequently followed users relative to the core set. The rationale here is that, while they may not have been manually selected as being relevant to the news story, they remain tied to the core set in some way. This yielded ten datasets for testing curation criteria, containing on average ≈ 831 total users, of which ≈ 5% are annotated as core users. In total 1,618,383 tweets from 8,305 unique users were collected. Details of these datasets are listed in Table 1. These datasets are made available online in pre-processed form (see footnote 3).
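The summary figures quoted above can be checked directly against the per-dataset counts in Table 1. The short check below only re-adds the published numbers; the variable names are illustrative.

```python
# (users, core users) per dataset, copied from Table 1.
table1 = {
    "Alaska": (948, 41), "Georgia": (966, 34), "Idaho": (743, 20),
    "Massachusetts": (821, 24), "North Dakota": (363, 26), "Ohio": (1051, 97),
    "Oklahoma": (693, 32), "Tennessee": (979, 48), "Vermont": (864, 36),
    "Virginia": (877, 46),
}

total_users = sum(users for users, _ in table1.values())
total_core = sum(core for _, core in table1.values())
mean_users = total_users / len(table1)   # the text rounds this to about 831
core_share = total_core / total_users    # the text rounds this to about 5%
```

The user counts sum to 8,305, matching the stated number of unique users, and the 404 core users make up just under 5% of that total.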

4.2 Comparison of Criteria

To compare the criteria introduced in Section 3, we performed cross-validation on the ten datasets, using the annotated core users as training data. Due to the differing number of core users in each dataset, the number of folds for a given dataset was selected from the range [2, 5], to ensure there were at least 10 users in each held-out test set. The cross-validation process was repeated for 250 runs, with mean precision and recall scores computed for the top k ∈ [10, 50] non-core user recommendations produced by each criterion. In the case of content-based datasets, we applied standard log-based TF-IDF normalisation prior to generating recommendations.
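The text does not give the exact TF-IDF variant. One common log-based form weights each term count by log(N / df), where N is the number of users and df is the number of user profiles containing the term. The sketch below assumes that form; the function name and sample profiles are hypothetical.

```python
import math

def tfidf(profiles):
    """Apply tf * log(N / df) weighting to a dict of user -> term-count vectors."""
    n_users = len(profiles)
    df = {}
    for vec in profiles.values():
        for term in vec:
            df[term] = df.get(term, 0) + 1
    return {
        user: {term: tf * math.log(n_users / df[term]) for term, tf in vec.items()}
        for user, vec in profiles.items()
    }

# Hypothetical term-count profiles for three users.
profiles = {
    "u1": {"politics": 2, "ohio": 1},
    "u2": {"politics": 1},
    "u3": {"politics": 3, "vermont": 2},
}
weights = tfidf(profiles)
```

Under this weighting a term used by every user, such as "politics" here, receives zero weight, so the cosine similarity is driven by terms that distinguish users.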

Firstly, to examine the diversity of recommendations produced by the various criteria, Fig. 1 illustrates the agreement between rankings generated across all 10 × 250 runs, in terms of their pairwise Spearman rank correlations. These aggregated correlations indicate that there are a number of distinct signals present across different views of the same data. As we would expect, the different tweet profile sets are very highly correlated. However, the rankings produced by list content text (i.e. list names and descriptions) are considerably different, and correlated far more highly with the corresponding list memberships (i.e. rankings generated on the co-listed graph). The latter criterion also shares some similarity with another network view, provided by the followed-

Footnote 3: http://mlg.ucd.ie/curation


Experimental Setup

• Ran 250 × k-fold cross-validation experiments per dataset:
  - Held out a proportion of the seed set as test data.
  - Used the remaining seed users as training data.
  - Ranked users in test data using each criterion.
  - Calculated precision and recall relative to the test data.

[Figure: seed set split into training users and test users]
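One run of the experimental setup can be sketched with two helpers: a k-fold split of the seed set and a precision/recall calculation for a ranked list. The function names, the interleaved fold assignment and the fixed random seed are assumptions; the actual ranking step would come from whichever criterion is being tested.

```python
import random

def k_fold_splits(seed_users, n_folds, rng):
    """Yield (training users, held-out test users) for each fold."""
    users = sorted(seed_users)
    rng.shuffle(users)
    folds = [users[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        yield train, test

def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the top-k ranked users against held-out users."""
    hits = sum(1 for u in ranked[:k] if u in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical seed set of 20 users split into 2 folds of 10.
seeds = [f"user{i}" for i in range(20)]
splits = list(k_fold_splits(seeds, n_folds=2, rng=random.Random(0)))
precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=2)
```

Repeating the split 250 times with different shuffles and averaging the scores gives the mean precision and recall described on this slide.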


Comparison of Criteria

• Looked at diversity among the recommendations from the different criteria...

➡ We see several distinct signals present across the different views.
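Agreement between two criteria can be measured by the Spearman rank correlation of their rankings. For two tie-free rankings of the same n users this is 1 - 6 Σd² / (n(n² - 1)), where d is the difference in a user's position between the two rankings. A minimal sketch with hypothetical rankings:

```python
def spearman(ranking_a, ranking_b):
    """Spearman rank correlation for two tie-free rankings of the same users."""
    n = len(ranking_a)
    pos_b = {user: i for i, user in enumerate(ranking_b)}
    d2 = sum((i - pos_b[user]) ** 2 for i, user in enumerate(ranking_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical rankings of four users produced by three criteria.
tweets_ranking = ["u1", "u2", "u3", "u4"]
lists_ranking = ["u1", "u3", "u2", "u4"]
reversed_ranking = ["u4", "u3", "u2", "u1"]

similar = spearman(tweets_ranking, lists_ranking)
opposite = spearman(tweets_ranking, reversed_ranking)
```

Values near 1 mean two criteria recommend users in almost the same order, while lower values indicate that the criteria carry different signals.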

[Figure: pairwise rank correlations between the rankings produced by each criterion: Co-listed, Followed-by, List descriptions, List merged, List names, Mentioned-by, Retweeted-by, Tweets (50), Tweets (100), Tweets (200).]

[Figure: bar chart of % top 3 placements for precision across all experiments (y-axis 0% to 100%) for each criterion: Mentioned-by, Followed-by, List names, Co-listed, Tweets (200), List merged, Tweets (100), List descriptions, Tweets (50), Retweeted-by.]

➡ No single criterion performs consistently well on all datasets.


Aggregating Multiple Rankings

• SVD Aggregation: Combine rankings generated on top 5 individual criteria into a single matrix, apply Singular Value Decomposition, then rank values in first singular vector.
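A rough sketch of this kind of aggregation is given below. Several details are assumptions rather than the authors' method: each ranking is converted to a score of n minus position, the users form the matrix rows, and the first left singular vector is approximated by power iteration instead of a full SVD routine so that the example needs no external library.

```python
import math

def leading_left_singular_vector(matrix, iterations=200):
    """Approximate the first left singular vector by power iteration on A·Aᵀ."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    for _ in range(iterations):
        # v = Aᵀu, then u = Av, then normalise u to unit length.
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u

def svd_aggregate(rankings):
    """Combine ranked user lists (best first, same users) into one ranking."""
    users = sorted(rankings[0])
    n = len(users)
    # Score = n - position, so the top-ranked user gets the largest value.
    matrix = [[n - r.index(u) for r in rankings] for u in users]
    u_vec = leading_left_singular_vector(matrix)
    return sorted(users, key=lambda u: -u_vec[users.index(u)])

# Hypothetical rankings from three criteria over the same four users.
rankings = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["b", "a", "c", "d"],
]
consensus = svd_aggregate(rankings)
```

Because all scores are positive, starting the iteration from an all-ones vector keeps the singular vector non-negative, so larger entries correspond to users ranked highly by most criteria.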

[Figure: Comparison of SVD Aggregation v Top 5 Individual Criteria. Bar chart of % top 3 placements for precision across all experiments (y-axis 0% to 100%) for: SVD, Mentioned-by, Co-listed, Followed-by, Tweets (200), List names.]


User List Curation System


Conclusions

• Summary:
  • Proposed variety of criteria for user list building.
  • Performed a comprehensive comparison of individual criteria, demonstrating weaknesses of both content and network-based strategies.
  • Demonstrated that more accurate results can be achieved using SVD aggregation.

• Future Work:
  • Stratification of networks to target recommendations for niche communities - avoiding the "filter bubble".
  • Support for curation across multiple social networks.