Community-Centric Information Exploration on Social Content Sites Sihem Amer-Yahia and Cong Yu Yahoo! Research New York M3SN, Shanghai, China March 29 th , 2009


Community-Centric Information Exploration on Social Content Sites

Sihem Amer-Yahia and Cong Yu

Yahoo! Research New York
M3SN, Shanghai, China

March 29th, 2009


Acknowledgement

• Michael Benedikt, Oxford

• Alban Galland, INRIA

• Jian Huang, Penn State

• Laks Lakshmanan, UBC

• Julia Stoyanovich, Columbia

Acknowledgement (thanks to Facebook)


Social Networking Sites Are Popular!


An Emerging Trend: Social Content Integration


Social Content Sites (SCS)

• Web destinations that let users:
  – Consume content-oriented information: videos, photos, news articles, etc.
  – Engage in social activities with their friends and people of similar interests
• Two major driving factors:
  – Incorporating social activities improves the attractiveness of traditional content sites
    • The “similar traveler” feature significantly increases the time users spend on Yahoo! Travel
  – Incorporating content is critical to the value of social networking sites
    • A significant amount of user time is spent browsing other people’s photos, notes, etc.

• Privacy: important but not the focus here


Main Challenges in SCS

• Social Content Management
  – Data must be maintained and analyzed at web scale: the underlying social content graph can be enormous. (Edward Chang’s keynote later)
  – Information has to be integrated from various places.
• Information Discovery
  – Current: pure content-based search or simple popularity/rating-based browsing
  – Future: effectively and efficiently leverage the underlying social graph in more sophisticated ways
• Information Presentation
  – Current: a single ranked list based on relevance
  – Future: metrics other than relevance / groups and clusters / explanations

Information Exploration (covers Information Discovery and Information Presentation)


Talk Outline

• Emerging Trend of Social Content Sites (SCS)

• Information Exploration on SCS
  – Community Centric Model
  – Information Discovery: Improving Search Performance through Clustering
  – Information Presentation: Diversification via Explanation

• SocialScope and Jelly

• Further Challenges

• Conclusion


Information Exploration on SCS

• Major Paradigms
  – User-based: browsing the content of your friends or of other users you simply follow (Facebook, Twitter)
  – Search: content-based keyword matching (most traditional content sites)
  – Recommendations: hotlists, tag-based hotlists (Amazon, many collaborative tagging sites)
  – Query: database-style querying with complex conditions (not many so far)

But which ones are more frequent?


Search vs. Recommendation

• The majority of those searches are in fact recommendations in disguise!
  – Users are usually NOT looking for specific items
  – The majority of sessions seek recommendations with geographical and topical constraints
• Ranking the paradigms (for neurotic people who insist on orders):
  “Recommendation” ~ User-based > Search > Query


Recommendation 2.0 (or 1.5)

• Users demand a lot more from us
  – Be able to specify the topic: “what are the good destinations for a romantic trip in Asia?”
  – Be able to specify the community: “what’s hot among my tech-savvy friends?”
  – Be able to present the results in ways that maximize the understanding of those results
• But we do know a lot more about the users and the contents (through the users)
  – Richer user profiles
  – Friendships and relationships
  – Content activities (rating movies, tagging travel destinations, etc.)
  – Communication activities (IM, email, Twitter follows, Facebook wall messages, etc.)


Data Model: Social Content Graph

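[Figure: example social content graph. People nodes (John, Joe, Amy; ids 101 to 103, type people) and destination nodes (NYC: id 301, state NY; AA: id 302, state MI, college umich) are connected by labeled edges such as friend and visit; other labels shown include “fun, club”, “Yankees”, and “Football”.]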

A social content graph is a logical graph structure where the labeled nodes represent users and objects, and the labeled edges represent relations between users and items, as well as activities users perform on items or other users. Both nodes and edges can have structural attributes.
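The definition above can be sketched as a small in-memory structure. This is only an illustration, not the SocialScope implementation: the class and method names are hypothetical, and the pairing of ids with names (101 with John, 102 with Joe) is assumed for the example, since the figure does not survive in this transcript.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    label: str                                   # e.g. "people" or "destination"
    attrs: dict = field(default_factory=dict)    # structural attributes

@dataclass
class Edge:
    src: int
    dst: int
    label: str                                   # e.g. "friend", "visit"
    attrs: dict = field(default_factory=dict)

class SocialContentGraph:
    """Labeled nodes and labeled edges, both carrying attributes."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, label, **attrs):
        self.nodes[node_id] = Node(node_id, label, attrs)

    def add_edge(self, src, dst, label, **attrs):
        self.edges.append(Edge(src, dst, label, attrs))

    def neighbors(self, src, label):
        """Ids reachable from src over edges carrying the given label."""
        return [e.dst for e in self.edges if e.src == src and e.label == label]

# Illustrative data only; the id-to-name pairing is an assumption.
g = SocialContentGraph()
g.add_node(101, "people", name="John")
g.add_node(102, "people", name="Joe")
g.add_node(301, "destination", name="NYC", state="NY")
g.add_edge(101, 102, "friend")
g.add_edge(102, 301, "visit", tags=["fun", "club"])
```

Keeping relations (friend) and activities (visit, with tags as edge attributes) in one edge list mirrors the slide's point that both are just labeled edges.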


A Community-Centric Model

• Communities of users as the central concept
  – Community: a group of “similar” or “related” users
  – Recommendations are performed per community, and the results from each community are assembled at the end
  – A user can belong to multiple communities:
    • A vector-based representation: user:<c1:w1, c2:w2, ...>
• Challenge I: Community Generation
  – Topic based
  – Relationship based
  – Activity based
• Challenge II: Community Selection
  – Given a user session, determine the appropriate set of communities to work with


A Case Study in Yahoo! Travel on Topic-Based Communities


Data in Yahoo! Travel

• Users
  – Yahoo! users
• Destinations
  – Cities, attractions, etc.
• Tagging activities:
  – Destination tags: editorial + custom
  – Self tags: mostly custom
• Goal:
  – Given a user session (user + tags + region), recommend travel destinations to the user
• Approach:
  – Discover topics for users, tags, and destinations based on tagging actions
  – Construct communities based on topics


Topic Discovery through LDA

• Joint work with Jian Huang
• Intuition
  – Each destination is modeled as a document consisting of all the tags used on it
  – Tags are treated as words in the document
• Apply Latent Dirichlet Allocation [BNJ03]
  – A probabilistic generative model for document topic discovery
  – Produces the following conditional probabilities
    • Prob(w|t): given a latent topic, the probability that the tag word belongs to that topic
    • Prob(t|d): given a document, the probability of the document being about a certain latent topic
    • $\mathrm{Prob}(t|u) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Prob}(t|d_i)$, where u tagged $d_1, \ldots, d_k$
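The user-level topic distribution on this slide is an average of the per-destination distributions over the destinations the user tagged. A minimal sketch of that step follows, assuming Prob(t|d) has already been produced by some LDA implementation; the function name and the probability values are made up for illustration.

```python
def user_topic_distribution(doc_topic, tagged_docs):
    """Prob(t|u): average of Prob(t|d_i) over the k destinations u tagged."""
    k = len(tagged_docs)
    if k == 0:
        raise ValueError("user has not tagged any destination")
    num_topics = len(doc_topic[tagged_docs[0]])
    totals = [0.0] * num_topics
    for d in tagged_docs:
        for t, p in enumerate(doc_topic[d]):
            totals[t] += p
    return [s / k for s in totals]

# Hypothetical Prob(t|d) rows over four topics (not taken from the talk).
doc_topic = {
    "Paris":     [0.25, 0.0, 0.75, 0.0],
    "Las Vegas": [0.25, 0.0, 0.0, 0.75],
}
p_t_given_u = user_topic_distribution(doc_topic, ["Paris", "Las Vegas"])
```

A user who tagged both destinations ends up split between the two dominant topics, which is the behavior the averaging formula implies.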


Examples of Topic Discovery

Topic 1: Seaside/Water related

Tokens: beach, scuba, summer, diving, fishing, snorkeling, island, lake …

Top cities: San Diego, Honolulu, Chicago, Miami, Seattle …

Topic 3: Nightlife, City-life

Tokens: nightlife, gambling, wine, drinking, exciting, casino …

Top cities: Las Vegas, New York City, Los Angeles, San Francisco …

Topic 0: Romantic/Luxury

Tokens: romantic, 4-star, shopping, spa, golf, luxury, honeymoon …

Top cities: Cancun, San Juan, Grand Canyon, Bangkok …

Topic 2: Arts/Historical/Cultural

Tokens: architecture, sightseeing, art, history, culture, cathedral, castle …

Top cities: Paris, London, Boston, Istanbul, Amsterdam, Rome, Hong Kong …

[Photos of example cities: San Diego, Las Vegas, Paris, Cancun]


Community Generation

• Identify destinations that belong to the same topic
  – For each topic t, identify the destinations d that belong to t:
    $D(t) = \{\, d \mid \mathrm{Prob}(t|d) > \text{threshold} \,\}$
• Generate topical communities
  – $D(u) = \{\, d \mid \text{user } u \text{ visited destination } d \,\}$
  – Interests-based topical community
    $C_{interest}(t) = \{\, u \mid \frac{|D(t) \cap D(u)|}{|D(u)|} > \text{threshold} \,\}$
  – Expertise-based topical community
    $C_{expert}(t) = \{\, u \mid \frac{|D(t) \cap D(u)|}{|D(t)|} > \text{threshold} \,\}$

[Figure: Venn diagram of D(u) and D(t)]
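The two community definitions differ only in the denominator: the interests-based ratio is taken over the user's own destinations D(u), the expertise-based ratio over the topic's destinations D(t). A small sketch under that reading follows; the function names, thresholds, and data are illustrative assumptions.

```python
def topic_destinations(p_t_given_d, topic, threshold):
    """D(t): destinations whose Prob(t|d) exceeds the threshold."""
    return {d for d, probs in p_t_given_d.items() if probs[topic] > threshold}

def interest_community(visited, d_t, threshold):
    """Users u with |D(t) & D(u)| / |D(u)| above the threshold."""
    return {u for u, d_u in visited.items()
            if d_u and len(d_t & d_u) / len(d_u) > threshold}

def expert_community(visited, d_t, threshold):
    """Users u with |D(t) & D(u)| / |D(t)| above the threshold."""
    return {u for u, d_u in visited.items()
            if d_t and len(d_t & d_u) / len(d_t) > threshold}

# Hypothetical Prob(t|d) for two topics and hypothetical visit histories.
p_t_given_d = {
    "Paris": [0.1, 0.9], "Rome": [0.2, 0.8],
    "London": [0.3, 0.7], "Miami": [0.9, 0.1],
}
visited = {
    "amy": {"Paris"},                              # all of her trips are on-topic
    "joe": {"Paris", "Rome", "London", "Miami"},   # covers the whole topic
}
d_t = topic_destinations(p_t_given_d, topic=1, threshold=0.5)
interested = interest_community(visited, d_t, threshold=0.8)
experts = expert_community(visited, d_t, threshold=0.8)
```

With these numbers amy counts as interested but not expert, and joe as expert but not interested, which shows why the two communities can diverge.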


Community Selection

• Each user session consists of three pieces of information
  – (user, tag, region)
  – Region is treated as a Boolean constraint
  – We combine and convert user and tag into a single topical vector:
    $\mathrm{Prob}(t|(u,w)) = w_1 \cdot \mathrm{Prob}(t|u) + w_2 \cdot \frac{\mathrm{Prob}(w|t) \cdot \mathrm{Prob}(t)}{\mathrm{Prob}(w)}$
• A community C(t) is selected if Prob(t|(u,w)) is greater than a threshold
• The choice between interests-based and expertise-based communities is decided heuristically
  – User is an expert: leverage the interests-based community
  – User is not an expert: leverage the expertise-based community
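The selection rule above can be sketched as follows. The second term is Bayes' rule for Prob(t|w); computing Prob(w) by summing Prob(w|t)·Prob(t) over topics is an assumption of this sketch, as are the function names, the weights, and the sample probabilities.

```python
def select_topics(p_t_given_u, p_w_given_t, p_t, tag, w1, w2, threshold):
    """Return topics t whose combined score Prob(t|(u,w)) exceeds the threshold."""
    num_topics = len(p_t)
    # Assumption: Prob(w) via the law of total probability over topics.
    p_w = sum(p_w_given_t[t].get(tag, 0.0) * p_t[t] for t in range(num_topics))
    selected = []
    for t in range(num_topics):
        p_t_given_w = p_w_given_t[t].get(tag, 0.0) * p_t[t] / p_w if p_w > 0 else 0.0
        score = w1 * p_t_given_u[t] + w2 * p_t_given_w
        if score > threshold:
            selected.append(t)
    return selected

def community_kind(user_is_expert):
    """Heuristic from the slide: experts get interests-based communities, others get experts."""
    return "interest" if user_is_expert else "expert"

# Hypothetical two-topic model: only topic 0 ever emits the tag "beach".
selected = select_topics(
    p_t_given_u=[0.5, 0.5],
    p_w_given_t=[{"beach": 0.25}, {"museum": 0.25}],
    p_t=[0.5, 0.5],
    tag="beach", w1=0.5, w2=0.5, threshold=0.5,
)
```

Here the user's own profile is neutral, so the session tag alone tips the selection toward topic 0.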


Example Results


Querying Items within Social Network

• Problem Definition:
  – Given a user and a query, retrieve items relevant to the query based only on the user’s network
  – The score of an item depends on both the keyword and the user
  – E.g., score(u,q,i) = cnt(v), where tag(v,i,q) and friend(u,v), i.e., the number of u’s friends v who tagged item i with q
  – A common scenario on social tagging sites like del.icio.us
• Naïve Approach
  – One inverted index for each (user, keyword)
  – Apply top-k threshold algorithms at query time
  – Optimal efficiency
  – But high storage overhead
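The naïve approach can be sketched as one score-sorted posting list per (user, keyword) pair, where the score is the number of the user's friends who applied that tag to the item. The function name and the sample data are hypothetical, and taggings are assumed to be distinct (tagger, item, tag) triples.

```python
from collections import defaultdict

def build_per_user_index(friends, taggings):
    """index[(u, q)] -> list of (item, score), sorted by descending score."""
    counts = defaultdict(lambda: defaultdict(int))
    for u, network in friends.items():
        for v, item, tag in taggings:
            if v in network:                 # friend(u, v) and tag(v, item, tag)
                counts[(u, tag)][item] += 1
    return {key: sorted(items.items(), key=lambda kv: (-kv[1], kv[0]))
            for key, items in counts.items()}

# Illustrative network and tagging actions (not from the talk).
friends = {"jane": {"ann", "chris"}, "ann": {"jane"}}
taggings = {
    ("ann", "i1", "music"), ("chris", "i1", "music"),
    ("chris", "i2", "music"), ("jane", "i2", "music"),
}
index = build_per_user_index(friends, taggings)
top1_for_jane = index[("jane", "music")][:1]
```

For a single keyword the top-k answer is just a prefix of the list; multi-keyword queries are where a top-k threshold algorithm over several such lists would come in.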


Space Overhead of Per-User Indexing

[Figure: per-user inverted lists of (item, score) pairs, sorted by score, shown for users Jane and Ann under tag = music and tag = news.]

• Conservative Example:
  – 100K users, 1M items, 1K tags
  – 20 tags/item from 5% of the taggers
  – 10 bytes per inverted list entry
  – 1 Terabyte index!
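One reading of these numbers that reproduces the 1 Terabyte figure is: every item carries 20 tags, each contributed by 5% of the 100K users, and each resulting entry costs 10 bytes. The slide does not spell out the derivation, so the arithmetic below is an assumption about how the estimate was obtained.

```python
items = 1_000_000
tags_per_item = 20
users = 100_000
tagger_percent = 5
bytes_per_entry = 10

taggers = users * tagger_percent // 100        # 5,000 taggers per (item, tag)
entries = items * tags_per_item * taggers      # inverted-list entries
index_bytes = entries * bytes_per_entry        # total index size in bytes
index_terabytes = index_bytes / 10**12
```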


Clustering to the Rescue!

• Joint work with Michael Benedikt, Laks Lakshmanan, Julia Stoyanovich
• Clustered Approach
  – Group users into communities based on shared behavior
    • Shared items
    • Shared items with same tags
  – One inverted index for each (group, keyword)
  – Replace item score with score upper bound (among all users in the group)
  – Sacrifice a bit of query performance for indexing space efficiency
• Community-based (seeker clustering):
  – Groups are keyword-independent
  – E.g., shared items, shared tags
• Behavior-based (tagger clustering):
  – Groups are keyword-specific
  – E.g., Jane may be similar to John on “travel”, but very different from him on “sports”.
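The clustered approach can be sketched as merging the per-user posting lists of a cluster into a single list whose score for each item is the maximum over the cluster's members. This sketch covers only the simpler one-user-one-cluster case; the function name, cluster assignment, and scores are illustrative assumptions.

```python
def build_cluster_index(per_user_index, cluster_of):
    """Merge per-(user, keyword) lists into per-(cluster, keyword) upper-bound lists."""
    bounds = {}
    for (user, tag), postings in per_user_index.items():
        merged = bounds.setdefault((cluster_of[user], tag), {})
        for item, score in postings:
            # Upper bound: the highest score any member of the cluster has.
            merged[item] = max(merged.get(item, 0), score)
    return {key: sorted(m.items(), key=lambda kv: (-kv[1], kv[0]))
            for key, m in bounds.items()}

# Hypothetical per-user lists for one keyword.
per_user_index = {
    ("jane", "music"): [("i1", 73), ("i2", 65)],
    ("ann", "music"):  [("i5", 53), ("i2", 30)],
}
cluster_of = {"jane": "c1", "ann": "c1"}
cluster_index = build_cluster_index(per_user_index, cluster_of)
```

Two lists collapse into one, which is where the space saving comes from. Because the stored values are only upper bounds, a query processor would still have to compute a seeker's exact score for each candidate it visits, which is one way to read the query-time cost mentioned on the slide.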


User Clustering

[Figure: users Jane, Chris, Sarah, and Ann grouped into clusters under each scheme]

• Community-based
  – One user, one cluster
  – One inverted index per cluster
  – Upper-bound score per item
• Behavior-based
  – One user, many clusters
  – One inverted index per cluster
  – Upper-bound score per item


Community-based Clustering: Space


Community-based Clustering: Performance


Information Presentation

• A broad definition:
  Given the set of relevant items, identify the appropriate subset of items to be shown to the user(s) in the right organization and with the right amount of information.
• Appropriate Subset:
  – Timeliness
  – Geographical closeness
  – Diversity
  – Etc.
• Right Organization:
  – Ranking
  – Grouping
  – Faceted Navigation
• Right Amount of Information
  – Explanation


A Case Study in Recommendation

• Joint work with Laks Lakshmanan

• While relevance is important to recommendation, other criteria are critical too:
  – Novelty: avoid returning results that users are likely to know already.
  – Serendipity: aim to return less relevant results that might give users a pleasant surprise.
  – Diversity: avoid returning results that are too similar to each other.
• Recommendation Diversification:
  From the pool of candidate items, identify a list of items that are dissimilar to each other while maintaining a high cumulative relevance, i.e., strike a good balance between relevance and diversity.


Existing Solutions for Diversification

• Attribute-Based Diversification
  – Diversity semantics: pair-wise distance functions based on item attributes (e.g., movie attributes).
  – Combining with relevance:
    • Threshold either relevance or distance, maximize the other
    • Optimize an overall score as a weighted combination of relevance and distance
  – Algorithm
    • Perform traditional recommendation
    • Obtain the attributes of each candidate item and compute the pair-wise distance
    • Apply ad-hoc methods that depend on how diversity and relevance are combined
• A Major Problem:
  – Lack of attributes suitable for estimating distance between pairs of items: e.g., URLs in del.icio.us, photos on Flickr, videos on Vimeo.


Explanation-Based Diversification

• Intuition
  – An explanation is the set of objects because of which a particular item is recommended to the user.
  – Two items that share similar explanations are likely to be similar to each other.
• Explanation for Item-Based Strategies
• Explanation for Collaborative Filtering Strategies (social!)


Explanation-Based Diversity

• Pair-wise diversity distance between two recommended items
  – Standard similarity measures like Jaccard similarity and cosine similarity
  – E.g. (Distance based on Jaccard similarity)
• Diversity for the set of recommended results (S)
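The slide's formulas do not survive in this transcript, so the following is only a sketch of the idea under stated assumptions: the pair-wise distance is taken as one minus the Jaccard similarity of the two explanation sets, and the diversity of a result set S is taken as the average pair-wise distance. The averaging choice, the function names, and the data are assumptions, not the authors' definitions.

```python
from itertools import combinations

def jaccard_distance(expl_a, expl_b):
    """1 - |A & B| / |A | B| over two explanation sets."""
    union = expl_a | expl_b
    if not union:
        return 0.0
    return 1.0 - len(expl_a & expl_b) / len(union)

def set_diversity(items, explanations):
    """Assumed aggregate: mean pair-wise explanation distance over the set."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_distance(explanations[a], explanations[b]) for a, b in pairs)
    return total / len(pairs)

# Hypothetical explanations: the users whose activity led to each recommendation.
explanations = {
    "url1": {"ann", "chris"},
    "url2": {"ann", "chris"},
    "url3": {"sarah"},
}
same = jaccard_distance(explanations["url1"], explanations["url2"])
different = jaccard_distance(explanations["url1"], explanations["url3"])
diversity = set_diversity(["url1", "url2", "url3"], explanations)
```

Note that no item attributes are used anywhere: two URLs recommended because of the same users are treated as similar, which is what makes the approach applicable to attribute-poor items.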


Benefits and Practicality of Explanation-Based Diversification

• Applicable to items without attributes or whose attributes are difficult to analyze
  – Common on social content sites
• Explanations are by-products of many recommendation processes
  – They can be maintained with little overhead

See our poster on Monday for details


SocialScope Platform

[Figure: SocialScope architecture. External sources (Facebook, Y! Sports, Y! IM) connect through the OpenSocial API. Labeled layers: Content Management, Information Discovery, Information Presentation, and User Interface. Labeled components: Content Integrator, Activity Manager, Data Manager, Social Content Graph, Social Content Analyzer, Social Query Evaluator, Social Content Grouper, Social Content Ranker, and Social Content Admin. Users exchange queries, results, and activities with the platform.]

Important Information Flow
(1) Raw/derived social nodes and links
(2) Social content sub-graph
(3) Relevant social content sub-graph
(4) Relevant social nodes and links


A Graph Based Logical Algebra Framework

• Designed for information discovery on social content sites

• Aims to provide a declarative way of specifying analysis and query tasks
  – Uniformity and flexibility
  – Opportunities for performance optimization
• Basic operators:
  – Node Selection (σN), Link Selection (σL)
  – Composition, Semi-Join
  – Node Aggregation, Link Aggregation
  – Details in [CIDR09]


A Simple Search Task

[Figure: algebra expression for the search task, built up in steps]

• John’s friends
• People who visited Denver
• John’s friends who visited Denver
• Their activities
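The four steps of this task can be mimicked with plain Python sets over a list of labeled links. This is a stand-in for illustration, not the SocialScope operators themselves: `select_links` loosely plays the role of link selection, the set intersection loosely plays the role of a semi-join, and all names and data are hypothetical.

```python
# Each link is (source, label, target); illustrative data only.
links = [
    ("John", "friend", "Joe"),
    ("John", "friend", "Amy"),
    ("Joe", "visit", "Denver"),
    ("Amy", "visit", "NYC"),
    ("Chris", "visit", "Denver"),
    ("Joe", "tag", "Denver"),
]

def select_links(links, label=None, src=None, dst=None):
    """Keep links matching every condition that is not None."""
    return [l for l in links
            if (label is None or l[1] == label)
            and (src is None or l[0] == src)
            and (dst is None or l[2] == dst)]

# Step 1: John's friends.
johns_friends = {l[2] for l in select_links(links, label="friend", src="John")}
# Step 2: people who visited Denver.
denver_visitors = {l[0] for l in select_links(links, label="visit", dst="Denver")}
# Step 3: John's friends who visited Denver.
friends_in_denver = johns_friends & denver_visitors
# Step 4: their activities (every non-friendship link they originate).
their_activities = [l for l in links
                    if l[0] in friends_in_denver and l[1] != "friend"]
```

Writing the task as a composition of small selections is what gives a declarative algebra room to reorder steps for performance, for example by evaluating the more selective selection first.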


Jelly: A Language Over Social Content Sites

• Designed with a focus on community-centric information exploration applications
  – Most useful applications
• A restricted implementation of the SocialScope algebra
  – Based on nested relation model, instead of full graph model
• Built-in primitives for topic and community generation
  – Topic Generation, Community Extraction
  – Recommendation Generation
  – Group Generation, Explanation Generation


A Simple (Incomplete) Example

• Topic Generation and Community Extraction

    generate topics for item into topics
    from tagging R
    using LDA (seed, th=0.8)
    seed R.item group R.tag weight-with count()

    generate communities into experts
    from topics T, tagging R
    where T.*.item = R.item
    using jaccard-similarity (seed, th=0.7)
    seed (R.user, T.topic) list R.item

• Recommendation Generation

    generate recommendations into candidates
    given user u, query q
    from experts T
    where Selected (T.topic, u, q)
    using count-users (seed)
    seed T.*.item list T.user

• Information Presentation

    generate explanations into results
    from candidates C, tagging T
    where C.item = T.item
    using identity (seed)
    seed C.item list T.user weight-with count()


Scalability Challenges

• Social content graphs are large
  – Tens of millions of users
  – Millions (movies, destinations) or billions (web links) of items
  – Challenge 1: being able to analyze the graph efficiently
    • We need to design Map-Reduce frameworks that are suitable for those analyses
• Activities are being generated at a very high rate
  – Millions of status updates per day (Facebook, Twitter)
  – A similar volume of content activities (browsing, tagging, etc.)
  – Challenge 2: being able to incrementally incorporate new activities into the existing model


Semantic Challenges and Questions

• Connections are multi-dimensional
  – Two main categories on the surface: explicit (friendship) vs. implicit (shared activities)
  – However: not all friendship links are the same and not all activities are the same
  – Challenge: for a given user on a given topic, identify the appropriate set of explicit and implicit links
    • Expert-based versus interest-based communities is one step toward that goal
• Temporal, geographical, and other external influences
  – How does a community evolve as time goes on and as members move around?
• Social graph interactions
  – How does one social graph (say Facebook) impact another (say MySpace)?


Conclusion

• The emerging trend of social content sites presents many challenges
  – Social information integration
  – Information discovery
  – Information presentation
• The notions of communities and topics are crucial for effective and efficient information exploration on those sites
  – Community discovery and analysis
  – Community-based information exploration

• Lots more!

• At Yahoo!, we are focusing on building systems that can address those challenges


Questions?