Challenges and Opportunities in Building Personalized Online Content Aggregators Ka Cheung Sia Adviser: Prof. Junghoo Cho Oral Defense January 12 2009

Outline

• Emergence of Web 2.0
• Online content aggregators
• Challenges and opportunities
  • RSS monitoring
  • Personalized recommendations
  • Social annotations
• Conclusion

Web 1.0

• A few professional content creators
  • News
  • Corporate sites
  • Portals
• One-way consumption of information

Web 2.0

• Facilitators of content sharing
  • Wikipedia
  • Blogs
  • Media file sharing
  • Discussion groups
• Everyone can publish content easily
• Handheld devices and innovative online applications
• Being Web 2.0 publishers

Growth of UGC / blogs

• A 2007 study
  • Professional content: 2 GB / day
  • UGC: 8-10 GB / day
• Bloglines.com
  • 26% of users have >30 subscriptions
• TIME Person of the Year 2006

RSS

• Really Simple Syndication
  • XML
  • Contains the 10-15 latest posts
• Machine readable
  • Datetime of publication
  • Title / content
  • Permalink
• Subscription
  • RSS reader
  • Personalized homepage
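The machine-readable fields listed above can be illustrated with a minimal sketch that parses a feed using Python's standard library. The sample feed, its URLs, and the function name `parse_items` are hypothetical and only serve the illustration; real feeds vary in which elements they provide.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical two-item RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Second post</title>
      <link>http://example.org/posts/2</link>
      <pubDate>Mon, 12 Jan 2009 10:30:00 GMT</pubDate>
      <description>Newer entry.</description>
    </item>
    <item>
      <title>First post</title>
      <link>http://example.org/posts/1</link>
      <pubDate>Sun, 11 Jan 2009 08:00:00 GMT</pubDate>
      <description>Older entry.</description>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict per <item>: title, permalink, publication time, content."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "permalink": item.findtext("link"),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
            "content": item.findtext("description"),
        })
    return items


items = parse_items(SAMPLE_FEED)
```

An aggregator would run this kind of extraction on every retrieved feed and keep only the items it has not seen before.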

How does RSS help readers?

• Without RSS: visit different URLs
• With RSS: centralized access

RSS usage

• High usage but low awareness
  • 27% consume
  • 4% aware
• Common usage
  • News feeds
  • Podcasting
  • My MSN / My Yahoo! / etc.
  • Google Reader / Bloglines
  • Indexing blogs
• Time-sensitive content

Source: “RSS – Crossing into the Mainstream”, Yahoo! white paper by Joshua Grossnickle, Oct 2005

Online content aggregator

• Centralized access to subscribed content in executive summary style
• Leverage collaborative filtering
• Ubiquitous access
• Collect useful social annotation data

Online content aggregator (Google Reader example)

• Subscription list
• Newly updated articles (Chapters 2 & 4)
• Recommendations (Chapter 3)
• Social annotations (Chapter 5)

Challenges and opportunities

• How to deliver up-to-date content?
  • New articles are posted quickly, with recurring patterns
  • Significance of articles deteriorates quickly over time
• How to provide better personalization?
  • Ranking articles/topics based on user interest
  • Efficient computation to handle a large number of users
• What is the knowledge in Web 2.0 data?
  • Improve Web resource categorization
  • Vocabulary usage

Outline

• Emergence of Web 2.0
• Online content aggregator
• Challenges and opportunities
  • RSS monitoring
    • How to deliver “fresh” content
  • Providing better personalization
  • Web 2.0 knowledge mining
• Conclusion

The retrieval problem

• Research problem in proxies, search engines, …
  • Source cooperativeness [DKP01, OW02]
  • Priority of different content [CG03a]
  • Resource constraints
  • User satisfaction [PO05, WSY02]
  • Politeness issues, …

[Figure: data source → (retrieval) → aggregator → (deliver) → user]

Metrics

Evaluation at time $u_1$

• Freshness: 0
• Age: $u_1 - t_4$
• Delay: $(\tau_1 - t_1) + (\tau_1 - t_2) + (\tau_1 - t_3)$
• Miss-penalty: 2

Push vs. Pull
• Push: all updates are known (e.g. RSS ping services)
• Pull: future updates are estimated
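The delay and miss-penalty metrics can be sketched in a few lines. This is an illustrative reading, not the thesis code: delay is assumed to be the time each posting waits until the next retrieval, and miss-penalty the number of postings already published but not yet retrieved when the user accesses the aggregator. The function names and the sample times are hypothetical.

```python
from bisect import bisect_left, bisect_right


def total_delay(post_times, retrieval_times):
    """Sum over postings of (first retrieval at or after the posting) - (posting time)."""
    retrievals = sorted(retrieval_times)
    delay = 0.0
    for t in post_times:
        j = bisect_left(retrievals, t)  # index of first retrieval >= t
        if j < len(retrievals):         # postings never retrieved are skipped
            delay += retrievals[j] - t
    return delay


def miss_penalty(post_times, retrieval_times, access_time):
    """Number of postings published by access_time but after the last retrieval."""
    retrievals = sorted(retrieval_times)
    j = bisect_right(retrievals, access_time)
    last_retrieval = retrievals[j - 1] if j > 0 else float("-inf")
    return sum(1 for t in post_times if last_retrieval < t <= access_time)


# Three postings before the single retrieval at time 4, two after it.
posts = [1, 2, 3, 5, 6]
delay = total_delay(posts, [4])       # (4-1) + (4-2) + (4-3)
missed = miss_penalty(posts, [4], 7)  # postings at 5 and 6 are missed
```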

Refined model

• Commonly used Web page change model
  • Homogeneous Poisson model
  • λ(t) = λ at any t
• RSS content is updated more frequently, with a recurring pattern
  • Periodic inhomogeneous Poisson model
  • λ(t) = λ(t − nT), n = 1, 2, …, where T is the period
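A periodic rate λ(t) can be estimated by folding observed posting times onto one period and counting posts per bin. The sketch below assumes a 24-hour period split into hourly bins; the function name, the binning choice, and the sample timestamps are illustrative assumptions rather than the estimator used in the thesis.

```python
def estimate_periodic_rate(post_times, n_periods, period=24.0, bins=24):
    """Estimate lambda(t) over one period as posts per unit time in each bin.

    post_times: posting times in the same unit as `period` (hours here).
    n_periods:  number of full periods covered by the observations.
    """
    width = period / bins
    counts = [0] * bins
    for t in post_times:
        phase = t % period  # fold the timestamp onto a single period
        counts[min(int(phase / width), bins - 1)] += 1
    return [c / (n_periods * width) for c in counts]


# Four posts observed over three days, all around 9 o'clock.
rates = estimate_periodic_rate([9.0, 9.5, 33.0, 57.2], n_periods=3)
```

With these inputs the 9:00-10:00 bin gets a rate of 4/3 posts per hour and every other bin gets zero, which is the kind of recurring daily pattern the periodic model is meant to capture.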

Optimization problem

• Resource allocation
  • How often to contact a data source?
  • O1 is more active and has more subscribers than O2; how much more often should we contact O1?
• Retrieval scheduling
  • When to contact a data source?
  • Given 2 retrievals allocated for O1, when should we retrieve from it?
  • Both in the morning, or one in the morning and one at night?

$m_i \propto \sqrt{\lambda_i w_i}$

Retrieval schedule intuition

• t=1: no postings missed
• t=0 or 2: all postings (in the same period) missed

Necessary optimal condition

Given λ(t) and u(t), find the schedule of τj’s that minimizes delay / miss

• Delay: schedule right after a large number of new posts
• Miss-penalty: schedule right before a lot of user accesses
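The delay intuition can be checked with a small brute-force search. This is only a sketch under simplifying assumptions: time is discretized into slots of one period, `rates[h]` is the expected number of posts in slot h, a retrieval in slot τ picks up posts from that same slot with zero wait, and the schedule repeats every period. The thesis derives an analytical optimality condition instead; the function names here are hypothetical.

```python
from itertools import combinations


def expected_delay(rates, schedule):
    """Expected total delay per period for retrievals at the given slots."""
    n = len(rates)
    total = 0.0
    for h, rate in enumerate(rates):
        # Slots until the next retrieval, wrapping around the period.
        wait = min((tau - h) % n for tau in schedule)
        total += rate * wait
    return total


def best_schedule(rates, m):
    """Exhaustively pick the m retrieval slots that minimize expected delay."""
    n = len(rates)
    return min(combinations(range(n), m),
               key=lambda s: expected_delay(rates, s))


# Eight slots per period: a burst of posts in slot 2, a few in slot 6.
rates = [0, 0, 5, 0, 0, 0, 1, 0]
one = best_schedule(rates, 1)
two = best_schedule(rates, 2)
```

With one retrieval the search places it at the burst (slot 2); with two it covers both active slots. Exhaustive search is only practical for tiny examples like this one.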

Performance

• Reduces miss by 33% compared to CGM03 for 1 retrieval per day
• Reduces miss by a further 20% when the user access pattern is considered

Summary

• Better RSS content update model
• Significantly improves “content freshness” under the same resource constraint
• Analysis of typical posting patterns and access patterns

“Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho and Hyun-Kyu Cho, in IEEE TKDE 2007

“Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng in ICWSM 2007

Outline

• Emergence of Web 2.0
• Online content aggregator
• Challenges and opportunities
  • RSS monitoring
  • Providing better personalization
    • Ranking articles/topics based on user interest
    • Efficient computation to support a large number of users
  • Social annotations
• Conclusion

Learning user profile

• Users are reluctant to indicate their interest
• Cold-start problem
• Diversified recommendations [ZMK05]
• Drift of user interest [WPB01]
• Relevance feedback [Eft00, KDF05]
• Goal: improve relevance of recommendations (click utility)

[Figure: learning process, a loop between recommendations and feedback]

Ranking model 1

• Assumptions
  • K predefined topics
  • Every recommendation item belongs to one topic
• User profile: Θi = Pr(click | read, topic i)
  • Θi is modeled as a draw from a beta distribution with parameters α, β and is estimated by its mean α/(α+β)

Topic   1   2   3   4   5   6
α       2   0   2   5   3   2
β      10   0   1   0   3   1

Ranking model 2

• Ranking bias: g(j) = Pr(read | j)
  • Read probability decreases with rank
  • Borrowed from Web search studies
• Utility function: U(R; Θ)
  • R: ranking of topics
  • Articles belonging to the same topic are chosen randomly

Ranking topics

• Updating the posterior distribution after each iteration
  • Not clicked: βnew = βold + g(ri)
  • Clicked: αnew = αold + 1
• Ranking function of topics
  • Exploitation + λ*exploration
  • Mean + λ*variance:

    $$\frac{\alpha}{\alpha+\beta} + \lambda\,\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

• Example (λ=1)
  • α=2, β=2: ranking 0.55
  • α=5, β=5: ranking 0.52
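The update and ranking rules above can be sketched directly, using the standard mean and variance of a beta distribution. The function names are hypothetical, and the sketch assumes α + β > 0 (a topic with α = β = 0 would need a prior before it can be scored).

```python
def ranking_score(alpha, beta, lam=1.0):
    """Exploitation (beta mean) plus lam times exploration (beta variance)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean + lam * variance


def update(alpha, beta, clicked, read_prob):
    """Posterior update after showing one item of the topic.

    read_prob is g(r_i), the probability that the item at rank r_i was read.
    """
    if clicked:
        return alpha + 1, beta
    return alpha, beta + read_prob


score_a = ranking_score(2, 2)  # fewer observations, larger variance bonus
score_b = ranking_score(5, 5)  # same mean, smaller variance bonus
```

Both topics have the same estimated click probability of 0.5, but the less-explored one scores higher, which is how the variance term drives exploration.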

Simulation

Click utility improves in the long run

Adapts to drift of interest

More accurate estimation of user interest Θ

User studies

• 10 users from UCLA and NEC
• 45 categories from dmoz.org
  • Arts/Architecture
  • Computers/E-books
  • Science/Biology
  • …
• Survey of user interest before the experiment
• 7 articles (Web pages) per iteration
• 3 strategies interleaved

[Figures: first 25 iterations; drifted at 25th iteration]

Summary

• Learning framework
  • Exploitation: recommend items the user is interested in
  • Exploration: explore the user’s other potential interests
• Proven to improve click utility and adapt to drift of user interest

“Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007

Outline

• Emergence of Web 2.0
• Online content aggregator
• Challenges and opportunities
  • RSS monitoring
  • Providing better personalization
    • Ranking articles/topics based on user interest
    • Efficient computation to support a large number of users
  • Social annotations
• Conclusion

Aggregation as recommendation

User-generated content in the blogosphere and Web 2.0 services contains rich information about recent events

Aggregation of individual opinions often shows interesting popular topics

Personal recommendation

Items (phrases): Dark Knight, Olympics, Michael Phelps, WALL-E, Las Vegas

RSS sources:
• “Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!”
• “Um.. it will be good if there is a free show of Dark Knight and WALL-E”
• “Michael Phelps performance in Olympics is awesome...”
• “Finished watching Michael Phelps in Olympics, let me watch the WALL-E DVD...”

Matrix formulation

• Reference matrix (E): the number of times a blogger mentions a phrase/link in his blog posts
• Subscription matrix (T): how often a user reads a blog
• Personalized score (TE)

E       o1    o2    o3
b1      3     2     0
b2      0     3     0
b3      1     0     1
b4      1     2     3
Total   5     7     4

T       b1    b2    b3    b4
u1      0.8   0.8   0     0
u2      0.2   0.2   0.6   0.6
u3      0     0     0.5   0.5

TE      o1    o2    o3
u1      2.4   4.0   0.0
u2      1.8   2.2   2.4
u3      1.0   1.0   2.0
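The personalized score is a plain matrix product, which can be checked on the example values from this slide. The sketch uses nested lists and a small hypothetical `matmul` helper so that it needs no third-party library; a real system would not materialize dense matrices at this scale.

```python
# Reference matrix E: rows b1..b4, columns o1..o3 (values from the slide).
E = [
    [3, 2, 0],
    [0, 3, 0],
    [1, 0, 1],
    [1, 2, 3],
]

# Subscription matrix T: rows u1..u3, columns b1..b4 (values from the slide).
T = [
    [0.8, 0.8, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.6],
    [0.0, 0.0, 0.5, 0.5],
]


def matmul(a, b):
    """Dense matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Personalized scores: rows u1..u3, columns o1..o3.
TE = matmul(T, E)
rounded = [[round(x, 1) for x in row] for row in TE]
```

Sorting each user's row in descending order then gives that user's recommended items.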

Database operation of matrix

• Reference (rss-id, item, score)
  • <b1, o1, 3>, <b1, o2, 2>, <b2, o2, 3>, …
  • Grows over time
• Subscription (user-id, rss-id, score)
  • <u1, b1, 0.8>, <u1, b2, 0.8>, <u2, b1, 0.2>, …
  • Relatively stable

T       b1    b2    b3    b4
u1      0.8   0.8   0     0
u2      0.2   0.2   0.6   0.6
u3      0     0     0.5   0.5

E       o1    o2    o3
b1      3     2     0
b2      0     3     0
b3      1     0     1
b4      1     2     3

Baselines

• Aggregate query

    SELECT e.item, SUM(t.score * e.score) AS p_score
    FROM Endorsement e, Trust t
    WHERE e.`blog-id` = t.`blog-id`
      AND t.`user-id` = <user id>
    GROUP BY e.item
    ORDER BY p_score DESC
    LIMIT 20

• On-the-fly (OTF)
• View
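The on-the-fly baseline can be tried on the small example from the earlier slides with an in-memory SQLite database. This is a sketch, not the thesis setup (which used MySQL): the table and column names with underscores and the helper `top_items` are assumptions made so the statement runs unquoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE endorsement (blog_id TEXT, item TEXT, score REAL);
    CREATE TABLE trust (user_id TEXT, blog_id TEXT, score REAL);
""")

# Non-zero entries of the example matrices E and T.
conn.executemany("INSERT INTO endorsement VALUES (?, ?, ?)", [
    ("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3),
    ("b3", "o1", 1), ("b3", "o3", 1),
    ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3),
])
conn.executemany("INSERT INTO trust VALUES (?, ?, ?)", [
    ("u1", "b1", 0.8), ("u1", "b2", 0.8),
    ("u2", "b1", 0.2), ("u2", "b2", 0.2), ("u2", "b3", 0.6), ("u2", "b4", 0.6),
    ("u3", "b3", 0.5), ("u3", "b4", 0.5),
])


def top_items(conn, user_id, k=20):
    """Run the aggregate query for one user and return (item, score) rows."""
    return conn.execute(
        """
        SELECT e.item, SUM(t.score * e.score) AS p_score
        FROM endorsement e JOIN trust t ON e.blog_id = t.blog_id
        WHERE t.user_id = ?
        GROUP BY e.item
        ORDER BY p_score DESC
        LIMIT ?
        """,
        (user_id, k),
    ).fetchall()


u1_items = [item for item, _ in top_items(conn, "u1")]
u2_items = [item for item, _ in top_items(conn, "u2")]
```

Every request joins and aggregates over all of the user's subscriptions, which is the query cost the later slides set out to reduce.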

Two stage computation

• Support a large number of users and RSS sources
  • OTF: high query cost
  • VIEW: high update cost
• Identify “template” users
  • Users often share similar reading interests
  • Example: template users interested in sports / politics / technologies / …
• Results are pre-computed and then combined in two stages

Discover user groups by NMF

• Decompose subscription matrix T into sub-matrices W and H
  • Non-negative matrix factorization (NMF) [Hoy04]
  • W: [individual users : template users] relationship
  • H: [template users : blogs] relationship
• Example: user 2’s subscription vector is expressed as a linear combination of two template users
• NMF as an approximation of the original subscription matrix
  • Accurate
  • Sparse
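To make the decomposition T ≈ WH concrete, here is a minimal sketch using the classic multiplicative-update rules for NMF. Note the swap: the slides cite Hoyer's NMF with sparseness constraints [Hoy04], whereas this sketch uses the simpler unconstrained updates, purely for illustration. The helper names, iteration count, and random seed are assumptions.

```python
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(T, rank, iters=300, seed=0, eps=1e-9):
    """Factor non-negative T (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(T), len(T[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T T) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, T), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (T H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(T, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H


# Example subscription matrix (users u1..u3 by blogs b1..b4).
T = [
    [0.8, 0.8, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.6],
    [0.0, 0.0, 0.5, 0.5],
]
W, H = nmf(T, rank=2)
approx = matmul(W, H)
error = sum((T[i][j] - approx[i][j]) ** 2 for i in range(3) for j in range(4))
```

Here each row of H plays the role of a template user's subscription profile, and each row of W gives the weights that express an individual user as a combination of template users.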

Reconstruction of results

• Personalized scores of template users are pre-computed
  • (HE) is maintained as sorted lists for template users
• W*(HE) becomes the personalized scores of all users
  • Computed using the Threshold Algorithm [FLN01]
• Top-K list
  • (HE) are sorted lists
  • W*(HE) is a weighted linear combination
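The query stage can be sketched as a simplified Threshold Algorithm over the template users' sorted lists. The sketch assumes non-negative weights (so the weighted sum is monotone), treats an item missing from a list as scoring 0 there, and uses in-memory dictionaries for random access; the function name and the sample lists are hypothetical.

```python
import heapq


def threshold_topk(sorted_lists, weights, k):
    """Top-k items by weighted sum of per-list scores.

    sorted_lists[j]: (item, score) pairs for template user j, sorted by
                     descending score.
    weights[j]:      non-negative weight of template user j for this user.
    """
    lookup = [dict(lst) for lst in sorted_lists]  # random access by item

    def full_score(item):
        return sum(w * lk.get(item, 0.0) for w, lk in zip(weights, lookup))

    seen = {}
    for depth in range(max(len(lst) for lst in sorted_lists)):
        threshold = 0.0
        for w, lst in zip(weights, sorted_lists):
            if depth < len(lst):  # exhausted lists contribute 0
                item, score = lst[depth]
                threshold += w * score
                if item not in seen:
                    seen[item] = full_score(item)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen item can score above the threshold, so we may stop early.
        if len(top) >= k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Two template users with pre-computed, sorted scores (rows of HE).
template_1 = [("o2", 4.0), ("o1", 2.4)]
template_2 = [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)]
# A user expressed as 0.25 * template_1 + 1.2 * template_2 (a row of W).
top2 = threshold_topk([template_1, template_2], [0.25, 1.2], k=2)
```

In this example the algorithm stops after reading two entries from each list, without scanning the third entry of the second list.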

Experiments

• Bloglines.com: online RSS reader
• Subscription matrix T: (0 or 1) subscription profile
  • 91k users
  • 487k feeds
• Reference matrix E: blog-keyword occurrence
  • Feed content collected between Nov 2006 and Jul 2007
  • Top 20 nouns with the highest tf-idf in each post are selected as keywords
• Platform
  • Python implementation of the proposed method
  • MySQL server on Linux with data residing on RAID

The difference by personalization

• Week of Jan 7 – Jan 13, 2007
  • Major event: iPhone released
• 3 users with a large number of subscriptions
• Distinct difference between top-20 recommended words
  • Among users – 1.13
  • Between users and global – 1.12

Recommended words, 2007-01-07 to 2007-01-13:

Global        User 90439   User 90550   User 91017
sales         cattle       brazil       yorker
iphone        beef         iguazu       iraq
apple         iphone       reuters      bush
manager       chicago      search       president
iraq          iraq         vegas        views
management    bush         argentina    avenue
development   apple        kibbutz      dept
software      companies    video        troops
business      prices       cathartik    saddam
phone         quarter      google       iran

Efficiency of proposed method

• Update cost
  • OTF (222K) < NMF (3.2M) < VIEW (23.6M)
• Query response time
  • Averaged over the 1000 users with the highest number of subscriptions
  • OTF: execute SQL query directly on MySQL server
  • NMF: Python implementation that interfaces with MySQL server
• Average query response time reduced by 75%; outliers with significant delay eliminated
• 70% approximation

Method   avg     std     max      min
OTF      2.05s   3.60s   84.42s   0.037s
NMF      0.46s   0.53s   2.84s    0.007s

Summary

• Provide personalized recommendation by selective aggregation
• Proposed matrix model for personalized aggregation
  • Optimization by NMF & Threshold Algorithm
• Real-life dataset study shows query response time can be reduced significantly with acceptable approximation accuracy

“Efficient Computation of Personal Aggregation Queries on Blogs”, with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008

Outline

• Emergence of Web 2.0
• Online content aggregator
• Challenges and opportunities
  • RSS monitoring
  • Providing better personalization
  • Social annotations
    • Vocabulary usage → effective advertising keyword selection
• Conclusion

Social annotations

• Bookmark tags, video/picture annotations, article tags
  • Evolving vocabularies (itouch, wow, w00t, …)
  • Emoticons (>_<, Orz, …)
  • Intensive human effort
• Latent Dirichlet Allocation [BNJ03]
  • Recover hidden topics z’s
  • Represent words p(z|w) and documents p(z|d) as distributions over hidden topics
• Improving information retrieval
  • Web document retrieval [WZY06, ZBZ08]
  • Social tagging usability [CM07]

[Figure: users (u), tags (w), documents (d)]

Topic categorization

Desired properties of effective advertising keywords

• Specific
  • Reach target audience
  • e.g. automobiles > ford, good > programming
• Emerging
  • Developing vs. stable
  • Easier to attract user attention
• Time-(in)sensitive
  • Context changes over time
  • Watch for change in target audience

How can these properties be learned from social annotations collected in aggregators?

Emerging

• Words correspond to emerging topics
  • Users actively explore new pages and annotate evenly on different pages
• Examples (between December 2007 and March 2008):
  • rails2.0 (Ruby on Rails web app framework)
  • kindle (Amazon ebook)
  • itouch (unofficial nickname of iPod touch)
  • eeepc (subnotebook by Asus)
  • obama (Barack Obama)
  • jailbreak (Apple iPhone crack software)
• Change of entropy

[Figure: emerging vs. stable]

Effective advertising keyword classifier

• 10+ features extracted from social annotations for each word
• User study performed on Amazon Mechanical Turk
• 10-fold cross-validation on different classifiers
  • SVM: 70.3%
  • Logistic regression: 69.8%
  • C4.5: 73.3%
  • Random forest: 73.3%
  • K-nn: 67.3%
  • Back-propagation neural nets: 63.9%
  • Naïve Bayes: 59.9%
  • Best-5 combined: 73.8%

Summary

• Leverage social annotations collected from online content aggregator users
• Social annotations differ significantly from general text corpora
  • New metrics / features
  • Usage in online advertising

“Exploring Social Annotations for Effective Advertising Keyword Selection”, with Junghoo Cho, work in progress

Conclusion

• Web 2.0 phenomenon
  • More content sharing and diverse interests
• Personalized online content aggregator
  • Easier access to different information sources
  • Deliver up-to-date content
  • Deliver better personalized recommendations
  • Leverage human effort collected in the aggregator

References

[CG03a] Junghoo Cho and Hector Garcia-Molina. “Effective Page Refresh Policies for Web Crawlers.” ACM TODS 28(4), 2003.
[DKP01] Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, Krithi Ramamritham, and Prashant Shenoy. “Adaptive Push-Pull: Disseminating Dynamic Web Data.” WWW 2001.
[OW02] Chris Olston and Jennifer Widom. “Best-Effort Cache Synchronization with Source Cooperation.” SIGMOD 2002.
[PO05] Sandeep Pandey and Christopher Olston. “User-Centric Web Crawling.” WWW 2005.
[WSY02] J.L. Wolf, M.S. Squillante, P.S. Yu, J. Sethuraman, and L. Ozsen. “Optimal Crawling Strategies for Web Search Engines.” WWW 2002.
[FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. “Optimal Aggregation Algorithms for Middleware.” PODS 2001.
[Hoy04] Patrik Hoyer. “Non-negative Matrix Factorization with Sparseness Constraints.” Journal of Machine Learning Research, 5:1457-1469, 2004.
[LWL07] Chengkai Li, Ming Wang, Lipyeow Lim, Haixun Wang, and Kevin Chen-Chuan Chang. “Supporting Ranking and Clustering as Generalized Order-By and Group-By.” SIGMOD 2007.
[PP07] Seung-Taek Park and David Pennock. “Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing.” SIGKDD 2007.

References

[Eft00] E.N. Efthimiadis. “Interactive Query Expansion: A User-based Evaluation in a Relevance Feedback Environment.” JASIS 51(11), 2000.
[KDF05] Diane Kelly, Vijay Deepak Dollu, and Xin Fu. “The Loquacious User: A Document-Independent Source of Terms for Query Expansion.” SIGIR 2005.
[WPB01] Geoffrey I. Webb, Michael J. Pazzani, and Daniel Billsus. “Machine Learning for User Modeling.” User Modeling and User-Adapted Interaction, 11(1-2):19-29, 2001.
[ZMK05] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. “Improving Recommendation Lists Through Topic Diversification.” WWW 2005.
[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3:993-1022, 2003.
[CM07] Ed H. Chi and Todd Mytkowicz. “Understanding Navigability of Social Tagging Systems.” CHI 2007.
[WZY06] Xian Wu, Lei Zhang, and Yong Yu. “Exploring Social Annotation for the Semantic Web.” WWW 2006.
[ZBZ08] Ding Zhou, Jiang Bian, Shuyi Zheng, Hongyuan Zha, and C. Lee Giles. “Exploring Social Annotations for Information Retrieval.” WWW 2008.

Thank you

Q & A

Additional slides

• RSS monitoring
  • Different data posting patterns
  • Optimal size of estimation window
  • Consistency of posting rate
• Providing personalized recommendations
  • Partition of trust matrix
  • Threshold algorithm
  • NMF approximation accuracy
  • Approximation accuracy
  • Multi-armed bandit problem
• Social annotations
  • Preferential attachment / usage
  • URL in photography category
  • Distribution of entropy change
  • Performance of different classifiers

Different data posting patterns

Optimal size of estimation window

Resource constraint: 4 retrievals per day per feed on average

2 weeks seems an appropriate choice

Consistency of posting rate

90% of the RSS feeds post consistently

Partition of subscription matrix

• Decomposition is useful when the matrix is dense
• Real-life data is often skewed
• Hybrid method: uses NMF only in its effective region

Figure labels:
• Axes: users with more subscriptions; blogs with more subscribers
• Users with >30 subscriptions, feeds with >30 subscribers: 10k feeds, 24k users, ~1M subscription pairs
• 2.7M subscription pairs
• Regions: 1. OTF, 2. VIEW, 3. NMF

Threshold algorithm

• Proposed by Fagin et al. [FLN01]
• Efficient computation of top-K items from multiple lists with a monotone aggregate function

[Figure labels: users, blogs, template user’s recommendations, update, query]

NMF approximation accuracy

• Dense region of subscription matrix
  • >30 subscribers: 10152 feeds
  • >30 subscriptions: 24340 users
• L2 norm comparison
• Sparsity of W (23%), H (13%)
• NMF approximation is close to SVD and sparse

Rank   SVD     NMF
80     848.5   856.9
90     841.6   850.1
100    835.1   844.6
110    829.0   837.9
120    823.2   833.0

Approximation accuracy

• How many items are approximated by NMF in the top 20 list?
  • Ti – top 20 items of user i computed by OTF
  • Ai – top 20 items of user i computed by NMF
  • Accuracy: $|A_i \cap T_i| \,/\, |T_i|$
• 70% approximation, and more accurate for higher-ranked items
• Correlation with rank
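The accuracy measure is the fraction of the exact top-K list that the approximate list recovers. A minimal sketch, with a hypothetical function name and made-up item identifiers:

```python
def topk_overlap(approx_items, exact_items):
    """Fraction of the exact top-K items that also appear in the approximate list."""
    exact = set(exact_items)
    if not exact:
        raise ValueError("exact_items must not be empty")
    return len(set(approx_items) & exact) / len(exact)


# Hypothetical top-4 lists for one user: three of four items agree.
accuracy = topk_overlap(["a", "b", "c", "x"], ["a", "b", "c", "d"])
```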

Multi-armed bandit problem

• Well-studied problem in reinforcement learning / statistics
• Problem statement
  • Background: you are given n different choices
  • Decision: for each choice you receive a numerical reward chosen from an unknown stationary probability distribution
  • Goal: maximize the total reward over some time period
• Solutions
  • Action-value methods (greedy & ε-greedy)
  • Softmax action selection (decaying)
  • Pursuit methods
  • Associative search
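Of the solutions listed, ε-greedy is the simplest to sketch: with probability ε pick a random arm, otherwise pick the arm with the highest running average reward. The function name, the `pull` callback, and the default parameters below are assumptions for illustration.

```python
import random


def epsilon_greedy(pull, n_arms, steps, epsilon=0.1, seed=0):
    """Play a bandit for `steps` rounds; pull(arm) returns a numeric reward.

    Returns the estimated value per arm, the pull count per arm,
    and the total reward collected.
    """
    rng = random.Random(seed)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the running average reward of this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return values, counts, total


# Deterministic toy bandit: arm index equals its reward.
greedy_values, greedy_counts, greedy_total = epsilon_greedy(
    lambda arm: float(arm), n_arms=3, steps=50, epsilon=0.0)
explore_values, explore_counts, explore_total = epsilon_greedy(
    lambda arm: float(arm), n_arms=3, steps=200, epsilon=0.5)
```

With ε = 0 the player never leaves the first arm it tries, even though it pays nothing; this is the failure mode that exploration is meant to avoid.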

Preferential attachment/usage

URL / Tag usage distribution

URL in photography category

Documents ranked by p(d|z) values

Specific

• Stop word list
• Inverse document frequency (idf)
• Ontology based
• Entropy

  $$H(w) = -\sum_{i=1}^{T} p(z_i \mid w)\,\log p(z_i \mid w)$$

• Least specific tags found
  • idf – [web, reference, software, design, …]
  • Entropy – [temp, for, important, good, …]
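The entropy of a tag's topic distribution p(z|w) can be computed as below. This is a sketch assuming the natural logarithm and a distribution given as a plain list of probabilities; the function name and the two sample distributions are hypothetical.

```python
import math


def topic_entropy(p_z_given_w):
    """Entropy of a tag's distribution over hidden topics; zero terms are skipped."""
    return -sum(p * math.log(p) for p in p_z_given_w if p > 0)


# A generic tag spread evenly over four topics vs. a tag tied to a single topic.
generic = topic_entropy([0.25, 0.25, 0.25, 0.25])
specific = topic_entropy([1.0, 0.0, 0.0, 0.0])
```

A tag spread evenly over many topics has high entropy and is therefore less specific, while a tag concentrated on one topic has entropy zero.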

Time-sensitivity

• The usage / associated context changes over time
  • “holiday”
    • Travel packages: [travel, eclipse, europe, guide, …]
    • Christmas shopping: [christmas, gift, shopping, …]
  • “programmers”
    • Programming: [programming, development, code, patterns, …]
    • Job hunting: [work, jobs, career, job, …]
• KL-divergence of two distributions
• Jaccard coefficient of two sets of tagged URLs
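Both change measures are short to write down. The sketch below assumes the two topic distributions are lists of equal length and uses a small floor `eps` to avoid division by zero, which is an implementation choice made here and not something stated on the slide; the sample URL sets are made up.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Topic distribution of one tag in two time windows (hypothetical values).
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
shifted = kl_divergence([0.9, 0.1], [0.1, 0.9])

# URLs tagged with the same word in two time windows (hypothetical values).
url_overlap = jaccard({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```

A large divergence or a small Jaccard coefficient between two time windows suggests that the tag's context has changed.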

Distribution of entropy change

Entropy increases over time (+0.1 over 3 months)

Performance of different classifiers

10-fold cross-validation

Classifier              Specific   Emerging   Stable
SVM                     65.7%      70.3%      63.9%
Logistic regression     66.3%      69.8%      60.7%
C4.5                    64.5%      73.3%      59.0%
Random forest           65.1%      73.3%      57.4%
Knn (k=5)               60.1%      67.3%      64.5%
Multilayer perceptron   60.3%      63.9%      60.1%
Naïve Bayes             67.4%      59.9%      63.4%
Best-5 combined         66.3%      73.8%      63.4%