Challenges and Opportunities in Building Personalized Online Content Aggregators Ka Cheung Sia Adviser: Prof. Junghoo Cho Oral Defense January 12 2009


2

Outline

Emergence of Web 2.0
Online content aggregators
Challenges and opportunities
  RSS monitoring
  Personalized recommendations
  Social annotations

Conclusion


3

Web 1.0

A few professional content creators
  News
  Corporate sites
  Portal

One way consumption of information


4

Web 2.0

Facilitators of content sharing
  Wikipedia
  Blog
  Media file sharing
  Discussion group

Everyone can publish content easily

Handheld devices and innovative online applications

Being Web 2.0 publishers


5

Growth of UGC / blogs

2007 study
  Professional content: 2 GB / day
  UGC: 8-10 GB / day

Bloglines.com
  26% of users have >30 subscriptions

2006 person of the year - TIME


6

RSS

Really Simple Syndication
  XML
  Contains the 10-15 latest posts

Machine readable
  Date and time of publication
  Title / content
  Permalink

Subscription
  RSS reader
  Personalized homepage


7

How does RSS help readers?

Without RSS (visit different URLs)

With RSS (centralized access)


8

RSS usage

High usage but low awareness
  27% of users consume RSS
  4% are aware of it

Common usage
  News feeds
  Podcasting
  My MSN / My Yahoo! / etc.
  Google Reader / Bloglines
  Indexing blogs

Time-sensitive content

“RSS – Crossing into the Mainstream” Yahoo white paper by Joshua Grossnickle Oct 2005


9

Online content aggregator

Centralized access to subscribed content in executive summary style

Leverage collaborative filtering
Ubiquitous access
Collect useful social annotation data


10

Online content aggregator (Google reader example)

Subscription list
Newly updated articles (Chapter 2 & 4)
Recommendations (Chapter 3)
Social annotations (Chapter 5)


11

Challenges and opportunities

How to deliver up-to-date content?
  New articles update quickly with recurring patterns
  Significance of articles deteriorates quickly over time

How to provide better personalization?
  Ranking articles/topics based on user interest
  Efficient computation to handle large number of users

What is the knowledge in Web 2.0 data?
  Improve Web resource categorization
  Vocabulary usage


12

Outline

Emergence of Web 2.0
Online content aggregator
Challenges and opportunities
  RSS monitoring
    How to deliver “fresh” content
  Providing better personalization
  Web 2.0 knowledge mining

Conclusion


13

The retrieval problem

Research problem in proxies, search engines, …
  Source cooperativeness [DKP01, OW02]
  Priority of different content [CG03a]
  Resource constraints
  User satisfaction [PO05, WSY02]
  Politeness issues, …

[Figure: data source → (retrieval) → aggregator → (deliver) → user]


14

Metrics

Evaluation at time u1
  Freshness: 0
  Age: (u1 − t4)
  Delay: (τ1 − t1) + (τ1 − t2) + (τ1 − t3)
  Miss-penalty: 2

Push vs. Pull
  Push: All updates are known (e.g. RSS ping services)
  Pull: Future updates are estimated


15

Refined model

Commonly used Webpage change model
  Homogeneous Poisson model
  λ(t) = λ at any t

RSS content updates more frequently, with a recurring pattern
  Periodic inhomogeneous Poisson model
  λ(t) = λ(t-nT), n=1,2,…, where T is the period

[Figure labels: user, data source]
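
The periodic inhomogeneous Poisson model can be sketched as follows. This is a minimal illustration, assuming a daily period (T = 24 hours) and a made-up hourly rate table; the names `HOURLY_RATE`, `rate`, and `sample_posts` are hypothetical and not from the slides. Posting times are sampled by thinning a homogeneous process.

```python
import random

PERIOD = 24.0  # T in hours (assumed daily period)

# Hypothetical posting rates (posts/hour) for each hour of the day.
HOURLY_RATE = [0.1] * 8 + [1.5] * 4 + [0.5] * 8 + [0.1] * 4


def rate(t):
    """Periodic rate: lambda(t) = lambda(t - n*T) for every integer n."""
    return HOURLY_RATE[int(t % PERIOD)]


def sample_posts(horizon, rng):
    """Sample posting times in [0, horizon) by thinning a homogeneous process."""
    lam_max = max(HOURLY_RATE)
    t, posts = 0.0, []
    while True:
        t += rng.expovariate(lam_max)          # candidate arrival at rate lam_max
        if t >= horizon:
            return posts
        if rng.random() < rate(t) / lam_max:   # keep with probability lambda(t)/lam_max
            posts.append(t)


posts = sample_posts(7 * PERIOD, random.Random(0))
```

The homogeneous model is the special case where every entry of the rate table is the same constant.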


16

Optimization problem

Resource allocation
  How often to contact a data source?
  O1 is more active and has more subscribers than O2; how much more often should we contact O1?

Retrieval scheduling
  When to contact a data source?
  Given 2 retrievals allocated for O1, when should we retrieve from it? Both in the morning, or one in the morning and one at night?

[Formula lost in extraction; the surviving symbols are m_i and w_i]


17

Retrieval schedule intuition

t=1: No postings missed

t=0 or 2: All postings (in the same period) missed


18

Necessary optimal condition

Given λ(t) and u(t), choose the retrieval times τj that minimize delay / miss-penalty

Delay: Schedule right after a large number of new posts
Miss-penalty: Schedule right before a large number of user accesses
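
The delay intuition can be checked numerically with a small sketch. It is not the optimality condition from the thesis: it assumes time is discretized into 24 hourly slots, that a post waits until the next retrieval slot (wrapping into the next period), and it searches schedules exhaustively. The rate table and the function names are hypothetical.

```python
from itertools import combinations

T = 24  # period in hourly slots (assumed)


def expected_delay(rate, schedule):
    """Expected total delay per period: each post waits for the next retrieval."""
    total = 0.0
    for h, lam in enumerate(rate):
        wait = min((tau - h) % T for tau in schedule)  # wraps into the next period
        total += lam * wait
    return total


def best_schedule(rate, m):
    """Exhaustively pick the m retrieval slots with the smallest expected delay."""
    return min(combinations(range(T), m), key=lambda s: expected_delay(rate, s))


# Hypothetical source that posts mostly between 08:00 and 10:00.
rate = [0.1] * T
rate[8] = rate[9] = 5.0

one = best_schedule(rate, 1)
two = best_schedule(rate, 2)
```

With this rate table the best single retrieval falls in slot 9, the slot in which the burst of new posts ends, which matches the "schedule right after a large number of new posts" rule.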


19

Performance

Reduces misses by 33% compared to CGM03 for 1 retrieval per day
Reduces misses by a further 20% when user access patterns are considered


20

Summary

Better RSS content update model
Significantly improve “content freshness” under the same resource constraint
Analysis of typical posting patterns and access patterns

“Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho and Hyun-Kyu Cho, in IEEE TKDE 2007

“Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng in ICWSM 2007


21

Outline


RSS monitoring
Providing better personalization
  Ranking articles/topics based on user interest
  Efficient computation to support large number of users
Social annotations

Conclusion


22

Learning user profile

Users are reluctant to indicate their interest

Cold-start problem
Diversified recommendations [ZMK05]
Drift of user interest [WPB01]
Relevance feedback [Eft00, KDF05]

Goal: Improve relevance of recommendations
  Click utility

[Figure: learning process loop between recommendations and feedback]


23

Ranking model 1

Assumptions
  K predefined topics
  Every recommendation item belongs to one topic

User profile: Θi – Pr (click | read, topic i)
  Θi is drawn from a beta distribution with parameters α, β and is estimated by its mean α/(α+β)

Topic  1   2  3  4  5  6
α      2   0  2  5  3  2
β      10  0  1  0  3  1


24

Ranking model 2

Ranking bias: g(j) – Pr (read | j)
  Read probability decreases with rank
  Borrowed from web search studies

Utility function: U(R; Θ)
  R: ranking of topics
  Articles belonging to the same topic are chosen randomly


25

Ranking topics

Updating the posterior distribution after each iteration
  Not clicked: βnew = βold + g(ri)
  Clicked: αnew = αold + 1

Ranking function of topics
  Exploitation + λ*exploration
  Mean + λ*variance
  α/(α+β) + λ*αβ/((α+β)^2 (α+β+1))

Example (λ=1)
  α=2, β=2: Ranking 0.55
  α=5, β=5: Ranking 0.52
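
The ranking score and the posterior update can be sketched in a few lines. This is a minimal illustration assuming the score is the mean plus λ times the variance of a Beta(α, β) distribution, where the variance is the standard αβ/((α+β)²(α+β+1)); the function names are hypothetical.

```python
def beta_mean(alpha, beta):
    """Exploitation term: mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Exploration term: variance of Beta(alpha, beta)."""
    n = alpha + beta
    return alpha * beta / (n * n * (n + 1))


def ranking_score(alpha, beta, lam=1.0):
    """Mean + lambda * variance, the topic ranking function."""
    return beta_mean(alpha, beta) + lam * beta_variance(alpha, beta)


def update(alpha, beta, clicked, read_prob):
    """Posterior update for a topic shown at a rank with read probability g(r)."""
    if clicked:
        return alpha + 1, beta
    return alpha, beta + read_prob


score_a = ranking_score(2, 2)   # fewer observations, larger variance
score_b = ranking_score(5, 5)   # same mean, more observations
```

Both topics have mean 0.5, but the less observed one (α=2, β=2) scores higher, so it is explored first; this reproduces the 0.55 versus 0.52 example on the slide.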


26

Simulation

Click utility improves in the long run

Adapts to drift of interest

More accurate estimation of user interest Θ


27

User studies

10 users from UCLA and NEC 45 categories from dmoz.org

Arts/Architecture
Computers/E-books
Science/Biology
…

Survey of user interest before experiment

7 articles (Webpages) per iteration

3 strategies interleaved

First 25 iterations

Drifted at 25th iteration


28

Summary

Learning framework
  Exploitation: recommend items the user is interested in
  Exploration: explore the user’s other potential interests

Proven to improve click utility and adapt to drift of user interest

“Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007


29

Outline


RSS monitoring
Providing better personalization
  Ranking articles/topics based on user interest
  Efficient computation to support large number of users
Social annotations

Conclusion


30

Aggregation as recommendation

User-generated content in the Blogosphere and Web 2.0 services contains rich information about recent events

Aggregation of individual opinions often shows interesting popular topics


31

Personal recommendation

[Figure: RSS sources (blog posts) linked to the items (phrases) they mention]

Items (phrases): Dark Knight, Olympics, Michael Phelps, WALL-E, Las Vegas

RSS sources:
  "Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!"
  "Um.. it will be good if there is a free show of Dark Knight and WALL-E"
  "Michael Phelps performance in Olympics is awesome..."
  "Finished watching Michael Phelps in Olympics, let me watch the WALL-E DVD..."


32

Matrix formulation

Reference matrix (E) – the number of times a blogger mentions a phrase/link in his blog posts

Subscription matrix (T) – how often a user reads a blog

Personalized score (TE)

E      o1  o2  o3
b1     3   2   0
b2     0   3   0
b3     1   0   1
b4     1   2   3
Total  5   7   4

T   b1   b2   b3   b4
u1  0.8  0.8  0    0
u2  0.2  0.2  0.6  0.6
u3  0    0    0.5  0.5

TE  o1   o2   o3
u1  2.4  4.0  0.0
u2  1.8  2.2  2.4
u3  1.0  1.0  2.0
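
The personalized score is a plain matrix product, which can be verified with the toy E and T from this slide. The sketch below uses only the standard library; `matmul` and `top_items` are hypothetical helper names.

```python
ITEMS = ["o1", "o2", "o3"]

# Reference matrix E: rows are blogs b1..b4, columns are items o1..o3.
E = [[3, 2, 0],
     [0, 3, 0],
     [1, 0, 1],
     [1, 2, 3]]

# Subscription matrix T: rows are users u1..u3, columns are blogs b1..b4.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def top_items(scores, k):
    """Item names ordered by descending personalized score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return [ITEMS[j] for j in order[:k]]


TE = matmul(T, E)
u1_top = top_items(TE[0], 2)
```

Although o2 is also the most mentioned item overall, each user's ranking depends on the blogs that user reads: u1 ranks o2 first, while u2 and u3 rank o3 first.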


33

Database operation of matrix

Reference (rss-id, item, score)
  <b1, o1, 3>
  <b1, o2, 2>
  <b2, o2, 3>
  …
  Grows over time

Subscription (user-id, rss-id, score)
  <u1, b1, 0.8>
  <u1, b2, 0.8>
  <u2, b1, 0.2>
  …
  Relatively stable

T   b1   b2   b3   b4
u1  0.8  0.8  0    0
u2  0.2  0.2  0.6  0.6
u3  0    0    0.5  0.5

E   o1  o2  o3
b1  3   2   0
b2  0   3   0
b3  1   0   1
b4  1   2   3


34

Baselines

Aggregate Query
  SELECT e.item, SUM(t.score * e.score) AS p_score
  FROM Endorsement e, Trust t
  WHERE e.blog_id = t.blog_id AND t.user_id = <user id>
  GROUP BY e.item
  ORDER BY p_score DESC LIMIT 20

On-the-fly (OTF)
View
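
The on-the-fly baseline can be tried end to end with a small sketch. It assumes SQLite in place of the MySQL server used in the experiments, underscore column names (`blog_id`, `user_id`), and the toy reference and subscription tuples from the earlier slides; `top_items` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL);
CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL);
""")

# Non-zero entries of the toy reference matrix E and subscription matrix T.
conn.executemany("INSERT INTO Endorsement VALUES (?, ?, ?)", [
    ("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3),
    ("b3", "o1", 1), ("b3", "o3", 1),
    ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3),
])
conn.executemany("INSERT INTO Trust VALUES (?, ?, ?)", [
    ("u1", "b1", 0.8), ("u1", "b2", 0.8),
    ("u2", "b1", 0.2), ("u2", "b2", 0.2), ("u2", "b3", 0.6), ("u2", "b4", 0.6),
    ("u3", "b3", 0.5), ("u3", "b4", 0.5),
])


def top_items(conn, user_id, k=20):
    """Run the aggregate query on the fly for one user."""
    return conn.execute(
        """SELECT e.item, SUM(t.score * e.score) AS p_score
           FROM Endorsement e JOIN Trust t ON e.blog_id = t.blog_id
           WHERE t.user_id = ?
           GROUP BY e.item
           ORDER BY p_score DESC
           LIMIT ?""",
        (user_id, k),
    ).fetchall()


rows = top_items(conn, "u1")
```

Every call joins and aggregates the user's subscribed blogs at query time, so there is nothing to maintain when new posts arrive, but the query work grows with the number of subscriptions.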


35

Two stage computation

Support large number of users and RSS sources
  OTF – high query cost
  VIEW – high update cost

Identify “template” users
  Users often share similar reading interests
  Example: template users interested in sports / politics / technologies / …

Results are pre-computed and then combined in two stages


36

Discover user groups by NMF

Decompose subscription matrix T into sub-matrices W and H
  Non-negative matrix factorization (NMF) [Hoy04]
  W: [individual users : template users] relationship
  H: [template users : blogs] relationship

Example: user 2’s subscription vector is expressed as a linear combination of two template users

NMF as an approximation of the original subscription matrix
  Accurate
  Sparse
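
A rough sketch of the decomposition T ≈ WH follows. It assumes the basic multiplicative-update rule of Lee and Seung rather than the sparseness-constrained variant of [Hoy04], uses the toy 3-user subscription matrix from the matrix formulation slide, and is written in pure Python for readability; all function names are hypothetical.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_error(t, w, h):
    """Frobenius norm of T - WH."""
    wh = matmul(w, h)
    return sum((t[i][j] - wh[i][j]) ** 2
               for i in range(len(t)) for j in range(len(t[0]))) ** 0.5


def nmf(t, rank, iters=200, seed=0, eps=1e-9):
    """Factor t (n x m) into non-negative w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(t), len(t[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T T) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, t), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (T H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(t, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]
W, H = nmf(T, rank=2)
```

In this toy matrix, user 2's row equals 0.25 times user 1's row plus 1.2 times user 3's row, so two template users are enough to describe all three users.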


37

Reconstruction of results

Personalized scores of template users (HE) are pre-computed
  HE is maintained as sorted lists for template users

W*(HE) becomes the personalized scores of all users
  Computed using Threshold Algorithm [FLN01]

Top-K list
  (HE) are sorted lists
  W*(HE) is a weighted linear combination
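
The second stage can be sketched with a simplified version of the Threshold Algorithm. This is an illustration under assumptions, not the thesis implementation: each template user's list is an in-memory Python list sorted by descending score, items missing from a list count as 0, and the weights are non-negative so the weighted sum is monotone. `threshold_top_k` and the example lists are hypothetical.

```python
import heapq


def threshold_top_k(sorted_lists, weights, k):
    """Top-k items under score(item) = sum_j weights[j] * score_j(item)."""
    lookup = [dict(lst) for lst in sorted_lists]   # random access per list

    def total(item):
        return sum(w * d.get(item, 0.0) for w, d in zip(weights, lookup))

    seen, best = set(), []                         # best: min-heap of (score, item)
    max_depth = max(len(lst) for lst in sorted_lists)
    for depth in range(max_depth):
        last = []
        for lst in sorted_lists:                   # one sorted access per list
            if depth < len(lst):
                item, score = lst[depth]
                last.append(score)
                if item not in seen:
                    seen.add(item)
                    heapq.heappush(best, (total(item), item))
                    if len(best) > k:
                        heapq.heappop(best)
            else:
                last.append(0.0)                   # exhausted: unseen items score 0
        # No unseen item can score above the aggregate of the last-seen scores.
        threshold = sum(w * s for w, s in zip(weights, last))
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, key=lambda p: (-p[0], p[1]))


# Hypothetical pre-computed lists (HE) for two template users.
template_a = [("o2", 4.0), ("o1", 2.4)]
template_b = [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)]

# A user expressed as 0.25 * template_a + 1.2 * template_b.
result = threshold_top_k([template_a, template_b], [0.25, 1.2], k=2)
```

The early stop is what makes the combination cheap: the scan ends as soon as the k-th best combined score is at least the threshold, without reading the rest of the sorted lists.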


38

Experiments

Bloglines.com: online RSS reader

Subscription matrix T: (0 or 1) subscription profile
  91k users
  487k feeds

Reference matrix E: blog-keyword occurrence
  Feed content collected between Nov 2006 – Jul 2007
  Top 20 nouns with highest tf-idf in each post are selected as keywords

Platform
  Python implementation of proposed method
  MySQL server on Linux with data residing on RAID


39

The difference by personalization

Week of 2007 Jan 7 – Jan 13
  Major event: iPhone announced
  3 users with large number of subscriptions

Distinct difference between top-20 recommended words
  Among users – 1.13
  Between users and global – 1.12

Recommended words, 2007-01-07 to 2007-01-13

Global       User 90439  User 90550  User 91017
sales        cattle      brazil      yorker
iphone       beef        iguazu      iraq
apple        iphone      reuters     bush
manager      chicago     search      president
iraq         iraq        vegas       views
management   bush        argentina   avenue
development  apple       kibbutz     dept
software     companies   video       troops
business     prices      cathartik   saddam
phone        quarter     google      iran


40

Efficiency of proposed method

Update cost
  OTF (222K) < NMF (3.2M) < VIEW (23.6M)

Query response time
  Average over 1000 users with highest number of subscriptions
  OTF: execute SQL query directly on MySQL server
  NMF: Python implementation that interfaces with MySQL server

Average query response time reduced by 75%, eliminated outliers of significant delay

70% approximation

Method  avg    std    max     min
OTF     2.05s  3.60s  84.42s  0.037s
NMF     0.46s  0.53s  2.84s   0.007s


41

Summary

Provide personalized recommendation by selective aggregation

Proposed matrix model for personalized aggregation
Optimization by NMF & Threshold Algorithm

Real life dataset study shows query response time can be reduced significantly with acceptable approximation accuracy

“Efficient Computation of Personal Aggregation Queries on Blogs”, with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008


42

Outline


RSS monitoring
Providing better personalization
Social annotations
  Vocabulary usage
  Effective advertising keyword selection

Conclusion


43

Social annotations

Bookmark tags, video/picture annotations, article tags
  Evolving vocabularies (itouch, wow, w00t, …)
  Emoticons (>_<, Orz, …)

Intensive human effort

Latent Dirichlet Allocation [BNJ03]
  Recover hidden topics z’s
  Represent words p(z|w) and documents p(z|d) as distributions over hidden topics

Improving information retrieval
  Web document retrieval [WZY06, ZBZ08]
  Social tagging usability [CM07]

[Figure: users (u), tags (w), documents (d)]


44

Topic categorization


45

Desired properties of effective advertising keywords

Specific
  Reach target audience
  e.g. automobiles > ford, good > programming

Emerging
  Developing vs. stable
  Easier to attract user attention

Time-(in)sensitive
  Context changes over time
  Watch for change in target audience

How can these properties be learned from social annotations collected in aggregators?


46

Emerging

Words correspond to emerging topics

Users actively explore new pages and annotate evenly on different pages

Examples (between December 2007 and March 2008):
  rails2.0 (Ruby on Rails webapp framework)
  kindle (Amazon ebook)
  itouch (unofficial nickname of iPod touch)
  eeepc (subnotebook by Asus)
  obama (Barack Obama)
  jailbreak (Apple iPhone crack software)

[Figure: change of entropy, emerging vs. stable]


47

Effective advertising keyword classifier

10+ features extracted from social annotations for each word

User study performed on Amazon Mechanical Turk

10-fold cross-validation on different classifiers
  SVM                           70.3%
  Logistic regression           69.8%
  C4.5                          73.3%
  Random forest                 73.3%
  K-nn                          67.3%
  Back-propagation neural nets  63.9%
  Naïve Bayes                   59.9%
  Best-5 combined               73.8%


48

Summary

Leverage social annotations collected from online content aggregator users

Social annotations differ significantly from general text corpora
  New metrics / features
  Usage in online advertising

“Exploring Social Annotations for Effective Advertising Keyword Selection”, with Junghoo Cho, work in progress


49

Conclusion

Web 2.0 phenomenon
  More content sharing and diverse interests

Personalized online content aggregator
  Easier access to different information sources
  Deliver up-to-date content
  Deliver better personalized recommendations
  Leverage human effort collected in the aggregator


50

References

[CG03a] Junghoo Cho and Hector Garcia-Molina. “Effective Page Refresh Policies for Web Crawlers.” ACM TODS 28(4), 2003

[DKP01] Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, Krithi Ramamritham, and Prashant Shenoy. “Adaptive Push-Pull: Disseminating Dynamic Web Data.” WWW 2001

[OW02] Chris Olston and Jennifer Widom. “Best-Effort Cache Synchronization with Source Cooperation.” SIGMOD 2002

[PO05] Sandeep Pandey and Christopher Olston. “User-Centric Web Crawling.” WWW 2005

[WSY02] J.L. Wolf, M.S. Squillante, P.S. Yu, J. Sethuraman, and L. Ozsen. “Optimal Crawling Strategies for Web Search Engines.” WWW 2002

[FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. “Optimal Aggregation Algorithms for Middleware.” PODS 2001

[Hoy04] Patrik Hoyer. “Non-negative Matrix Factorization with Sparseness Constraints.” Journal of Machine Learning Research, 5:1457-1469, 2004

[LWL07] Chengkai Li, Ming Wang, Lipyeow Lim, Haixun Wang, and Kevin Chen-Chuan Chang. “Supporting Ranking and Clustering as Generalized Order-By and Group-By.” SIGMOD 2007

[PP07] Seung-Taek Park and David Pennock. “Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing.” SIGKDD 2007


51

References

[Eft00] E.N. Efthimiadis. “Interactive Query Expansion: A User-based Evaluation in a Relevance Feedback Environment.” JASIS 51(11), 2000

[KDF05] Diane Kelly, Vijay Deepak Dollu, and Xin Fu. “The Loquacious User: A Document-Independent Source of Terms for Query Expansion.” SIGIR 2005

[WPB01] Geoffrey I. Webb, Michael J. Pazzani, and Daniel Billsus. “Machine Learning for User Modeling.” User Modeling and User-Adapted Interaction, 11(1-2):19-29, 2001

[ZMK05] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. “Improving Recommendation Lists Through Topic Diversification.” WWW 2005

[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3:993-1022, 2003

[CM07] Ed H. Chi and Todd Mytkowicz. “Understanding Navigability of Social Tagging Systems.” CHI 2007

[WZY06] Xian Wu, Lei Zhang, and Yong Yu. “Exploring Social Annotation for the Semantic Web.” WWW 2006

[ZBZ08] Ding Zhou, Jiang Bian, Shuyi Zheng, Hongyuan Zha, and C. Lee Giles. “Exploring Social Annotations for Information Retrieval.” WWW 2008

Thank you

Q & A


53

Additional slides

RSS monitoring
  Different data posting patterns
  Optimal size of estimation window
  Consistency of posting rate

Providing personalized recommendations
  Partition of trust matrix
  Threshold algorithm
  NMF approximation accuracy
  Approximation accuracy
  Multi-armed bandit problem

Social annotations
  Preferential attachment / usage
  URL in photography category
  Distribution of entropy change
  Performance of different classifiers


54

Different data posting patterns


55

Optimal size of estimation window

Resource constraint: 4 retrievals per day per feed on average

2 weeks seems an appropriate choice


56

Consistency of posting rate

90% of the RSS feeds post consistently


57

Partition of subscription matrix

Decomposition is useful when matrix is dense
Real life data is often skewed
Hybrid method: uses NMF only in its effective region

[Figure: subscription matrix with users ordered by number of subscriptions and blogs ordered by number of subscribers, partitioned into three regions: 1. OTF, 2. VIEW, 3. NMF]
  Users with >30 subscriptions, feeds with >30 subscribers
  10k feeds, 24k users, ~1M subscription pairs
  2.7M subscription pairs


58

Threshold algorithm

Proposed by Fagin et al. [FLN01]

Efficient computation of top-K items from multiple lists with a monotone aggregate function

[Figure: users, blogs, and template users’ recommendations, with update and query paths]


59

NMF approximation accuracy

Dense region of subscription matrix
  >30 subscribers: 10152 feeds
  >30 subscriptions: 24340 users

L2 norm comparison

Rank  SVD    NMF
80    848.5  856.9
90    841.6  850.1
100   835.1  844.6
110   829.0  837.9
120   823.2  833.0

Sparsity of W (23%), H (13%)

NMF approximation is close to SVD and sparse


60

Approximation accuracy

How many items are approximated by NMF in the top 20 list?
  Ti – top 20 items of user i computed by OTF
  Ai – top 20 items of user i computed by NMF
  Accuracy: |Ai ∩ Ti| / |Ti|

70% approximation and more accurate for higher rank items

[Figure: correlation with rank]
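
The accuracy measure is the fraction of the exact top-20 list that the approximate list recovers, |Ai ∩ Ti| / |Ti|. A minimal sketch, with hypothetical function names and made-up item lists:

```python
def top_k_overlap(approx, exact):
    """|A_i ∩ T_i| / |T_i| for one user's approximate and exact top-k lists."""
    exact_set = set(exact)
    return len(exact_set.intersection(approx)) / len(exact_set)


def mean_overlap(approx_lists, exact_lists):
    """Average overlap across users."""
    pairs = list(zip(approx_lists, exact_lists))
    return sum(top_k_overlap(a, t) for a, t in pairs) / len(pairs)


# Two hypothetical users with top-2 lists.
average = mean_overlap([["a", "b"], ["c", "d"]], [["a", "b"], ["x", "d"]])
```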


61

Multi-armed bandit problem

Well-studied problem in reinforcement learning / statistics

Problem statement
  Background: You are given n different choices
  Decision: For each choice you receive a numerical reward chosen from an unknown stationary probability distribution
  Goal: maximize the total reward over some time period

Solutions
  Action-value methods (greedy & ε-greedy)
  Softmax action selection (decaying)
  Pursuit methods
  Associative search
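
One of the listed solutions, the ε-greedy action-value method, can be sketched as below. It assumes Bernoulli (click / no click) rewards with made-up click probabilities and sample-average value estimates; it illustrates the bandit setting in general, not the ranking method from the main slides.

```python
import random


def epsilon_greedy(true_probs, steps, epsilon, rng):
    """Play a Bernoulli bandit with epsilon-greedy action selection."""
    n = len(true_probs)
    counts = [0] * n
    values = [0.0] * n                 # sample-average reward estimate per arm
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                          # explore
        else:
            arm = max(range(n), key=lambda a: values[a])    # exploit
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total, counts, values


# Hypothetical click probabilities for three topics.
total, counts, values = epsilon_greedy([0.2, 0.5, 0.8], 2000, 0.1, random.Random(0))
```

Setting `epsilon` to 0 gives the purely greedy method, which never revisits an arm whose early estimate happened to be low.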


62

Preferential attachment/usage

URL / Tag usage distribution


63

URL in photography category

Documents ranked by p(d|z) values


64

Specific

Stop word list
Inverse document frequency (idf)
Ontology based
Entropy
  H(w) = − Σ_{i=1}^{T} p(z_i|w) log(p(z_i|w))

Least specific tags found
  idf – [web, reference, software, design, …]
  Entropy – [temp, for, important, good, …]
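
The entropy of a tag over T hidden topics, H(w) = −Σ p(z_i|w) log p(z_i|w), can be computed directly. A minimal sketch with made-up topic distributions; `topic_entropy` is a hypothetical name and zero-probability topics are skipped by convention:

```python
import math


def topic_entropy(p_z_given_w):
    """Entropy of a tag's distribution over hidden topics (natural log)."""
    return -sum(p * math.log(p) for p in p_z_given_w if p > 0)


generic = topic_entropy([0.25, 0.25, 0.25, 0.25])    # tag spread evenly over topics
specific = topic_entropy([0.97, 0.01, 0.01, 0.01])   # tag concentrated on one topic
```

A tag that is used evenly across all topics has the highest entropy and is the least specific, which is why words like "temp" or "good" surface under this measure.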


65

Time-sensitivity

The usage / associated context changes over time
  “holiday”
    Travel packages: [travel, eclipse, europe, guide, …]
    Christmas shopping: [christmas, gift, shopping, …]
  “programmers”
    Programming: [programming, development, code, patterns, …]
    Job hunting: [work, jobs, career, job, …]

KL-divergence of two distributions
Jaccard coefficient of two sets of tagged URLs
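
Both time-sensitivity measures are short to compute. The sketch below assumes the two distributions are topic distributions of the same tag in two time windows and that the two sets are the URLs tagged in each window; the `eps` floor for zero probabilities and the example values are assumptions added for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def jaccard(a, b):
    """Jaccard coefficient of two sets of tagged URLs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical topic distributions of one tag in two time windows.
summer = [0.7, 0.2, 0.1]
winter = [0.1, 0.2, 0.7]
drift = kl_divergence(summer, winter)

# Hypothetical sets of URLs carrying the tag in each window.
overlap = jaccard({"url1", "url2", "url3"}, {"url2", "url3", "url4"})
```

A large divergence or a small Jaccard coefficient between windows suggests that the tag's context has shifted, as in the "holiday" example.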


66

Distribution of entropy change

Entropy increase over time (+0.1 over 3 months)


67

Performance of different classifiers

10-fold cross-validation

Classifier Specific Emerging Stable

SVM 65.7% 70.3% 63.9%

Logistic regression 66.3% 69.8% 60.7%

C4.5 64.5% 73.3% 59.0%

Random forest 65.1% 73.3% 57.4%

Knn (k=5) 60.1% 67.3% 64.5%

Multilayer perceptron 60.3% 63.9% 60.1%

Naïve Bayes 67.4% 59.9% 63.4%

Best-5 combined 66.3% 73.8% 63.4%