Challenges and Opportunities in Building Personalized
Online Content Aggregators
Ka Cheung Sia
Adviser: Prof. Junghoo Cho
Oral Defense
January 12, 2009
Challenges and Opportunities in Building Personalized Online Content Aggregator
Outline
Emergence of Web 2.0
Online content aggregators
Challenges and opportunities
RSS monitoring
Personalized recommendations
Social annotations
Conclusion
Web 1.0
A few professional content creators
  News
  Corporate sites
  Portal
One way consumption of information
Web 2.0
Facilitators of content sharing
  Wikipedia
  Blog
  Media file sharing
  Discussion group
Everyone can publish content easily
Handheld devices and innovative online applications
Being Web 2.0 publishers
Growth of UGC / blogs
In a 2007 study
  Professional content: 2 GB / day
  UGC: 8-10 GB / day
Bloglines.com
  26% of users with >30 subscriptions
2006 Person of the Year (TIME)
RSS
Really Simple Syndication
  XML
  Contains 10-15 latest posts
Machine readable
  Datetime of publications
  Title / content
  Permalink
Subscription
  RSS reader
  Personalized homepage
How does RSS help readers?
Without RSS (visit different URLs)
With RSS (centralized access)
RSS usage
High usage but low awareness
  27% consume
  4% aware
Common usage
  News feeds
  Podcasting
  My MSN / My Yahoo! / etc.
  Google Reader / Bloglines
  Indexing blogs
Time-sensitive content
“RSS – Crossing into the Mainstream”, Yahoo white paper by Joshua Grossnickle, Oct 2005
Online content aggregator
Centralized access to subscribed content in executive summary style
Leverage collaborative filtering
Ubiquitous access
Collect useful social annotation data
Online content aggregator (Google reader example)
Subscription list
Newly updated articles (Chapter 2 & 4)
Recommendations (Chapter 3)
Social annotations (Chapter 5)
Challenges and opportunities
How to deliver up-to-date content?
  New articles update quickly with recurring patterns
  Significance of articles deteriorates quickly over time
How to provide better personalization?
  Ranking articles/topics based on user interest
  Efficient computation to handle large number of users
What is the knowledge in Web 2.0 data?
  Improve Web resources categorization
  Vocabulary usage
Outline
Emergence of Web 2.0
Online content aggregator
Challenges and opportunities
RSS monitoring
  How to deliver “fresh” content
Providing better personalization
Web 2.0 knowledge mining
Conclusion
The retrieval problem
Research problem in proxies, search engines, …
  Source cooperativeness [DKP01, OW02]
  Priority of different content [CG03a]
  Resource constraints
  User satisfaction [PO05, WSY02]
  Politeness issues, …
Data source → (retrieval) → aggregator → (deliver) → user
Metrics
Evaluation at time u1
  Freshness: 0
  Age: $u_1 - t_4$
  Delay: $(\tau_1 - t_1) + (\tau_1 - t_2) + (\tau_1 - t_3)$
  Miss-penalty: 2
Push vs. Pull
  Push: All updates are known (e.g. RSS ping services)
  Pull: Future updates are estimated
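The four metrics can be illustrated with a small sketch. It assumes one reading of the slide: posts at times t, retrievals at times τ, one user access at time u, delay summed over retrieved posts, and age measured from the oldest post not yet retrieved. The function name and the sample timestamps are hypothetical.

```python
def evaluate(post_times, retrieval_times, access_time):
    """Freshness, age, delay and miss-penalty of one source at access_time."""
    done = [r for r in retrieval_times if r <= access_time]
    last = max(done) if done else float("-inf")
    posts = [t for t in post_times if t <= access_time]
    # Posts published after the last retrieval never reached the aggregator.
    missed = [t for t in posts if t > last]
    # Each retrieved post waits until the first retrieval at or after it.
    delay = sum(min(r for r in done if r >= t) - t for t in posts if t <= last)
    return {
        "freshness": 0 if missed else 1,
        "age": access_time - min(missed) if missed else 0,
        "delay": delay,
        "miss_penalty": len(missed),
    }


# Three posts before the single retrieval at time 4, two posts after it,
# and a user access at time 7 (all values are illustrative).
result = evaluate(post_times=[1, 2, 3, 5, 6], retrieval_times=[4], access_time=7)
```

With these sample values the user sees a stale copy (freshness 0) and two posts are missed, which mirrors the situation shown on the slide.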
Refined model
Commonly used Webpage change model
  Homogeneous Poisson model
  λ(t) = λ at any t
RSS content updates more frequently, with a recurring pattern
  Periodic inhomogeneous Poisson model
  λ(t) = λ(t-nT), n=1,2,…, T is the period
user data source
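A minimal sketch of the periodic model, assuming the rate is piecewise constant per hour and the period T is one day. The hourly values and both function names are hypothetical; the expected number of posts in an interval is the integral of λ(t), approximated here by a midpoint sum.

```python
def periodic_rate(hourly_rates):
    """Build λ(t) with period T = len(hourly_rates), so λ(t) = λ(t - nT)."""
    period = len(hourly_rates)

    def rate(t):
        return hourly_rates[int(t) % period]

    return rate


def expected_posts(rate, start, end, step=0.01):
    """Approximate the integral of λ(t) over [start, end) by a midpoint sum."""
    n = round((end - start) / step)
    return sum(rate(start + (k + 0.5) * step) for k in range(n)) * step


# Hypothetical source: silent at night, busy during the day, quieter in the evening.
rate = periodic_rate([0] * 8 + [2] * 8 + [1] * 8)
posts_day_one = expected_posts(rate, 0, 24)
posts_day_two = expected_posts(rate, 24, 48)
```

Because the rate repeats every T hours, the expected count for any full day is the same, whereas a homogeneous model would spread that count evenly over the day.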
Optimization problem
Resource allocation
  How often to contact a data source?
  O1 is more active and has more subscribers than O2; how much more often should we contact O1?
Retrieval scheduling
  When to contact a data source?
  Given 2 retrievals allocated for O1, when to retrieve from it? Both in the morning, or one in the morning and one at night?
[Equation: retrieval allocation m_i in terms of source weight w_i]
Retrieval schedule intuition
t=1: No postings missed
t=0 or 2: All postings (in the same period) missed
Necessary optimal condition
Given λ(t) and u(t), find the schedule of τj’s that minimizes delay / miss-penalty
Delay: schedule right after a large number of new posts
Miss-penalty: schedule right before lots of user accesses
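The delay intuition can be checked with a brute-force sketch. It does not use the optimality condition from the slide: it assumes time is discretized into slots of one period, a hypothetical `rates[h]` gives the expected posts per slot, and every combination of k retrieval slots is scored by expected delay.

```python
from itertools import combinations


def expected_delay(rates, schedule):
    """Expected total delay per period for retrievals at the given slots.

    A post in slot h waits until the next retrieval slot, wrapping around
    the period (a retrieval in the same slot means zero wait).
    """
    period = len(rates)
    return sum(
        lam * min((tau - h) % period for tau in schedule)
        for h, lam in enumerate(rates)
    )


def best_schedule(rates, k):
    """Exhaustively pick the k retrieval slots with the lowest expected delay."""
    return min(
        combinations(range(len(rates)), k),
        key=lambda s: expected_delay(rates, s),
    )


# Hypothetical six-slot period with a burst of posts in slot 1.
rates = [0, 4, 0, 0, 1, 0]
one_retrieval = best_schedule(rates, 1)
two_retrievals = best_schedule(rates, 2)
```

With a single retrieval the search places it on the burst, and with two retrievals it covers both posting slots. Exhaustive search only works for small periods and small k.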
Performance
Reduces misses by 33% compared to CGM03 for 1 retrieval per day
Reduces misses by a further 20% when the user access pattern is considered
Summary
Better RSS content update model
Significantly improve “content freshness” under same resource constraint
Analysis of typical posting patterns and access patterns
“Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho and Hyun-Kyu Cho, in IEEE TKDE 2007
“Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng in ICWSM 2007
Outline
Emergence of Web 2.0
Online content aggregator
Challenges and opportunities
RSS monitoring
Providing better personalization
  Ranking articles/topics based on user interest
  Efficient computation to support large number of users
Social annotations
Conclusion
Learning user profile
Users are reluctant to indicate their interest
  Cold-start problem
  Diversified recommendations [ZMK05]
  Drift of user interest [WPB01]
  Relevance feedback [Eft00, KDF05]
Goal: improve relevance of recommendations (click utility)
[Figure: learning process with recommendations and feedback]
Ranking model 1
Assumptions
  K predefined topics
  Every recommendation item belongs to one topic
User profile: Θi – Pr (click | read, topic i)
  Θi is estimated by α/(α+β), drawing from a beta distribution with parameters α, β
Topic 1 2 3 4 5 6
α 2 0 2 5 3 2
β 10 0 1 0 3 1
Ranking model 2
Ranking bias: g(j) – Pr (read | j)
  Read probability decreases with rank
  Borrowed from web search studies
Utility function: U(R; Θ)
  R: ranking of topics
  Articles belonging to the same topic are chosen randomly
Ranking topics
Updating posteriori distribution after each iteration
  Not clicked: βnew = βold + g(ri)
  Clicked: αnew = αold + 1
Ranking function of topics
  Exploitation + λ*exploration
  Mean + λ*variance
  $\frac{\alpha}{\alpha+\beta} + \lambda \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
Example (λ=1)
  α=2, β=2: Ranking 0.55
  α=5, β=5: Ranking 0.52
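The ranking function and the update rule can be sketched as follows, using the standard mean and variance of a Beta(α, β) distribution. The function names and the `read_prob` argument (standing in for g(ri)) are hypothetical.

```python
def ranking_score(alpha, beta, lam=1.0):
    """Mean + λ * variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total                                    # exploitation
    variance = alpha * beta / (total ** 2 * (total + 1))    # exploration
    return mean + lam * variance


def update(alpha, beta, clicked, read_prob):
    """Posterior update for a topic after one iteration.

    read_prob plays the role of g(r_i), the probability that the user
    read the item at the rank where the topic was shown.
    """
    if clicked:
        return alpha + 1, beta
    return alpha, beta + read_prob


score_few_observations = ranking_score(2, 2)    # about 0.55
score_many_observations = ranking_score(5, 5)   # about 0.52
```

Both topics have the same mean of 0.5, but the topic with fewer observations has the larger variance and therefore ranks higher, which is the exploration effect.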
Simulation
Click utility improves in the long run
Adapts to drift of interest
More accurate estimation of user interest Θ
User studies
10 users from UCLA and NEC
45 categories from dmoz.org
  Arts/Architecture
  Computers/E-books
  Science/Biology
  …
Survey of user interest before experiment
7 articles (Webpages) per iteration
3 strategies interleaved
First 25 iterations
Drifted at 25th iteration
Summary
Learning framework
  Exploitation: recommend items the user is interested in
  Exploration: explore the user’s other potential interests
Proven to improve click utility and adapt to drift of user interest
“Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007
Outline
Emergence of Web 2.0
Online content aggregator
Challenges and opportunities
RSS monitoring
Providing better personalization
  Ranking articles/topics based on user interest
  Efficient computation to support large number of users
Social annotations
Conclusion
Aggregation as recommendation
User-generated content in the Blogosphere and Web 2.0 services contains rich information about recent events
Aggregation of individual opinions often shows interesting popular topics
Personal recommendation
Items (phrases): Dark Knight, Olympics, Michael Phelps, WALL-E, Las Vegas
RSS sources:
  “Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!”
  “Um.. it will be good if there is a free show of Dark Knight and WALL-E”
  “Michael Phelps performance in Olympics is awesome...”
  “Finished watching Michael Phelps in Olympics, let me watch the WALL-E DVD...”
Matrix formulation
Reference matrix (E) – the number of times a blogger mentions a phrase/link in their blog posts
Subscription matrix (T) – how often a user reads a blog
Personalized score (TE)
E      o1   o2   o3
b1     3    2    0
b2     0    3    0
b3     1    0    1
b4     1    2    3
Total  5    7    4

T      b1   b2   b3   b4
u1     0.8  0.8  0    0
u2     0.2  0.2  0.6  0.6
u3     0    0    0.5  0.5

TE     o1   o2   o3
u1     2.4  4.0  0.0
u2     1.8  2.2  2.4
u3     1.0  1.0  2
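The personalized score is a plain matrix product. The sketch below recomputes TE from the E and T values on this slide using nested lists; the helper name `matmul` is hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Reference matrix E: rows are blogs b1..b4, columns are items o1..o3.
E = [[3, 2, 0],
     [0, 3, 0],
     [1, 0, 1],
     [1, 2, 3]]

# Subscription matrix T: rows are users u1..u3, columns are blogs b1..b4.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]

# Personalized scores, rounded to one decimal to hide floating-point noise.
TE = [[round(v, 1) for v in row] for row in matmul(T, E)]
```

Item o2 has the highest global total, but u2 and u3 both score o3 highest because of the blogs they read most.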
Database operation of matrix
Reference (rss-id, item, score)
  <b1, o1, 3>
  <b1, o2, 2>
  <b2, o2, 3>
  …
  Grows over time
Subscription (user-id, rss-id, score)
  <u1, b1, 0.8>
  <u1, b2, 0.8>
  <u2, b1, 0.2>
  …
  Relatively stable
T  b1   b2   b3   b4
u1 0.8  0.8  0    0
u2 0.2  0.2  0.6  0.6
u3 0    0    0.5  0.5
E o1 o2 o3
b1 3 2 0
b2 0 3 0
b3 1 0 1
b4 1 2 3
Baselines
Aggregate Query
SELECT e.item, SUM(t.score * e.score) AS p_score
FROM Endorsement e, Trust t
WHERE e.blog_id = t.blog_id
  AND t.user_id = <user id>
GROUP BY e.item
ORDER BY p_score DESC
LIMIT 20
On-the-fly (OTF)
View (VIEW)
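The on-the-fly baseline can be tried out with Python's built-in `sqlite3` module (the experiments in the talk used MySQL). This is a sketch under assumptions: the column names use underscores so that they are valid identifiers, the rows are the non-zero entries of the small example matrices, and `top_items` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL);
    CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL);
""")
conn.executemany("INSERT INTO Endorsement VALUES (?, ?, ?)", [
    ("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3), ("b3", "o1", 1),
    ("b3", "o3", 1), ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3),
])
conn.executemany("INSERT INTO Trust VALUES (?, ?, ?)", [
    ("u1", "b1", 0.8), ("u1", "b2", 0.8),
    ("u2", "b1", 0.2), ("u2", "b2", 0.2), ("u2", "b3", 0.6), ("u2", "b4", 0.6),
    ("u3", "b3", 0.5), ("u3", "b4", 0.5),
])


def top_items(conn, user_id, limit=20):
    """Run the aggregate query on the fly for one user."""
    return conn.execute(
        """
        SELECT e.item, SUM(t.score * e.score) AS p_score
        FROM Endorsement e JOIN Trust t ON e.blog_id = t.blog_id
        WHERE t.user_id = ?
        GROUP BY e.item
        ORDER BY p_score DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()


u1_items = [item for item, _ in top_items(conn, "u1")]
```

Every call joins and aggregates at query time, which is why this baseline has no update cost but a high query cost.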
Two stage computation
Support large number of users and RSS sources
  OTF – high query cost
  VIEW – high update cost
Identify “template” users
  Users often share similar reading interest
  Example: template users interested in sports / politics / technologies / …
Results are pre-computed and then combined in two stages
Discover user groups by NMF
Decompose subscription matrix T into sub-matrices W and H
  Non-negative matrix factorization (NMF) [Hoy04]
  W: [individual users : template users] relationship
  H: [template users : blogs] relationship
Example: user 2’s subscription vector is expressed as a linear combination of two template users
NMF as an approximation of original subscription matrix
  Accurate
  Sparse
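A minimal sketch of the decomposition T ≈ WH. Note the swap: it uses the basic multiplicative-update rules for NMF rather than the sparseness-constrained variant of [Hoy04], and it runs on the small example subscription matrix. All function names, the seed and the iteration count are assumptions.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(t, w, h):
    """L2 (Frobenius) norm of T - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rt, rw in zip(t, wh) for x, y in zip(rt, rw)) ** 0.5


def nmf(t, rank, iterations=200, seed=0, eps=1e-9):
    """Factor T (users x blogs) into W (users x templates) and H (templates x blogs)."""
    rng = random.Random(seed)
    n, m = len(t), len(t[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T T) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, t), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (T H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(t, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]
W, H = nmf(T, rank=2)
error = frobenius_error(T, W, H)
```

Each row of H plays the role of a template user's subscription vector, and each row of W gives the weights that combine the templates into an individual user.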
Reconstruction of results
Personalized scores of template users are pre-computed
  (HE) is maintained as sorted lists for template users
W*(HE) becomes the personalized scores of all users
  Computed using Threshold Algorithm [FLN01]
Top-K list
  (HE) are sorted lists
  W*(HE) is a weighted linear combination
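A sketch of the Threshold Algorithm applied to this setting, assuming non-negative scores and weights, one descending sorted list per template user, and a missing item counting as score 0. The function name and the sample lists (derived from the small E matrix with two hand-picked template users) are hypothetical.

```python
import heapq


def threshold_top_k(sorted_lists, weights, k):
    """Top-k items under a weighted sum of per-template scores.

    sorted_lists[j] holds (item, score) pairs for template user j, sorted
    by score descending; weights[j] is this user's weight for template j.
    """
    lookup = [dict(lst) for lst in sorted_lists]  # random access by item

    def total(item):
        return sum(w * d.get(item, 0.0) for w, d in zip(weights, lookup))

    seen = {}
    for depth in range(max(len(lst) for lst in sorted_lists)):
        threshold = 0.0
        for w, lst in zip(weights, sorted_lists):
            if depth < len(lst):
                item, score = lst[depth]
                threshold += w * score  # best score an unseen item could add
                if item not in seen:
                    seen[item] = total(item)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best seen item cannot be beaten by unseen items.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Rows of HE for two template users, kept as sorted lists.
template_lists = [
    [("o2", 4.0), ("o1", 2.4), ("o3", 0.0)],
    [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)],
]
top_two = threshold_top_k(template_lists, weights=[0.25, 1.2], k=2)
```

In this example the scan stops after reading two entries from each list, because the threshold has already dropped below the second-best score seen so far.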
Experiments
Bloglines.com: online RSS reader
Subscription matrix T: (0 or 1) subscription profile
  91k users
  487k feeds
Reference matrix E: blog-keyword occurrence
  Feed content collected between Nov 2006 – Jul 2007
  Top 20 nouns with highest tf-idf in each post are selected as keywords
Platform
  Python implementation of proposed method
  MySQL server on Linux with data residing on RAID
The difference by personalization
Week 2007 Jan 7 – Jan 13
  Major event: iPhone announced
  3 users with large number of subscriptions
Distinct difference between top-20 recommended words
  Among users – 1.13
  Between users and global – 1.12
Recommended words, 2007-01-07 to 2007-01-13
Global       User 90439  User 90550  User 91017
sales        cattle      brazil      yorker
iphone       beef        iguazu      iraqi
apple        iphone      reuters     bush
manager      chicago     search      president
iraq         iraq        vegas       views
management   bush        argentina   avenue
development  apple       kibbutz     dept
software     companies   video       troops
business     prices      cathartik   saddam
phone        quarter     google      iran
Efficiency of proposed method
Update cost
  OTF (222K) < NMF (3.2M) < VIEW (23.6M)
Query response time
  Average over 1000 users with highest number of subscriptions
  OTF: execute SQL query directly on MySQL server
  NMF: Python implementation that interfaces with MySQL server
Average query response time reduced by 75%, eliminating outliers with significant delay
70% approximation
Method  avg    std    max     min
OTF     2.05s  3.60s  84.42s  0.037s
NMF     0.46s  0.53s  2.84s   0.007s
Summary
Provide personalized recommendation by selective aggregation
Proposed matrix model for personalized aggregation
Optimization by NMF & Threshold Algorithm
Real life dataset study shows query response time can be reduced significantly with acceptable approximation accuracy
“Efficient Computation of Personal Aggregation Queries on Blogs”, with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008
Outline
Emergence of Web 2.0
Online content aggregator
Challenges and opportunities
RSS monitoring
Providing better personalization
Social annotations
  Vocabulary usage
  Effective advertising keyword selection
Conclusion
Social annotations
Bookmark tags, video/picture annotations, article tags
  Evolving vocabularies (itouch, wow, w00t, …)
  Emoticons (>_<, Orz, …)
Intensive human effort
Latent Dirichlet Allocation [BNJ03]
  Recover hidden topics z’s
  Represent words p(z|w) and documents p(z|d) as distributions over hidden topics
Improving information retrieval
  Web document retrieval [WZY06, ZBZ08]
  Social tagging usability [CM07]
[Figure: users (u), tags (w), documents (d)]
Topic categorization
Desired properties of effective advertising keywords
Specific
  Reach target audience
  e.g. automobiles > ford, good > programming
Emerging
  Developing vs. stable
  Easier to attract user attention
Time-(in)sensitive
  Context change over time
  Watch for change in target audience
How can these properties be learned from social annotations collected in aggregators?
Emerging
Words correspond to emerging topics
Users actively explore new pages and annotate evenly on different pages
Examples (between December 2007 and March 2008):
  rails2.0 (Ruby on Rails webapp framework)
  kindle (Amazon ebook)
  itouch (unofficial nickname of iPod touch)
  eeepc (subnotebook by Asus)
  obama (Barack Obama)
  jailbreak (Apple iPhone crack software)
Change of entropy
emerging stable
Effective advertising keyword classifier
10+ features extracted from social annotations for each word
User study performed on Amazon Mechanical Turk
10-fold cross-validation on different classifiers
  SVM 70.3%
  Logistic regression 69.8%
  C4.5 73.3%
  Random forest 73.3%
  K-nn 67.3%
  Back-propagation neural nets 63.9%
  Naïve Bayes 59.9%
  Best-5 combined 73.8%
Summary
Leverage social annotations collected from online content aggregator users
Social annotations differ significantly from general text corpora
  New metrics / features
  Usage in online advertising
“Exploring Social Annotations for Effective Advertising Keyword Selection”, with Junghoo Cho, work in progress
Conclusion
Web 2.0 phenomenon
  More content sharing and diverse interest
Personalized online content aggregator
  Easier access to different information sources
  Deliver up-to-date content
  Deliver better personalized recommendations
  Leverage human effort collected in the aggregator
References
[CG03a] Junghoo Cho and Hector Garcia-Molina. “Effective Page Refresh Policies for Web Crawlers.” ACM TODS 28(4), 2003
[DKP01] Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, Krithi Ramamritham, and Prashant Shenoy. “Adaptive Push-Pull: Disseminating Dynamic Web Data.” WWW 2001
[OW02] Chris Olston and Jennifer Widom. “Best-Effort Cache Synchronization with Source Cooperation.” SIGMOD 2002
[PO05] Sandeep Pandey and Christopher Olston. “User-Centric Web Crawling.” WWW 2005
[WSY02] J.L. Wolf, M.S. Squillante, P.S. Yu, J. Sethuraman, and L. Ozsen. “Optimal Crawling Strategies for Web Search Engines.” WWW 2002
[FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. “Optimal Aggregation Algorithms for Middleware.” PODS 2001
[Hoy04] Patrik Hoyer. “Non-negative Matrix Factorization with Sparseness Constraints.” Journal of Machine Learning Research, 5:1457-1469, 2004
[LWL07] Chengkai Li, Ming Wang, Lipyeow Lim, Haixun Wang, and Kevin Chen-Chuan Chang. “Supporting Ranking and Clustering as Generalized Order-By and Group-By.” SIGMOD 2007
[PP07] Seung-Taek Park and David Pennock. “Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing.” SIGKDD 2007
References
[Eft00] E.N. Efthimiadis. “Interactive Query Expansion: A User-based Evaluation in a Relevance Feedback Environment.” JASIS 51(11), 2000
[KDF05] Diane Kelly, Vijay Deepak Dollu, and Xin Fu. “The Loquacious User: A Document-Independent Source of Terms for Query Expansion.” SIGIR 2005
[WPB01] Geoffrey I. Webb, Michael J. Pazzani, and Daniel Billsus. “Machine Learning for User Modeling.” User Modeling and User-Adapted Interaction, 11(1-2):19-29, 2001
[ZMK05] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. “Improving Recommendation Lists Through Topic Diversification.” WWW 2005
[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3:993-1022, 2003
[CM07] Ed H. Chi and Todd Mytkowicz. “Understanding Navigability of Social Tagging Systems.” CHI 2007
[WZY06] Xian Wu, Lei Zhang, and Yong Yu. “Exploring Social Annotation for the Semantic Web.” WWW 2006
[ZBZ08] Ding Zhou, Jiang Bian, Shuyi Zheng, Hongyuan Zha, and C. Lee Giles. “Exploring Social Annotations for Information Retrieval.” WWW 2008
Thank you
Q & A
Additional slides
RSS monitoring
  Different data posting patterns
  Optimal size of estimation window
  Consistency of posting rate
Providing personalized recommendations
  Partition of trust matrix
  Threshold algorithm
  NMF approximation accuracy
  Approximation accuracy
  Multi-armed bandit problem
Social annotations
  Preferential attachment / usage
  URL in photography category
  Distribution of entropy change
  Performance of different classifiers
Different data posting patterns
Optimal size of estimation window
Resource constraint: 4 retrievals per day per feed on average
2 weeks seems an appropriate choice
Consistency of posting rate
90% of the RSS feeds post consistently
Partition of subscription matrix
Decomposition is useful when matrix is dense
Real life data is often skewed
Hybrid method: uses NMF only in its effective region
[Figure: subscription matrix partitioned by users with more subscriptions and blogs with more subscribers]
  Users with >30 subscriptions, feeds with >30 subscribers: 10k feeds, 24k users, ~1M subscription pairs
  2.7M subscription pairs
  Regions: 1. OTF, 2. VIEW, 3. NMF
Threshold algorithm
Proposed by Fagin et al. [FLN01]
Efficient computation of top-K items from multiple lists with a monotone aggregate function
users
blogs
Template user’srecommendations
update
query
NMF approximation accuracy
Dense region of subscription matrix
  >30 subscribers: 10152 feeds
  >30 subscriptions: 24340 users
L2 norm comparison
Sparsity of W (23%), H (13%)
NMF approximation is close to SVD and sparse
Rank  SVD    NMF
80    848.5  856.9
90    841.6  850.1
100   835.1  844.6
110   829.0  837.9
120   823.2  833.0
Approximation accuracy
How many items are approximated by NMF in the top 20 list?
  Ti – top 20 items of user i computed by OTF
  Ai – top 20 items of user i computed by NMF
  Accuracy: $|A_i \cap T_i| / |T_i|$
70% approximation and more accurate for higher rank items
Correlation with rank
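The accuracy measure is the fraction of the exact top list that also appears in the approximate one. A minimal sketch, with a hypothetical function name and made-up item lists:

```python
def approximation_accuracy(approx_top, exact_top):
    """Overlap |A ∩ T| / |T| between approximate and exact top lists."""
    exact = set(exact_top)
    return len(set(approx_top) & exact) / len(exact)


# Three of the four exact items are recovered by the approximation.
accuracy = approximation_accuracy(["a", "b", "c", "x"], ["a", "b", "c", "d"])
```

This measure ignores the order within the list, so a separate rank correlation is needed to judge whether the higher-ranked items are the ones being recovered.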
Multi-armed bandit problem
Well-studied problem in reinforcement learning / statistics
Problem statement
  Background: You are given n different choices
  Decision: For each choice you receive a numerical reward chosen from an unknown stationary probability distribution
  Goal: maximize the total reward over some time period
Solutions
  Action-value methods (greedy & ε-greedy)
  Softmax Action Selection (decaying)
  Pursuit methods
  Associative search
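As an illustration of one listed solution, here is a minimal ε-greedy sketch. It assumes Bernoulli rewards with hypothetical success probabilities and a fixed random seed; it is not the ranking method used in the talk.

```python
import random


def epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit, exploring with probability epsilon."""
    rng = random.Random(seed)
    n = len(reward_probs)
    counts = [0] * n       # how often each arm was chosen
    values = [0.0] * n     # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                         # explore
        else:
            arm = max(range(n), key=values.__getitem__)    # exploit
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


counts, values = epsilon_greedy([0.1, 0.9])
```

The estimated values approach the true reward probabilities, and most pulls end up on the better arm while a small fraction is still spent on exploration.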
Preferential attachment/usage
URL / Tag usage distribution
URL in photography category
Documents ranked by p(d|z) values
Specific
Stop word list
Inverse document frequency (idf)
Ontology based
Entropy
  $H(w) = -\sum_{i=1}^{T} p(z_i|w) \log p(z_i|w)$
Least specific tags found
  idf – [web, reference, software, design, …]
  Entropy – [temp, for, important, good, …]
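The entropy of a tag over hidden topics can be computed directly from its topic distribution p(z|w). A small sketch, assuming natural logarithms and made-up distributions; the function name is hypothetical.

```python
import math


def tag_entropy(topic_probs):
    """Entropy of a tag's distribution over hidden topics, p(z_i | w)."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)


# A tag spread evenly over four topics versus one tied to a single topic.
generic_tag = tag_entropy([0.25, 0.25, 0.25, 0.25])
specific_tag = tag_entropy([1.0, 0.0, 0.0, 0.0])
```

A tag used evenly across topics has high entropy and is therefore less specific, while a tag concentrated in one topic has entropy near zero.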
Time-sensitivity
The usage / associated context changes over time
  “holiday”
    Travel packages: [travel, eclipse, europe, guide, …]
    Christmas shopping: [christmas, gift, shopping, …]
  “programmers”
    Programming: [programming, development, code, patterns, …]
    Job hunting: [work, jobs, career, job, …]
KL-divergence of two distributions
Jaccard coefficient of two sets of tagged URLs
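Both change measures are easy to sketch. The example assumes a tag's topic distribution in two time windows for the KL-divergence and two sets of tagged URLs for the Jaccard coefficient; the function names, the smoothing constant `eps` and the sample values are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def jaccard(a, b):
    """Jaccard coefficient of two sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Topic distribution of one tag in two time windows (made-up numbers).
summer = [0.7, 0.2, 0.1]
winter = [0.1, 0.2, 0.7]
context_shift = kl_divergence(summer, winter)

# URLs carrying the tag in each window (made-up identifiers).
url_overlap = jaccard({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```

A large divergence together with a low URL overlap suggests that the tag's context has changed between the two windows.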
Distribution of entropy change
Entropy increases over time (+0.1 over 3 months)
Performance of different classifiers
10-fold cross-validation
Classifier Specific Emerging Stable
SVM 65.7% 70.3% 63.9%
Logistic regression 66.3% 69.8% 60.7%
C4.5 64.5% 73.3% 59.0%
Random forest 65.1% 73.3% 57.4%
Knn (k=5) 60.1% 67.3% 64.5%
Multilayer perceptron 60.3% 63.9% 60.1%
Naïve Bayes 67.4% 59.9% 63.4%
Best-5 combined 66.3% 73.8% 63.4%