Information Filtering
Hassan Bashiri
May 2009
Agenda
• Information filtering
• Automatic profile learning
• Social filtering
Information Access Problems
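• Retrieval: the collection is stable, the information need is different each time
• Filtering: the information need is stable, the collection is different each time
• Data mining: both the collection and the information need are stable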
Information Filtering
• An abstract problem in which:
  – The information need is stable
    • Characterized by a “profile”
  – A stream of documents is arriving
    • Each must either be presented to the user or not
• Introduced by Luhn in 1958
  – As “Selective Dissemination of Information”
• Named “Filtering” by Denning in 1975
A Simple Filtering Strategy
• Use any information retrieval system
  – Boolean, vector space, probabilistic, …
• Have the user specify a “standing query”
  – This will be the profile
• Limit the standing query by date
  – Each use, show what arrived since the last use
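The standing-query strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` class, the Boolean AND matcher, and the sample collection are hypothetical and stand in for whatever retrieval system is actually used.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Document:
    text: str
    arrived: datetime


def matches(query_terms, doc):
    """Boolean AND match: every query term must occur in the document."""
    words = set(doc.text.lower().split())
    return all(term in words for term in query_terms)


def run_standing_query(query_terms, collection, last_use):
    """Return matching documents that arrived after the previous use."""
    return [
        doc for doc in collection
        if doc.arrived > last_use and matches(query_terms, doc)
    ]


# Hypothetical collection and profile, for illustration only.
collection = [
    Document("Filtering news about retrieval", datetime(2009, 5, 1)),
    Document("New retrieval system announced", datetime(2009, 5, 10)),
    Document("Sports results", datetime(2009, 5, 11)),
]
profile = ["retrieval"]  # the standing query
new_hits = run_standing_query(profile, collection, last_use=datetime(2009, 5, 5))
print([doc.text for doc in new_hits])
```

The first document matches the profile but arrived before the last use, so only the second one is shown.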
What’s Wrong With That?
• Unnecessary indexing overhead
  – Indexing only speeds up retrospective searches
• Every profile is treated separately
  – The same work might be done repeatedly
• Forming effective queries by hand is hard
  – The computer might be able to help
• It is OK for text, but what about audio, video, …
  – Are words the only possible basis for filtering?
The Fast Data Finder
• Fast Boolean filtering using custom hardware
  – Up to 10,000 documents per second
• Words pass through a pipeline architecture
  – Each element looks for one word
[Figure: pipeline elements for the words “good”, “party”, “great”, and “aid”, combined with OR, AND, and NOT]
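A software analogue of this idea, as a rough sketch: one detector per word, followed by a Boolean combination of the detector outputs. The particular expression below is a hypothetical profile built from the four words in the figure; the slide does not state how they were actually combined, and nothing here models the custom hardware.

```python
def word_hits(text, words):
    """One 'element' per word: record whether that word occurs in the text."""
    tokens = set(text.lower().split())
    return {word: (word in tokens) for word in words}


def profile_matches(text):
    """Hypothetical profile: (good OR great) AND party AND NOT aid."""
    hit = word_hits(text, ["good", "great", "party", "aid"])
    return (hit["good"] or hit["great"]) and hit["party"] and not hit["aid"]


print(profile_matches("a great party last night"))
print(profile_matches("good party for aid workers"))
```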
Profile Indexing in SIFT
• Build an inverted file of profiles
  – Postings are profiles that contain each term
• RAM can hold 5,000 profiles/megabyte
  – And several machines can be run in parallel
• Both Boolean and vector space matching
  – User-selected threshold for each ranked profile
    • Hand-tuned on a web page using today’s news
• Delivered by email
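The inverted file of profiles can be illustrated as follows. This is a minimal sketch, not SIFT's implementation: the profile names, the thresholds, and the use of a simple term-overlap fraction as the score are all assumptions made for the example.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Map each term to the set of profile ids whose profile contains it."""
    index = defaultdict(set)
    for profile_id, terms in profiles.items():
        for term in terms:
            index[term].add(profile_id)
    return index


def score_profiles(index, document_text):
    """Count, for each profile, how many of its terms occur in the document."""
    scores = defaultdict(int)
    for term in set(document_text.lower().split()):
        for profile_id in index.get(term, ()):
            scores[profile_id] += 1
    return dict(scores)


def deliver_to(profiles, thresholds, index, document_text):
    """Return sorted ids of profiles whose overlap fraction meets their threshold."""
    scores = score_profiles(index, document_text)
    return sorted(
        pid for pid, hits in scores.items()
        if hits / len(profiles[pid]) >= thresholds[pid]
    )


# Hypothetical profiles and per-profile thresholds.
profiles = {
    "alice": {"information", "filtering"},
    "bob": {"neural", "networks"},
    "carol": {"filtering", "privacy", "email"},
}
thresholds = {"alice": 1.0, "bob": 0.5, "carol": 0.5}
index = build_profile_index(profiles)
recipients = deliver_to(profiles, thresholds, index, "Information filtering by email")
print(recipients)
```

Each arriving document is looked up term by term, so only profiles that share at least one term with it are ever touched. That is the point of indexing profiles rather than documents.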
SIFT Time vs. Documents
SIFT Time vs. Profiles
SIFT Limitations
• Centralization
  – Requires some form of cost sharing
• Privacy
  – Centrally registered profiles
  – Each profile associated with an email address
• Usability
  – Manually specified profiles
  – Thresholds based on obscure retrieval status value
Adaptive Filtering
• Automatic learning for stable profiles
  – Based on observations of user behavior
• Explicit ratings are easy to use
  – Binary ratings are simply relevance feedback
• Unobtrusive observations are easier to get
  – Reading time, save/delete, …
  – But we only have a few clues on how to use them
Relevance Feedback for Filtering
[Flowchart]
• Processing steps: Make Profile Vector, Make Document Vectors, Compute Similarity, Select and Examine (user), Assign Ratings (user), Update User Model
• Data passed between steps: Initial Profile Terms; New Documents; Vector; Vector(s); Vectors; Documents, Vectors, Rank Order; Document, Vector; Rating, Vector
Rocchio Formula
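$$\text{profile vector} = \text{original profile vector} + \text{positive feedback vector} - \text{negative feedback vector}$$

Each vector is multiplied by its weight before the sum:

Vector              Raw values      Weight   Weighted values
Original profile    0 4 0 8 0 0     1.0          0 4 0 8 0 0
Positive feedback   2 4 8 0 0 2     0.5      (+) 1 2 4 0 0 1
Negative feedback   8 0 4 4 0 16    0.25     (-) 2 0 1 1 0 4
New profile                                      -1 6 3 7 0 -3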
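The slide's numbers can be checked with a short Python sketch. The function name `rocchio` and the parameter names `alpha`, `beta`, and `gamma` are labels chosen here for the three weights; the values 1.0, 0.5, and 0.25 and the three input vectors are those recovered from the slide's example.

```python
def rocchio(original, positive, negative, alpha=1.0, beta=0.5, gamma=0.25):
    """Weighted sum: alpha*original + beta*positive - gamma*negative, per term."""
    return [
        alpha * o + beta * p - gamma * n
        for o, p, n in zip(original, positive, negative)
    ]


original = [0, 4, 0, 8, 0, 0]    # original profile vector
positive = [2, 4, 8, 0, 0, 2]    # positive feedback vector
negative = [8, 0, 4, 4, 0, 16]   # negative feedback vector

new_profile = rocchio(original, positive, negative)
print(new_profile)  # [-1.0, 6.0, 3.0, 7.0, 0.0, -3.0]
```

Terms that appear mainly in negatively rated documents end up with negative weights, as the first and last components of the new profile show.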
Adaptive Filtering With LSI
[Flowchart]
• Processing steps: SVD, Reduce Dimensions, Make Profile Vector, Make Document Vectors, Compute Similarity, Select and Examine (user), Assign Ratings (user), Update User Model
• Data passed between steps: Representative Documents; Initial Profile Terms; New Documents; Sparse Vector(s); Matrix; Dense Vector(s); Documents, Dense Vectors, Rank Order; Document, Dense Vector; Rating, Dense Vector
Machine Learning
• Given a set of vectors with associated values
  – e.g., LSI vectors with relevance judgments
• Predict the values associated with new vectors
  – i.e., learn a mapping from vectors to values
• All learning systems share two problems
  – They need some basis for making predictions
    • This is called an “inductive bias”
  – They must balance adaptation with generalization
Machine Learning Techniques
• Hill climbing (Rocchio)
• Instance-based learning
• Rule induction
• Regression
• Neural networks
• Genetic algorithms
• Statistical classification
Instance-Based Learning
• Remember examples of desired documents
  – Keep the documents the user found interesting
• Compare new documents to each example
  – Vector similarity, probabilistic, …
• Base rank order on the most similar example
• Useful for multimodal profiles
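A small sketch of this approach, assuming cosine similarity as the comparison and toy three-term vectors (both are choices made for the example, not taken from the slides):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(doc_vector, examples):
    """Score a new document by its single most similar remembered example."""
    return max(cosine(doc_vector, example) for example in examples)


def rank(new_docs, examples):
    """Order document names from best to worst score."""
    return sorted(new_docs, key=lambda name: score(new_docs[name], examples), reverse=True)


# Two remembered examples on unrelated topics: a "multimodal" profile.
examples = [[1, 0, 0], [0, 0, 1]]
new_docs = {"a": [1, 0.1, 0], "b": [0, 1, 0], "c": [0.5, 0, 0.5]}
ranking = rank(new_docs, examples)
print(ranking)
```

Because each document is scored against its nearest example rather than against an average of all examples, a document close to either interest ranks highly. This is why the method suits profiles with several distinct topics.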
Regression
• Fit a simple function to the observations
  – Straight line, s-curve, …
• Logistic regression matches binary relevance
  – S-curves fit better than straight lines
  – Same calculation as a 3-layer neural network
    • But with a different training technique
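To make the s-curve concrete, here is a minimal sketch of logistic regression on a single feature, trained by plain stochastic gradient descent. The feature (a similarity score per document), the training data, the learning rate, and the epoch count are all hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    """The s-curve: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, rate=0.1, epochs=2000):
    """Fit weight w and bias b so that sigmoid(w*x + b) approximates y."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted minus observed
            w -= rate * error * x
            b -= rate * error
    return w, b


# Hypothetical observations: similarity score and binary relevance judgment.
xs = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)


def predict(x):
    """Estimated probability that a document with score x is relevant."""
    return sigmoid(w * x + b)


print(round(predict(0.1), 3), round(predict(0.9), 3))
```

Unlike a straight line, the fitted curve never predicts a value below 0 or above 1, which matches a binary relevance judgment.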
Neural Networks
• Design inspired by the human brain
  – Neurons, connected by synapses
• Organized into layers for simplicity
  – Synapse strengths are learned from experience
    • Seek to minimize predict-observe differences
• Can be fast with special purpose hardware
  – Biological neurons are far slower than computers
• Work best with application-specific structures
  – In biology this is achieved through evolution
Social Filtering
• Exploit ratings from other users as features
  – Like personal recommendations, peer review, …
• Reaches beyond topicality to:
  – Accuracy, coherence, depth, novelty, style, …
• Applies equally well to other modalities
  – Movies, recorded music, …
• Sometimes called “collaborative” filtering
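One common way to use other users' ratings is a similarity-weighted average, sketched below. The slides do not prescribe a particular method, so this is an assumption made for illustration: users are compared by cosine similarity over the items both have rated, and the user names, items, and ratings are hypothetical.

```python
import math


def similarity(a, b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_rating(target, others, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum, total_weight = 0.0, 0.0
    for ratings in others:
        if item in ratings:
            weight = similarity(target, ratings)
            weighted_sum += weight * ratings[item]
            total_weight += weight
    # None means nobody comparable has rated the item yet.
    return weighted_sum / total_weight if total_weight else None


# Hypothetical ratings on a 1-5 scale.
target = {"m1": 5, "m2": 1}
others = [
    {"m1": 5, "m2": 1, "m3": 5},  # agrees with the target user
    {"m1": 1, "m2": 5, "m3": 1},  # disagrees with the target user
]
predicted = predict_rating(target, others, "m3")
print(round(predicted, 2))
```

The prediction is pulled toward the rating given by the like-minded user. An item that no other user has rated gets no prediction at all, which illustrates why purely social filtering cannot surface new items on its own.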
Implicit Feedback
• Observe user behavior to infer a set of ratings
  – Examine (reading time, scrolling behavior, …)
  – Retain (bookmark, save, save & annotate, print, …)
  – Refer to (reply, forward, include link, cut & paste, …)
• Some measurements are directly useful
  – e.g., use reading time to predict reading time
• Others require some inference
  – Should you treat cut & paste as an endorsement?
Some Things We Don’t Know
• How to use implicit feedback
  – No large scale integrated experiments yet
• How to protect privacy
  – Pseudonyms can be defeated fairly easily
• How to merge social and content-based filtering
  – Social filtering never finds anything new
• How to evaluate social filtering effectiveness