Evidence from Behavior LBSC 796/CMSC 828o Douglas W. Oard Session 5, February 23, 2004

Page 1

Evidence from Behavior

LBSC 796/CMSC 828o

Douglas W. Oard

Session 5, February 23, 2004

Page 2

Agenda

• Questions

• Observable Behavior

• Information filtering

Page 3

Some Observable Behaviors

• View, Listen, Select
• Print, Bookmark, Save, Purchase, Delete, Subscribe
• Copy / paste, Quote, Forward, Reply, Link, Cite
• Mark up, Rate, Publish, Organize

Page 4

Behavior Category

• Examine: View, Listen, Select
• Retain: Print, Bookmark, Save, Purchase, Delete, Subscribe
• Reference: Copy / paste, Quote, Forward, Reply, Link, Cite
• Annotate: Mark up, Rate, Publish, Organize

Page 5

Behavior Category by Minimum Scope

Minimum Scope columns: Segment, Object, Class

• Examine: View, Listen, Select
• Retain: Print, Bookmark, Save, Purchase, Delete, Subscribe
• Reference: Copy / paste, Quote, Forward, Reply, Link, Cite
• Annotate: Mark up, Rate, Publish, Organize

Page 6

Some Examples

• Read/Ignored, Saved/Deleted, Replied to

(Stevens, 1993)

• Reading time

(Morita & Shinoda, 1994; Konstan et al., 1997)

• Hypertext Link

(Brin & Page, 1998)

Page 7

Estimating Authority from Links

[Figure: link graph with nodes labeled Authority, Authority, and Hub]
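The slide does not name an algorithm, but hub and authority scores are commonly estimated with the iterative HITS procedure. A minimal sketch, assuming a tiny hypothetical link graph (the page names are made up for illustration):

```python
def hits(links, iterations=50):
    """Estimate hub and authority scores for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: one page links to two others, a second page links to one.
links = {"hub": ["a1", "a2"], "other": ["a1"]}
hub_scores, auth_scores = hits(links)
```

Here "a1" ends up with the highest authority because both linking pages point to it, and "hub" gets the highest hub score because it points to both authorities.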

Page 8

Collecting Click Streams

• Browsing histories are easily captured
  – Make all links initially point to a central site
    • Encode the desired URL as a parameter
  – Build a time-annotated transition graph for each user
    • Cookies identify users (when they use the same machine)
  – Redirect the browser to the desired page
• Reading time is correlated with interest
  – Can be used to build individual profiles
  – Used to target advertising by doubleclick.com
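The redirect scheme above can be sketched in a few lines. This is only an illustration: the central site URL, the `url` parameter name, and the cookie value are assumptions, and the time between consecutive clicks is used as a rough stand-in for reading time.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlencode, urlparse

CENTRAL = "https://tracker.example/r"  # hypothetical central redirect site

def rewrite_link(target_url):
    """Point a link at the central site, with the desired URL as a parameter."""
    return CENTRAL + "?" + urlencode({"url": target_url})

def handle_click(request_url, user_cookie, timestamp, log):
    """Record the click for this user and return the URL to redirect to."""
    target = parse_qs(urlparse(request_url).query)["url"][0]
    log[user_cookie].append((timestamp, target))
    return target

def transition_graph(clicks):
    """Time-annotated transitions: (from_page, to_page, seconds spent on from_page)."""
    clicks = sorted(clicks)
    return [(a, b, t2 - t1) for (t1, a), (t2, b) in zip(clicks, clicks[1:])]

log = defaultdict(list)
handle_click(rewrite_link("http://example.org/a?x=1"), "cookie42", 0, log)
handle_click(rewrite_link("http://example.org/b"), "cookie42", 30, log)
edges = transition_graph(log["cookie42"])
```

Because the cookie is the only user identifier, the same person on a different machine would appear as a different user, as the slide notes.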

Page 9

[Figure: bar chart of reading time (seconds, axis 0 to 180) by rating (No Interest, Low Interest, Moderate Interest, High Interest) for full-text articles (Telecommunications). Bar labels as extracted: 50, 32, 58, 43.]

Page 10

More Complete Observations

• User selects an article
  – Interpretation: Summary was interesting
• User quickly prints the article
  – Interpretation: They want to read it
• User selects a second article
  – Interpretation: Another interesting summary
• User scrolls around in the article
  – Interpretation: Parts with high dwell time and/or repeated revisits are interesting
• User stops scrolling for an extended period
  – Interpretation: User was interrupted

Page 11

[Figure: bar chart of reading time (axis 0 to 200) by rating (No Interest, Low Interest, Moderate Interest, High Interest) for abstracts (Pharmaceuticals). Bar labels as extracted: 42, 55, 52, 51.]

Page 12

Information Access Problems

• Retrieval: the collection is stable; the information need is different each time
• Filtering: the information need is stable; the collection is different each time

Page 13

Information Filtering

[Diagram components: New Documents, User Profile, Matching, Recommendation, Rating]

Page 14

Information Filtering

• An abstract problem in which:
  – The information need is stable
    • Characterized by a “profile”
  – A stream of documents is arriving
    • Each must either be presented to the user or not
• Introduced by Luhn in 1958
  – As “Selective Dissemination of Information”
• Named “Filtering” by Denning in 1975

Page 15

A Simple Filtering Strategy

• Use any information retrieval system
  – Boolean, vector space, probabilistic, …
• Have the user specify a “standing query”
  – This will be the profile
• Limit the standing query by date
  – Each use, show what arrived since the last use
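The strategy above can be sketched with a toy Boolean matcher standing in for "any information retrieval system". The class name, the document fields (`arrived`, `text`), and the sample documents are assumptions made for illustration.

```python
from datetime import datetime

def matches(query_terms, text):
    """Toy Boolean AND retrieval: every query term must occur in the text."""
    words = set(text.lower().split())
    return all(term in words for term in query_terms)

class StandingQuery:
    """A stored query (the profile) that remembers when it was last used."""

    def __init__(self, terms):
        self.terms = [t.lower() for t in terms]
        self.last_use = datetime.min  # nothing has been shown yet

    def check(self, documents, now):
        """Return matching documents that arrived since the last use."""
        hits = [d for d in documents
                if self.last_use < d["arrived"] <= now
                and matches(self.terms, d["text"])]
        self.last_use = now
        return hits

docs = [
    {"arrived": datetime(2004, 2, 20), "text": "information filtering survey"},
    {"arrived": datetime(2004, 2, 22), "text": "cooking recipes"},
    {"arrived": datetime(2004, 2, 24), "text": "Filtering of information streams"},
]
profile = StandingQuery(["information", "filtering"])
first = profile.check(docs, datetime(2004, 2, 23))
second = profile.check(docs, datetime(2004, 2, 25))
```

The second call returns only the document that arrived after the first call, which is the date limit the slide describes.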

Page 16

Social Filtering

• Exploit ratings from other users as features
  – Like personal recommendations, peer review, …
• Reaches beyond topicality to:
  – Accuracy, coherence, depth, novelty, style, …
• Applies equally well to other modalities
  – Movies, recorded music, …
• Sometimes called “collaborative” filtering

Page 17

Rating-Based Recommendation

• Use ratings to describe objects
  – Personal recommendations, peer review, …
• Beyond topicality:
  – Accuracy, coherence, depth, novelty, style, …
• Has been applied to many modalities
  – Books, Usenet news, movies, music, jokes, beer, …

Page 18

Using Positive Information

Columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear

Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A

Source: Jon Herlocker, SIGIR 1999

Page 19

Using Negative Information

Columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear

Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A

Source: Jon Herlocker, SIGIR 1999
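One common way to use both positive and negative information is to weight each other rater by their Pearson correlation with the target user, so that a rater with opposite tastes counts as evidence in the opposite direction. The sketch below is not taken from the slides: the users, rides, and grades are hypothetical, and letter grades are mapped to numbers (A=4 through F=0) as an assumption.

```python
GRADE = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}  # assumed numeric scale

def grades(**letters):
    """Convert keyword arguments like ride1="A" into numeric ratings."""
    return {item: GRADE[g] for item, g in letters.items()}

def pearson(u, v):
    """Pearson correlation over items both users rated (0.0 if undefined)."""
    common = [i for i in u if i in v]
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = (sum((u[i] - mu) ** 2 for i in common)
           * sum((v[i] - mv) ** 2 for i in common)) ** 0.5
    return num / den if den else 0.0

def predict(target, others, item):
    """Target's mean plus correlation-weighted deviations of other raters."""
    t_mean = sum(target.values()) / len(target)
    num = den = 0.0
    for other in others:
        if item not in other:
            continue
        w = pearson(target, other)
        o_mean = sum(other.values()) / len(other)
        num += w * (other[item] - o_mean)
        den += abs(w)
    return t_mean + num / den if den else t_mean

# Hypothetical users: one agrees with the target, one disagrees.
target = grades(ride1="D", ride2="A", ride3="B")
similar = grades(ride1="D", ride2="A", ride3="C", ride4="A")
opposite = grades(ride1="A", ride2="F", ride3="D", ride4="F")
target_mean = sum(target.values()) / len(target)
from_positive = predict(target, [similar], "ride4")
from_negative = predict(target, [opposite], "ride4")
```

Both predictions land above the target's mean: the similar user liked ride4, and the opposite user disliked it, which the negative weight turns into a recommendation.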

Page 20

The Cold Start Problem

• Social filtering will not work in isolation
  – Without ratings, we get no recommendations
  – Without recommendations, we read nothing
  – Without reading, we get no ratings
• An initial recommendation strategy is needed
  – Stereotypes
  – Content-based search
• The need for both leads to hybrid strategies

Page 21

Some Things We (Sort of) Know

• Treating each genre separately can be useful
  – Separate predictions for separate tastes
• Negative information can be useful
  – “I hate everything my parents like”
• People like to know who provided ratings
• Popularity provides a useful fallback
• People don’t like to provide ratings
  – Few experiments have achieved sufficient scale

Page 22

Challenges

• Any form of sharing necessarily incurs:
  – Distribution costs
  – Privacy concerns
  – Competitive concerns
• Requiring explicit ratings also:
  – Increases the cognitive load on users
  – Can adversely affect ease-of-use

Page 23

Motivations to Provide Ratings

• Self-interest
  – Use the ratings to improve system’s user model
• Economic benefit
  – If a market for ratings is created
• Altruism

Page 24

The Problem With Self-Interest

[Figure: value of ratings plotted against number of ratings (few to lots), with curves for marginal value to rater, marginal value to community, and marginal cost]

Page 25

Solving the Cost vs. Value Problem

• Maximize the value
  – Provide for continuous user model adaptation
• Minimize the costs
  – Use implicit feedback rather than explicit ratings
  – Minimize privacy concerns through encryption
  – Build an efficient scalable architecture
  – Limit the scope to noncompetitive activities

Page 26

Solution: Reduce the Marginal Cost

[Figure: curves for marginal value to rater, marginal value to community, and marginal cost plotted against number of ratings (few to lots)]

Page 27

Implicit Feedback

• Observe user behavior to infer a set of ratings
  – Examine (reading time, scrolling behavior, …)
  – Retain (bookmark, save, save & annotate, print, …)
  – Refer to (reply, forward, include link, cut & paste, …)
• Some measurements are directly useful
  – e.g., use reading time to predict reading time
• Others require some inference
  – Should you treat cut & paste as an endorsement?
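As one illustration of inferring ratings from an observation, the sketch below fits a straight line from reading time to rating using items the user rated explicitly, then applies it to unrated items. The linear model, the 0 to 3 rating scale (no, low, moderate, high interest), and the sample data are all assumptions; the slides do not prescribe a particular estimator.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def estimate_rating(reading_seconds, model, low=0.0, high=3.0):
    """Estimate a rating from reading time, clamped to the rating scale."""
    a, b = model
    return max(low, min(high, a + b * reading_seconds))

# Hypothetical history: (reading time in seconds, explicit rating 0..3).
history = [(20, 0), (40, 1), (60, 2), (80, 3)]
model = fit_line([t for t, _ in history], [r for _, r in history])
inferred = estimate_rating(50, model)
capped = estimate_rating(500, model)
```

The clamp matters for cases like the interrupted reader mentioned earlier in the deck: a very long dwell time should not produce a rating beyond the top of the scale.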

Page 28

Recommending w/Implicit Feedback

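[Diagram, two architectures, component labels as extracted:
 Ratings Server: Estimate Rating, User Model, Ratings Server, User Ratings, Community Ratings, Predicted Ratings, User Observations
 Observations Server: User Ratings, User Model, Estimate Ratings, Observations Server, Predicted Observations, Community Observations, Predicted Ratings, User Observations]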

Page 29

Beyond Information Filtering

• Citation indexing
  – Exploits reference behavior
• Search for people based on their behavior
  – Discovery of potential collaborators
• Collaborative data mining in large collections
  – Discoveries migrate to people with similar interests

Page 30

Relevance Feedback

[Diagram, labels as extracted:
 Processes: Make Profile Vector, Compute Similarity, Select and Examine (user), Assign Ratings (user), Update User Model, Make Document Vectors
 Data flows: Initial Profile Terms; New Documents; Vector; Vectors; Documents, Vectors, Rank Order; Document, Vector; Rating, Vector; Vector(s)]

Page 31

Rocchio Formula

new profile vector = 1.0 × (original profile vector) + 0.5 × (positive feedback vector) - 0.25 × (negative feedback vector)

                          Vector           Weight   Weighted vector
Original profile          0 4 0 8 0 0      1.0       0  4  0  8  0  0
Positive feedback  (+)    2 4 8 0 0 2      0.5       1  2  4  0  0  1
Negative feedback  (-)    8 0 4 4 0 16     0.25      2  0  1  1  0  4
New profile                                         -1  6  3  7  0 -3
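The slide's worked example can be checked with a short sketch. The vectors and the weights 1.0, 0.5, and 0.25 are the slide's numbers; the function and parameter names are illustrative.

```python
def rocchio(original, positive, negative, w_orig=1.0, w_pos=0.5, w_neg=0.25):
    """Weighted original profile, plus positive feedback, minus negative feedback."""
    return [w_orig * o + w_pos * p - w_neg * n
            for o, p, n in zip(original, positive, negative)]

original = [0, 4, 0, 8, 0, 0]
positive = [2, 4, 8, 0, 0, 2]
negative = [8, 0, 4, 4, 0, 16]
new_profile = rocchio(original, positive, negative)
# new_profile is [-1.0, 6.0, 3.0, 7.0, 0.0, -3.0], matching the slide
```

Negative components such as the first and last terms are allowed to remain in this sketch; some systems clip them to zero, which is a design choice the slide does not address.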

Page 32

Supervised Learning

• Given a set of vectors with associated values
  – e.g., term vectors with relevance judgments
• Predict the values associated with new vectors
  – i.e., learn a mapping from vectors to values
• All learning systems share two problems
  – They need some basis for making predictions
    • This is called an “inductive bias”
  – They must balance adaptation with generalization

Page 33

Machine Learning Approaches

• Hill climbing (Rocchio)

• Instance-based learning

• Rule induction

• Regression

• Neural networks

• Genetic algorithms

• Statistical classification

Page 34

Statistical Classification

• Represent relevant docs as one random vector
  – And nonrelevant docs as another
• Build a statistical model for each distribution
  – e.g., model each with mean and covariance
• Find the surface separating the distributions
  – e.g., a hyperplane for linear discriminant analysis
• Rank documents by distance from that surface
  – Possibly based on the shape of the distributions
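A minimal sketch of the steps above, under simplifying assumptions that are not in the slides: the two classes share a diagonal covariance (one pooled variance per term), class priors are equal, and the training vectors are hypothetical two-term documents.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pooled_variance(groups, means, floor=1e-6):
    """Per-dimension variance pooled over both classes (diagonal covariance)."""
    total = sum(len(g) for g in groups)
    var = []
    for d in range(len(means[0])):
        ss = sum((v[d] - m[d]) ** 2 for g, m in zip(groups, means) for v in g)
        var.append(max(ss / total, floor))
    return var

def fit_discriminant(relevant, nonrelevant):
    """Return (w, bias) for the hyperplane w . x + bias = 0 between the classes."""
    mr, mn = mean(relevant), mean(nonrelevant)
    var = pooled_variance([relevant, nonrelevant], [mr, mn])
    w = [(a - b) / s for a, b, s in zip(mr, mn, var)]
    midpoint = [(a + b) / 2 for a, b in zip(mr, mn)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias

def score(doc, model):
    """Signed distance from the hyperplane; positive means the relevant side."""
    w, bias = model
    norm = sum(wi * wi for wi in w) ** 0.5 or 1.0
    return (sum(wi * xi for wi, xi in zip(w, doc)) + bias) / norm

# Hypothetical training data: term 0 signals relevance, term 1 the opposite.
relevant = [[3, 0], [4, 1], [5, 0]]
nonrelevant = [[0, 3], [1, 4], [0, 5]]
model = fit_discriminant(relevant, nonrelevant)
new_docs = [[0, 4], [2, 2], [4, 0]]
ranked = sorted(new_docs, key=lambda d: score(d, model), reverse=True)
```

A full covariance matrix would let the separating surface account for correlated terms, at the cost of needing many more training documents to estimate it.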

Page 35

Rule Induction

• Automatically derived Boolean profiles
  – (Hopefully) effective and easily explained
• Specificity from the “perfect query”
  – AND terms in a document, OR the documents
• Generality from a bias favoring short profiles
  – e.g., penalize rules with more Boolean operators
  – Balanced by rewards for precision, recall, …
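The "perfect query" and the bias toward short profiles can be illustrated as follows. The scoring function (precision plus recall minus a per-operator penalty), the penalty value, and the sample documents are assumptions chosen for the sketch, not a rule induction algorithm from the slides.

```python
def perfect_query(relevant_docs):
    """One AND-clause per relevant document (all of its terms); clauses are ORed."""
    return [frozenset(doc.lower().split()) for doc in relevant_docs]

def matches(query, doc):
    """True if every term of at least one clause occurs in the document."""
    words = set(doc.lower().split())
    return any(clause <= words for clause in query)

def operators(query):
    """Count Boolean operators: ANDs inside clauses plus ORs between clauses."""
    ands = sum(len(clause) - 1 for clause in query if clause)
    ors = max(len(query) - 1, 0)
    return ands + ors

def score(query, relevant, nonrelevant, penalty=0.01):
    """Reward precision and recall, penalize each Boolean operator."""
    tp = sum(matches(query, d) for d in relevant)
    fp = sum(matches(query, d) for d in nonrelevant)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision + recall - penalty * operators(query)

relevant = ["privacy policy for recommender systems",
            "collaborative filtering recommender"]
nonrelevant = ["privacy law in europe"]

specific = perfect_query(relevant)            # matches exactly the training docs
general = [frozenset({"recommender"})]        # a much shorter candidate profile
unseen = "a new recommender paper"
```

Both profiles separate the training documents perfectly, but the shorter one scores higher and also matches the unseen document, which is the generalization the bias is meant to buy.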

Page 36

Training Strategies

• Overtraining can hurt performance
  – Performance on training data rises and plateaus
  – Performance on new data rises, then falls
• One strategy is to learn less each time
  – But it is hard to guess the right learning rate
• Splitting the training set is a useful alternative
  – Part provides the content for training
  – Part for assessing performance on unseen data
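A sketch of the split strategy: hold out part of the training set and stop training once performance on the held-out part stops rising. The function names, the 25% held-out fraction, and the simulated performance curve are assumptions for illustration; stopping at the first non-improving round is a simplification.

```python
import random

def split_training_set(examples, held_out_fraction=0.25, seed=0):
    """Shuffle, then split into a training part and a held-out part."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_fraction))
    return shuffled[:cut], shuffled[cut:]

def train_until_held_out_peaks(train_step, evaluate, train, held_out, max_rounds=100):
    """Run training rounds until the held-out score stops improving.

    Returns (best_round, best_score).
    """
    best_score, best_round = float("-inf"), 0
    for round_no in range(1, max_rounds + 1):
        train_step(train)
        current = evaluate(held_out)
        if current <= best_score:
            break  # performance on unseen data has started to fall
        best_score, best_round = current, round_no
    return best_round, best_score

# Simulated learner whose held-out performance rises, peaks at round 3, then falls.
state = {"rounds": 0}

def train_step(data):
    state["rounds"] += 1

def evaluate(data):
    return -(state["rounds"] - 3) ** 2

train, held_out = split_training_set(range(8))
best_round, best_score = train_until_held_out_peaks(train_step, evaluate, train, held_out)
```

In a real system the model from the best round would be saved and restored, since one extra round has already been applied by the time the loop stops.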

Page 37

Critical Issues

• Protecting privacy
  – What absolute assurances can we provide?
  – How can we make remaining risks understood?
• Scalable rating servers
  – Is a fully distributed architecture practical?
• Non-cooperative users
  – How can the effect of spamming be limited?

Page 38

Gaining Access to Observations

• Observe public behavior
  – Hypertext linking, publication, citing, …
• Policy protection
  – EU: Privacy laws
  – US: Privacy policies + FTC enforcement
• Architectural assurance of privacy
  – Distributed architecture
  – Model and mitigate privacy risks

Page 39

A More Secure Data Flow

[Diagram labels: Item, Behavior, Feature, Recommendation; matrices Recommendations (IxR), Personal Features (IxF), Behaviors (IxB), Community Features (IxF)]

Page 40

Low Entropy Attack

[Diagram labels: Community Features (IxF), side information, adversary, IxB for user U]

• Solution space
  – Read access to IxF requires minimum number of unique contributors
    • Cryptographic data structure support
    • Controlled mixing
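The minimum-contributor rule can be sketched as a gate on reads. This reads IxF as an item-by-feature matrix, which is an interpretation of the slide's notation; the class, the threshold of 3, and the sample values are hypothetical, and the cryptographic support the slide mentions is not modeled.

```python
class CommunityFeatures:
    """Item-by-feature totals readable only after enough distinct users contribute."""

    def __init__(self, min_contributors=3):
        self.min_contributors = min_contributors
        self.totals = {}        # (item, feature) -> summed value
        self.contributors = {}  # (item, feature) -> set of user ids

    def contribute(self, user, item, feature, value):
        key = (item, feature)
        self.totals[key] = self.totals.get(key, 0.0) + value
        self.contributors.setdefault(key, set()).add(user)

    def read(self, item, feature):
        """Return the total, or None while too few distinct users have contributed."""
        key = (item, feature)
        if len(self.contributors.get(key, ())) < self.min_contributors:
            return None  # a value backed by few users could be tied to one of them
        return self.totals[key]

cf = CommunityFeatures(min_contributors=3)
cf.contribute("u1", "item7", "liked", 1.0)
cf.contribute("u1", "item7", "liked", 1.0)   # same user again: still one contributor
too_early = cf.read("item7", "liked")
cf.contribute("u2", "item7", "liked", 1.0)
cf.contribute("u3", "item7", "liked", 1.0)
released = cf.read("item7", "liked")
```

Keeping contributor identifiers in the clear, as this sketch does, would itself be a privacy risk in practice, which is presumably where the cryptographic data structures come in.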

Page 41

Matrix Difference Attack

[Diagram labels: Community Features (IxF), Community Features (IxF)’, matrix difference (IxF) - (IxF)’, user U, adversary, IxB for user U]

• Solution space
  – Users can’t control “next hop”
  – Routing can hide real source and destination

Page 42

Identity Integrity Attack

[Diagram labels: Community Features (IxF), Community Features (IxF)’, matrix difference (IxF) - (IxF)’, user U, four adversaries, IxB for user U]

• Solution space
  – Registrar service
    • Blinded Credentials
    • Attribute Membership Credentials

Page 43

One Minute Paper

• What do you think is the most significant factor that limits the utility of recommender systems?

• What was the muddiest point in today’s lecture?