The Mechanical Librarian Recommending Journal Articles in a Scientific Digital Library Andre Vellino [email protected] Group Leader, CISTI Research Canada Institute for Scientific and Technical Information Chef de groupe, Recherche ICIST Institut canadien de l'information scientifique et technique

Page 1: Mechanical Librarian

The Mechanical Librarian

Recommending Journal Articles

in a Scientific Digital Library

Andre Vellino

[email protected]

Group Leader, CISTI Research

Canada Institute for Scientific and Technical Information

Chef de groupe, Recherche ICIST

Institut canadien de l'information scientifique et technique

Page 2: Mechanical Librarian

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Acknowledgements to: Glen Newton, Jeff Demaine and Greg Kresko &

Students : Dave Zeber, Matthew Rutledge-Taylor and Aurel Constantinescu

Page 3: Mechanical Librarian

The Human (Reference)

Librarian

World Knowledge Experience

Authoritative

Trustworthy

References

Vocabularies

Databases

Page 4: Mechanical Librarian

The Mechanical Librarian

The Web, they say, is leaving the era of search and entering

one of discovery. What's the difference? Search is what you

do when you're looking for something. Discovery is when

something wonderful that you didn't know existed, or didn't

know how to ask for, finds you.

Jeffrey M. O'Brien, Fortune Magazine

Page 5: Mechanical Librarian

Knowledge Discovery

Technologies

• Text Mining

– Enhances the researcher’s ability to

discover new and meaningful information

from existing text repositories

• Network Analysis

– Distills the structural relationships among

bibliographic elements to reveal trends

and patterns in science

• User Behaviour

– Infers “wisdom of the crowds” from

usage statistics

Page 6: Mechanical Librarian

What is a “Recommender”?

• A recommender is a software system which attempts to predict

items that a user may be interested in, given information about

– the user's interests

– the content in the items

– the usage patterns of other users

• Items may be:

– Merchandise: movies, music, books

– Text: news, blogs, web pages, and, why not,

Scientific Journal Articles

Page 7: Mechanical Librarian

Amazon Recommender

System

Page 8: Mechanical Librarian

Explanations

User Ratings

Category Filter

Personalized

User

Control

Page 9: Mechanical Librarian

Companies That Offer

Recommenders to Users

Books

Web Sites

Movies

Music

Page 10: Mechanical Librarian

Companies That Sell

Recommender Services

Product Merchandise Placement

Database Mining

Advertising / Product Placement

Software as a Service Platform

Page 11: Mechanical Librarian

Recommendation is Hard

Netflix Prize: $1M

• Netflix Prize

– To develop a recommender that improves quality of

recommendations by 10% over Netflix’s

– http://www.netflixprize.com/

• Current Leader Board

– BellKor (9.6%)

– … + 39 others

• NY Times Magazine Article

http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html

Page 12: Mechanical Librarian

Good Recommendations

are REALLY Hard

Page 13: Mechanical Librarian

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Demonstration of Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Page 14: Mechanical Librarian

Taxonomy of

Recommender Systems

Collaborative Filtering

• Usage based, with item-ratings

– User-Based (“similar users”)

– Item-Based (“like items”)

• Algorithms

– Memory-based

– Model-based

Content-Based Filtering

• Content (text / waveform / pixel) analysis to

– Find “similar users”

– Find “similar items”

J. Breese, D. Heckerman, C. Kadie, et al. Empirical Analysis of Predictive Algorithms for Collaborative

Filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 461, 1998.

Page 15: Mechanical Librarian

How Collaborative

Filtering (CF) Works

• User-Based CF

– Given user A find all the other users {U} that have the most

“similar” item-rating patterns

– For each item I not yet rated by A, predict the likely rating A will

assign to I given the ratings for I given by {U}

– Present the Top-N ordered list of items {I} to the user

• Item-Based CF

– Given user A and the set of items {I} to which A has given

ratings, find all the other items {O} that are “similar” to {I}

– Present the Top-N ordered list of items {O} to the user

Sarwar, Badrul M., George Karypis, Joseph A. Konstan, and John Riedl. "Item-based

collaborative filtering recommendation algorithms." World Wide Web. 2001, 285-295.
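
The item-based variant above can be sketched as follows. This is a minimal illustration, not the algorithm from the cited paper: the ratings, the names `item_vectors` and `item_based_top_n`, and the choice of cosine similarity between item columns are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}. Values are illustrative only.
ratings = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5},
    "u3": {"a": 1, "c": 5, "d": 4},
    "u4": {"b": 4, "c": 2, "d": 1},
}

def item_vectors(ratings):
    """Transpose user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, row in ratings.items():
        for item, r in row.items():
            items.setdefault(item, {})[user] = r
    return items

def cosine(v, w):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v[k] * w[k] for k in v if k in w)
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

def item_based_top_n(user, ratings, n=3):
    """Score each unrated item by its similarity to the items the user rated."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for cand in items:
        if cand in seen:
            continue
        # Weight each similarity by the rating the user gave the known item.
        scores[cand] = sum(cosine(items[cand], items[i]) * r for i, r in seen.items())
    return sorted(scores, key=scores.get, reverse=True)[:n]

top = item_based_top_n("u2", ratings)
```

Here `top` holds the items {O} that user `u2` has not rated, ordered by how similar their rating columns are to the items `u2` did rate.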

Page 16: Mechanical Librarian

Find “Nearest Neighbour”

and Predict Rating

• Find Nearest Neighbours (e.g. cosine similarity)

• Predict Rating (item i for user u)

– Weighted average of the ratings given to item i by the N users most similar to user u
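
The two steps on this slide can be sketched as below. The formulas on the original slide did not survive extraction, so this is one common formulation rather than the slide's own: cosine similarity is computed over co-rated items only, and the prediction is a similarity-weighted average. The rating values and the function names are assumptions for illustration.

```python
from math import sqrt

# Illustrative ratings (user -> {movie: rating}); the values are assumptions.
ratings = {
    "Ted":   {"Milk": 4, "Doubt": 4},
    "Alice": {"Milk": 5, "Doubt": 4, "Dark Night": 3, "Bolt": 5},
    "Carol": {"Milk": 4, "Doubt": 3, "Bolt": 4},
    "Bob":   {"Milk": 1, "Doubt": 5},
}

def cosine(a, b):
    """Cosine similarity over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = sqrt(sum(a[i] ** 2 for i in common))
    nb = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict(user, item, ratings, k=2):
    """Similarity-weighted average of the k nearest neighbours' ratings of item."""
    neighbours = [
        (cosine(ratings[user], row), row[item])
        for other, row in ratings.items()
        if other != user and item in row  # only users who rated the item
    ]
    top = sorted(neighbours, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None  # no usable neighbours
    return sum(sim * r for sim, r in top) / total

p = predict("Ted", "Bolt", ratings)
```

With these made-up values, Alice and Carol are the only users who rated "Bolt", so the prediction for Ted falls between their two ratings.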

Page 17: Mechanical Librarian

User-Based

Collaborative Filtering

[Figure: ratings matrix with users (Ted, Alice, Carol, Bob) as rows and movies (Bolt, Dark Night, Doubt, Milk, Reader) as columns. Ted's rating for “Bolt” is the unknown entry “?”.]

• Goal: predict the rating Ted will give to the movie “Bolt”

• Step 1 – eliminate the user-profiles of users who didn’t rate “Bolt”

• Step 2 – find Ted’s “K-nearest neighbours” who rated “Bolt” and at least 2 other movies (Alice)

• R(Ted,Bolt) ~= 5.

Page 18: Mechanical Librarian

Things that can go wrong

with Collaborative Filtering

• False “product ratings” to artificially boost ranking (spamming)

• Losing the diversity in the “Long Tail” – converges to “Top N”.

Fleder, D. and K. Hosanagar. 2008. Blockbuster culture's next rise or fall: The effect of

recommender systems on sales diversity. NET Institute Working Paper 07-10.

Page 19: Mechanical Librarian

Content-Based

Recommenders

“These things are similar (in content) to that”.

• Depends only on a measure of similarity between the content in

the items (text, music, images)

• Typical Steps for Content Based Recommenders

1. Cluster the user’s purchased or highly-rated items by

content-similarity

2. Find other similar items not purchased or rated by the user

3. Recommend the “Top N” to the user
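
A minimal sketch of these steps, under simplifying assumptions: the item texts are invented, similarity is cosine over raw word counts, and step 1 is reduced to merging the liked items into a single profile instead of clustering them. The names `bag_of_words` and `recommend` are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical item descriptions; a real system would use abstracts or full text.
docs = {
    "p1": "collaborative filtering for movie ratings",
    "p2": "item based collaborative filtering algorithms",
    "p3": "protein folding simulation methods",
    "p4": "neighbourhood models for collaborative filtering",
    "p5": "molecular simulation of protein structure",
}

def bag_of_words(text):
    """Word-count vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(liked, docs, n=2):
    """Rank items the user has not seen by similarity to a profile of liked items."""
    profile = Counter()
    for item in liked:
        profile.update(bag_of_words(docs[item]))
    scores = {
        item: cosine(profile, bag_of_words(text))
        for item, text in docs.items()
        if item not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

top = recommend(["p1"], docs)
```

No other user's behaviour enters the computation: the ranking depends only on the text of the items.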

Page 20: Mechanical Librarian

Search Engine as

“Content-Based

Recommender”

Collaborative filtering

Page 21: Mechanical Librarian

“Similar Pages” is a

Content-Based

Recommender

Page 22: Mechanical Librarian

What can go wrong with

Content Based

Recommenders

that use only Metadata

• Bad Men Do What Good Men Dream: A Forensic Psychiatrist Illuminates the Darker Side of Human Behavior

• Do Animals Dream?: Children's Questions about Animals Most Often asked of the Natural History Museum

• All I Do is Dream of You

• The other end of the leash : why we do what we do around dogs

• Why do Catholics do that : a guide to the teachings and practices of the Catholic Church

• Electric universe : the shocking true story of electricity

• The Island of Sheep
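
The failure mode can be reproduced with a word-overlap measure on titles alone. The seed title below is an assumption (the slide does not state which item produced this list); the candidate titles are taken from the slide, and `title_words` and `jaccard` are hypothetical helper names.

```python
def title_words(title):
    """Lower-case a title and split it into a set of alphabetic words."""
    cleaned = "".join(c if c.isalpha() else " " for c in title.lower())
    return set(cleaned.split())

def jaccard(a, b):
    """Jaccard overlap between the word sets of two titles."""
    wa, wb = title_words(a), title_words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Assumed seed title (not stated on the slide); candidates come from the slide.
seed = "Do Androids Dream of Electric Sheep?"
candidates = [
    "Do Animals Dream?",
    "Electric universe : the shocking true story of electricity",
    "The Island of Sheep",
]
ranked = sorted(candidates, key=lambda t: jaccard(seed, t), reverse=True)
```

Every candidate gets a non-zero score because it shares a few title words with the seed, even though none of them is about the same subject.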

Page 23: Mechanical Librarian

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Demonstration of Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Page 24: Mechanical Librarian

Value of Recommenders

in a Digital Library

• For the Researcher

– Provide serendipity in a Browse / Search / Retrieve portal

• Broaden scope of search to cognate but otherwise disparate domains

• For the Library

– Increase customer loyalty by creating dynamic, adaptive,

customized services

• Alerts & notifications based on usage and collaborative filtering rather

than stored queries

• For Authors

– Given a draft article (with citations), find additional citations

• For Publishers & Journal Reviewers

– Given a submitted article, recommending peer-reviewers

Page 25: Mechanical Librarian

Recommender Systems in

Digital Libraries

– TechLens (University of Minnesota) (2002)

• Uses ACM DL, full text Mixed Hybrid

– BibTip (University of Karlsruhe) (2003)

• Uses OPAC (Library Catalog) usage data for collaborative filtering

– IngentaConnect (2007)

• Uses Baynote (SaaS) customer tracking

– DSpace (2008)

• Content-based recommender based on user-bookmarks

– CiteULike (academic experiment 2008)

• Collaborative filtering on user bookmarks from CiteULike

– “bX” system from Ex Libris (2009)

• Uses SFX resolver logs

– NextBio (to be announced in March 2009)

• Life sciences search engine that uses collaborative filtering + ontologies to suggest new content (trials / abstracts / data)

Page 26: Mechanical Librarian

TechLens

Page 27: Mechanical Librarian

“bX”

Recommender (Jan '09)

Features

• Uses log data from SFX resolvers

• Applies Collaborative Filtering

• Uses lots of aggregated data

• Developed w/ the Los Alamos National Laboratory.

Possible issues

• Infers identity of users only through IP address

• May not be accurate when http proxies are used

• Same IP address can have several “IR objectives”

• Identical resolved objects may not be recognized

Page 28: Mechanical Librarian

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Demonstration of Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Page 29: Mechanical Librarian

Typical Problems with CF

Recommenders in General

• Data Sparsity

– Ratio of Users / Items is low (~ 1:10)

– Number of Ratings per User is low

– Ratings matrix sparsity ~ 95%

• Cold Start Problem

– First-time users get poor or no recommendations because CF matrix has no entries

• Rating Items

– CF recommender must be trained (explicitly or implicitly) by providing ratings to items

• Principle of Induction

– People who exhibited similar behaviour in the past will tend to exhibit similar behaviour in the future.
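
To make the sparsity figure concrete, here is a small sketch that computes the fraction of empty cells in a ratings matrix. The data is synthetic and chosen only to match the ratios quoted above (10 users, 100 items, 5 ratings each); `sparsity` is a hypothetical helper name.

```python
def sparsity(ratings, n_items):
    """Fraction of empty cells in a users x items ratings matrix."""
    n_users = len(ratings)
    filled = sum(len(row) for row in ratings.values())
    return 1.0 - filled / (n_users * n_items)

# Synthetic example: 10 users, 100 items (ratio 1:10), 5 ratings per user.
ratings = {
    f"u{u}": {f"i{(u * 7 + k) % 100}": 1 for k in range(5)}
    for u in range(10)
}
s = sparsity(ratings, n_items=100)
```

With 50 ratings spread over 1,000 cells, `s` comes out at 0.95, the order of magnitude mentioned on the slide.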

Page 30: Mechanical Librarian

Specific Problems for

Collaborative Filtering in

Science Digital Libraries

• Data Sparsity

– Many More Articles & Far Fewer Users (10x)

– Fewer Item / Ratings (~ 99% sparsity)

• Rating Articles

– Explicit ratings are more difficult to obtain

• DL users have less need to “express themselves” by explicitly rating items than movie watchers

– Implicit ratings depend on UI features of DL

• No reliable method for inferring ratings from browsing and query behaviour

• Principle of Induction (that past is a good predictor of the future) not necessarily true in digital libraries

– Interest drift

– Context shifts

Page 31: Mechanical Librarian

Recommender Research

Strategy @ CISTI

• Follow in footsteps of TechLens+

– Collaborative Filtering (CF) among users

– Seed CF recommender with citation matrix

– Extended with

• PageRank on Citations

• User Contexts

– Future Extensions

• Add Content-Based Filtering (“Fusion Mixed Hybrid” model)

• Distributed Multi-Dimensional Recommender

• Explanation-based interface

A. Vellino and D. Zeber. (2007) “A Hybrid, Multi-dimensional Recommender for Journal

Articles in a Scientific Digital Library.” Conference Proceedings on Web Intelligence and

Intelligent Agent Technology

Page 32: Mechanical Librarian

Making a Reference Rating

Page 33: Mechanical Librarian

Recommender Citation

Seeding

• Articles either cite or don’t cite other articles

• Some articles that are cited are not in collection

• Users’ “article collection profile” citations

TechLens approach to Cold Start / Data Sparsity problem
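
One way to read citation seeding is sketched below: each citing article is treated as a pseudo-user that gives a Boolean "rating" to every article it cites, so the ratings matrix is non-empty before any real user has rated anything. The citation data, the function names, and the way a user's profile is derived from a basket of articles are assumptions for illustration, not a description of the TechLens or Synthese implementation.

```python
# Hypothetical citation lists: article -> articles it cites.
# "x9" stands for a cited article that is not in the collection.
citations = {
    "p1": ["p2", "p3"],
    "p2": ["p3", "x9"],
    "p3": [],
    "p4": ["p1", "p3"],
}

def seed_matrix(citations):
    """Treat each citing article as a pseudo-user that 'rates' what it cites.

    Returns {citing_article: {cited_article: 1}}, keeping only cited
    articles that are themselves in the collection.
    """
    collection = set(citations)
    return {
        citing: {cited: 1 for cited in refs if cited in collection}
        for citing, refs in citations.items()
    }

def profile_from_basket(basket, matrix):
    """Build a user's seed profile from the references of articles in a basket."""
    profile = {}
    for article in basket:
        profile.update(matrix.get(article, {}))
    return profile

matrix = seed_matrix(citations)
profile = profile_from_basket(["p1", "p2"], matrix)
```

The resulting matrix can be fed to any collaborative filtering routine that expects user-to-item ratings.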

Page 34: Mechanical Librarian

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Demonstration of Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Page 35: Mechanical Librarian

Synthese Recommender

on CISTI Lab

Page 36: Mechanical Librarian

Query Index

Page 37: Mechanical Librarian

Add Important Articles to

“Basket” (1)

Page 38: Mechanical Librarian

Add Important Articles to

“Basket” (2)

Page 39: Mechanical Librarian

Add Important Articles to

“Basket” (3)

Page 40: Mechanical Librarian

Add Important Articles to

“Basket” (4)

Page 41: Mechanical Librarian

Query Again

Page 42: Mechanical Librarian

Add More Articles to

“Basket” (1)

Page 43: Mechanical Librarian

Add More Articles to

“Basket” (2)

Page 44: Mechanical Librarian

Recommend Based on

Current “Basket”

Page 45: Mechanical Librarian

View Recommendations

Page 46: Mechanical Librarian

Evaluate Recommender

Page 47: Mechanical Librarian

Search and Basket

History

Page 48: Mechanical Librarian

Multiple Profiles

Page 49: Mechanical Librarian

Synthese Performance

[Figure: bar chart “Ratings of Recommendations”. x-axis: Ratings (1 to 5); y-axis: Percentage (0 to 35).]

Page 50: Mechanical Librarian

Recommender Citation

Seeding

Can we improve on 0 / 1 (Boolean) citation seeding?

Page 51: Mechanical Librarian

Apply PageRank to

Citation Matrix

Aurel Constantinescu “Ranking Full-Text Articles using Citation Based Methods”

Master’s Thesis, University of Ottawa

PageRank algorithm applied to citations

Page 52: Mechanical Librarian

PageRank-weighted

Citation matrix

• Apply Page Rank on Citations

– Use citation data (as in TechLens+)

– Apply PageRank to weight the citation-based “ratings”

• Done before but only at the Journal level (http://www.eigenfactor.org/)

[Figure: ratings matrix of users (u1, u2) against articles (p1 to p6), with entries between 0.2 and 0.7, shown next to a citation graph over articles. Labels: users, articles, citations, “= constant”.]
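
A minimal sketch of the idea, assuming the weight placed in the matrix is the PageRank of the cited article (the slide does not spell out the exact weighting). The citation graph, the damping factor of 0.85, and the function names are illustrative assumptions; PageRank itself is computed by plain power iteration.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over a citation graph (article -> cited articles)."""
    nodes = set(citations) | {c for refs in citations.values() for c in refs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            refs = citations.get(node, [])
            if refs:
                share = damping * rank[node] / len(refs)
                for cited in refs:
                    new[cited] += share
            else:
                # Dangling article (cites nothing): spread its rank evenly.
                for other in nodes:
                    new[other] += damping * rank[node] / n
        rank = new
    return rank

def weighted_seed(citations, rank):
    """Replace each Boolean citation entry with the cited article's PageRank."""
    return {
        citing: {cited: rank[cited] for cited in refs}
        for citing, refs in citations.items()
    }

# Hypothetical citation graph.
citations = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": [], "p4": ["p3"]}
rank = pagerank(citations)
seeded = weighted_seed(citations, rank)
```

In the seeded matrix a citation to a heavily cited article counts for more than a citation to a rarely cited one, instead of both counting as 1.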

Page 53: Mechanical Librarian

PageRank Experimental

Results

A. Vellino “The Effect of PageRank on the Collaborative Filtering of Journal Articles”

NRC Research Report, 2008.

Page 54: Mechanical Librarian

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Demonstration of Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Page 55: Mechanical Librarian

What is a Holographic

Memory System?

• A Holographic Memory System (HMS) stores information in

a manner analogous to the storage of an image on a

holographic plate.

• HMS is composed of units called items

– Each item represents some content

• e.g., a concept, a word, a bibliographic item

– Items are analogous to points on the surface of

holographic film (or, plate)

– Each item stores information about the associations it

has with other items

T. A. Plate, 2003 Holographic Reduced Representations: Distributed Representations for

Cognitive Structures (Stanford, CA: CSLI Publications)
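
The cited Holographic Reduced Representations store associations with circular convolution. The sketch below shows that operation in isolation; whether the recommender described in this talk uses exactly these steps is not stated here, so treat it as background. The vector size, the seed, and the names `bind` and `involution` are assumptions for the example.

```python
import random
from math import sqrt

def random_vector(n, rng):
    """Random vector with elements drawn from a normal with variance 1/n."""
    return [rng.gauss(0.0, 1.0 / sqrt(n)) for _ in range(n)]

def bind(a, b):
    """Circular convolution: associates a with b in a vector of the same size."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def involution(a):
    """Approximate inverse used for decoding: element i becomes a[-i mod n]."""
    n = len(a)
    return [a[(-i) % n] for i in range(n)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

rng = random.Random(0)
n = 256
apple, red, fruit = (random_vector(n, rng) for _ in range(3))

# Store the association (apple, red) in one trace, then probe it with "apple".
trace = bind(apple, red)
decoded = bind(involution(apple), trace)
```

Probing the trace with `apple` returns a noisy vector that is much closer to `red` than to an unrelated item such as `fruit`, which is the sense in which an item "stores information about its associations".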

Page 56: Mechanical Librarian

Holographic Memory

System (HMS)

Apple

Red

Spherical

Fruit

Each item stores information about

many other items in the system

HMS

Each point on the Holographic plate

stores information about many parts

of the image

Holography

Page 57: Mechanical Librarian

HMS Recommender for

Journal Articles

• We compared DSHM and user-based CF on journal article

recommendation on 2 small collections

• 90% - 10% Cross Validation

• systematically removed one reference at a time

• tested whether recommender predicts the reference.

• compared DSHM and user-based CF

Collection    Articles    References per article

Medicine      7,495       0.55

Biology       38,667      1.15

M. F. Rutledge-Taylor, A. Vellino and R. L. West. “A Holographic Associative Memory

Recommender System” 3rd Int. Conference on Digital Information Management, London, 2008.
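
The "remove one reference at a time" test can be sketched as a leave-one-out hit rate. This is an assumed reading of the protocol, and it omits the 90% / 10% split. The toy recommender is a simple shared-reference baseline standing in for either system under test, and the reference lists are invented.

```python
def leave_one_out_hit_rate(references, recommend, n=10):
    """Hide one reference at a time and check whether it is recommended back.

    references: article -> set of cited articles.
    recommend:  function(article, known_refs, references, n) -> list of articles.
    """
    hits = trials = 0
    for article, refs in references.items():
        for hidden in refs:
            known = refs - {hidden}
            trials += 1
            if hidden in recommend(article, known, references, n):
                hits += 1
    return hits / trials if trials else 0.0

def shared_reference_recommender(article, known, references, n):
    """Toy baseline: suggest what articles with overlapping references also cite."""
    scores = {}
    for other, refs in references.items():
        if other == article:
            continue
        overlap = len(known & refs)
        for cand in refs - known:
            scores[cand] = scores.get(cand, 0) + overlap
    ranked = sorted((c for c in scores if scores[c] > 0), key=lambda c: (-scores[c], c))
    return ranked[:n]

# Hypothetical reference lists.
references = {
    "a": {"x", "y"},
    "b": {"x", "y"},
    "c": {"z"},
}
rate = leave_one_out_hit_rate(references, shared_reference_recommender)
```

Article `c` has a single reference and no overlap with the others, so its hidden reference cannot be recovered; very sparse collections like those in the table above contain many such cases.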

Page 58: Mechanical Librarian

Experimental Results

Page 59: Mechanical Librarian

Holographic Recommender: Discussion

• Advantages

– Holographic System outperformed standard user-based CF on very sparse bibliographic datasets

– DSHM is better able to exploit the available information

– The uniformly consistent model of DSHM gives it good potential for success on multi-dimensional datasets

• Disadvantages

– Requires a lot of computational resources

– Unclear about how it works on a large scale.

Page 60: Mechanical Librarian

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Demonstration of Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Page 61: Mechanical Librarian

Multi-Dimensional Ratings

Matrix

G. Adomavicius, R. Sankaranarayanan, S. Sen, A. Tuzhilin. Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach. ACM Transactions on Information Systems, 2005.

Page 62: Mechanical Librarian

Scaling Strategy:

Distributed

Recommenders

• Multiple ratings matrices decomposed by subject area

• Merge separate recommendations by subject

• Reduces matrix sparsity

• Improves accuracy of recommendations

S. Berkovsky, T. Kuflik, and F. Ricci. Distributed Collaborative Filtering with Domain Specialization. Proceedings of Recommender Systems 2007.
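
The decomposition and merge steps can be sketched as follows. The subject labels, the data, and the merge rule (sorting all per-subject recommendations by score) are assumptions for illustration; a real system would need scores that are comparable across subject areas.

```python
def split_by_subject(ratings, subject_of):
    """Split user -> {item: rating} into one ratings matrix per subject area."""
    parts = {}
    for user, row in ratings.items():
        for item, r in row.items():
            parts.setdefault(subject_of[item], {}).setdefault(user, {})[item] = r
    return parts

def merge_top_n(per_subject, n=3):
    """Merge per-subject (item, score) lists into a single top-N list by score."""
    merged = [pair for recs in per_subject.values() for pair in recs]
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in merged[:n]]

# Hypothetical ratings and subject labels.
ratings = {"u1": {"p1": 1, "p2": 1}, "u2": {"p2": 1, "p3": 1}}
subject_of = {"p1": "biology", "p2": "medicine", "p3": "medicine"}
parts = split_by_subject(ratings, subject_of)

# Hypothetical output of two per-subject recommenders.
per_subject = {"biology": [("p4", 0.9), ("p5", 0.2)], "medicine": [("p6", 0.7)]}
merged = merge_top_n(per_subject, n=2)
```

Each per-subject matrix covers fewer items than the full matrix, which is where the reduction in sparsity comes from.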

Page 63: Mechanical Librarian

Importance of Quality and Trust

[Figure: bar chart “What predicts overall usefulness of a System?”. y-axis: Correlation (0 to 0.6); bars: Good Rec., Useful Rec., Trust Generating Rec., Adequate Item Description, Ease of Use.]

Rashmi Sinha & Kirsten Swearingen – UC Berkeley

Page 64: Mechanical Librarian

UI for Navigating

Recommendations

• Explanation-based

Recommendations

– Provide transparency and increase user trust

– Allow users to cluster by

type of reason

– Filter out unwanted

recommendations

P. Pu and L. Chen. Trust Building with Explanation Interfaces. In IUI ’06: Proceedings of

the 11th International Conference On Intelligent User Interfaces, pages 93–100

Page 65: Mechanical Librarian

Conclusions

• Recommender technology is only 12 years old, but mature

enough for widespread commercial use.

• Digital Libraries / Web 2.0 Bibliographic applications are

beginning to use recommenders.

• Digital Libraries create new problems for recommenders

(“context drift” / “data sparsity” / “multiple dimensions”)

• Recommenders insufficiently understood in Digital Libraries.

• Recommender as mechanism for enhancing the process of

scientific discovery promising but still uncertain.
