Mechanical Librarian

The Mechanical Librarian

Recommending Journal Articles

in a Scientific Digital Library

Andre Vellino

[email protected]

Group Leader, CISTI Research

Canada Institute for Scientific and Technical Information

Chef de groupe, Recherche ICIST

Institut canadien de l'information scientifique et technique

Outline of Talk

• The Mechanical Librarian

• How Recommenders Work

• Recommenders in Digital Libraries

• Problems for Science Article Recommenders and

Strategies for CISTI’s Recommender Research

• Synthese on CISTI Lab

• Alternative Approaches

• Future Work

Acknowledgements to: Glen Newton, Jeff Demaine and Greg Kresko &

Students: Dave Zeber, Matthew Rutledge-Taylor and Aurel Constantinescu

The Human (Reference)

Librarian

World Knowledge Experience

Authoritative

Trustworthy

References

Vocabularies

Databases

The Mechanical Librarian

The Web, they say, is leaving the era of search and entering

one of discovery. What's the difference? Search is what you

do when you're looking for something. Discovery is when

something wonderful that you didn't know existed, or didn't

know how to ask for, finds you.

Jeffrey M. O'Brien, Fortune Magazine

Knowledge Discovery

Technologies

• Text Mining

– Enhances the researcher’s ability to

discover new and meaningful information

from existing text repositories

• Network Analysis

– Distills the structural relationships among

bibliographic elements to reveal trends

and patterns in science

• User Behaviour

– Infers “wisdom of the crowds” from

usage statistics

What is a “Recommender”?

• A recommender is a software system which attempts to predict

items that a user may be interested in, given information about

– the user's interests

– the content in the items

– the usage patterns of other users

• Items may be:

– Merchandise: movies, music, books

– Text: news, blogs, web pages, and, why not,

Scientific Journal Articles

Amazon Recommender

System

Explanations

User Ratings

Category Filter

Personalized

User

Control

Companies That Offer

Recommenders to Users

Books

Web Sites

Movies

Music

Companies That Sell

Recommender Services

Product Merchandise Placement

Database Mining

Advertising / Product Placement

Software as a Service Platform

Recommendation is Hard

Netflix Prize: $1M

• Netflix Prize

– To develop a recommender that improves the quality of

recommendations by 10% over Netflix’s own system

– http://www.netflixprize.com/

• Current Leader Board

– BellKor (9.6%)

– … + 39 others

• NY Times Magazine Article

http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html

Good Recommendations

are REALLY Hard

Taxonomy of

Recommender Systems

Collaborative Filtering

• Usage based, with item-ratings

– User-Based (“similar users”)

– Item-Based (“like items”)

• Algorithms

– Memory-based

– Model-based

Content-Based Filtering

• Content (text / waveform / pixel) analysis to

– Find “similar users”

– Find “similar items”

J. Breese, D. Heckerman, C. Kadie, et al. Empirical Analysis of Predictive Algorithms for Collaborative

Filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 461, 1998.

How Collaborative

Filtering (CF) Works

• User-Based CF

– Given user A find all the other users {U} that have the most

“similar” item-rating patterns

– For each item I not yet rated by A, predict the likely rating A will

assign to I given the ratings for I given by {U}

– Present the Top-N ordered list of items {I} to the user

• Item-Based CF

– Given user A and the set of items {I} to which A has given

ratings, find all the other items {O} that are “similar” to {I}

– Present the Top-N ordered list of items {O} to the user
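The item-based variant described above can be sketched roughly as follows. This is only an illustration, assuming cosine similarity between item rating vectors; the toy `ratings` data and the names `item_vector`, `cosine` and `recommend_item_based` are hypothetical and do not come from the talk or from Sarwar et al.

```python
from math import sqrt

# Hypothetical toy data (user -> {item: rating}), invented for illustration.
ratings = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5},
    "u3": {"b": 2, "c": 5, "d": 4},
    "u4": {"a": 5, "d": 1},
}

def item_vector(item):
    """Ratings an item has received, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(v, w):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

def recommend_item_based(user, top_n=2):
    """Score each unrated item O by its similarity to the items I the user rated."""
    rated = ratings[user]
    all_items = {i for r in ratings.values() for i in r}
    scores = {}
    for other in all_items - rated.keys():
        # Similarity to each rated item, weighted by the user's own rating.
        scores[other] = sum(
            cosine(item_vector(other), item_vector(i)) * r for i, r in rated.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Calling `recommend_item_based("u2")` ranks the two items that u2 has not rated. Production systems usually precompute the item-item similarities offline, which this sketch does not attempt.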

Sarwar, Badrul M., George Karypis, Joseph A. Konstan, and John Riedl. "Item-based

collaborative filtering recommendation algorithms." World Wide Web. 2001, 285-295.

Find “Nearest Neighbour”

and Predict Rating

• Find Nearest Neighbours (e.g. cosine similarity)

• Predict Rating (item i for user u)

– Weighted average of the ratings given to item i by the N users most similar to u
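The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity over raw ratings and a plain similarity-weighted average (no mean-centering). The function names and the toy data are hypothetical and are not taken from the talk.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    shared = a.keys() & b.keys()
    dot = sum(a[i] * b[i] for i in shared)
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def predict(ratings, user, item, n=2):
    """Similarity-weighted average of the ratings given to `item`
    by the n users most similar to `user` who rated it."""
    candidates = [
        (cosine(ratings[user], r), r[item])
        for v, r in ratings.items()
        if v != user and item in r
    ]
    neighbours = sorted(candidates, reverse=True)[:n]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour, so no prediction
    return sum(sim * rating for sim, rating in neighbours) / total

# Hypothetical toy ratings, invented for illustration only.
toy = {
    "ann": {"x": 4, "y": 4},
    "ben": {"x": 5, "y": 4, "z": 5},
    "cat": {"x": 1, "z": 2},
}
```

With `n=1` the prediction for `predict(toy, "ann", "z", n=1)` is simply the rating of the single nearest neighbour; with `n=2` the less similar user pulls the estimate down.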

User-Based

Collaborative Filtering

[Table: ratings (1 to 5) by users (Reader, Ted, Alice, Carol, Bob) of movies (Milk, Doubt, Dark Knight, Bolt); Ted's rating of “Bolt” is the unknown “?”.]

• Goal: predict the rating Ted will give to the movie “Bolt”

• Step 1 – eliminate the user-profiles of users who didn’t rate “Bolt”

• Step 2 – find Ted’s “K-nearest neighbours” who rated “Bolt” and at

least 2 other movies (Alice)

• R(Ted,Bolt) ~= 5.

Things that can go wrong

with Collaborative Filtering

• False “product ratings” to artificially boost ranking (spamming)

• Losing the diversity in the “Long Tail” – converges to “Top N”.

Fleder, D. and K. Hosanagar. 2008. Blockbuster culture's next rise or fall: The effect of

recommender systems on sales diversity. NET Institute Working Paper 07-10.

Content-Based

Recommenders

“These things are similar (in content) to that”.

• Depends only on a measure of similarity between the content in

the items (text, music, images)

• Typical Steps for Content Based Recommenders

1. Cluster the user’s purchased or highly-rated items by

content-similarity

2. Find other similar items not purchased or rated by the user

3. Recommend the “Top N” to the user
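A rough sketch of these steps for text items follows. It simplifies step 1 by merging the user's liked items into a single profile vector instead of clustering them, and it assumes TF-IDF weighting with cosine similarity. The function names and the example titles are hypothetical.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Map each item id to a TF-IDF weighted bag of words."""
    tokens = {d: text.lower().split() for d, text in docs.items()}
    df = Counter(w for words in tokens.values() for w in set(words))
    n = len(docs)
    return {
        d: {w: tf * log(n / df[w]) for w, tf in Counter(words).items()}
        for d, words in tokens.items()
    }

def cosine(v, w):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

def recommend_by_content(docs, liked, top_n=2):
    """Rank items the user has not seen by similarity to the items they liked."""
    vecs = tfidf_vectors(docs)
    profile = {}
    for d in liked:  # simplification: one merged profile, no clustering
        for w, x in vecs[d].items():
            profile[w] = profile.get(w, 0.0) + x
    scores = {d: cosine(profile, v) for d, v in vecs.items() if d not in liked}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical item texts, invented for illustration only.
docs = {
    "d1": "protein folding simulation",
    "d2": "protein structure prediction",
    "d3": "galaxy cluster survey",
    "d4": "protein folding kinetics",
}
```

Here `recommend_by_content(docs, ["d1"])` ranks the item that shares two informative words with the liked item ahead of the one that shares only a common word.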

Search Engine as

“Content-Based

Recommender”

Collaborative filtering

“Similar Pages” is a

Content-Based

Recommender

What can go wrong with

Content Based

Recommenders

that use only Metadata

• Bad Men Do What Good Men Dream: A Forensic Psychiatrist Illuminates the Darker Side of Human Behavior

• Do Animals Dream?: Children's Questions about Animals Most Often asked of the Natural History Museum

• All I Do is Dream of You

• The other end of the leash : why we do what we do around dogs

• Why do Catholics do that : a guide to the teachings and practices of the Catholic Church

• Electric universe : the shocking true story of electricity

• The Island of Sheep

Value of Recommenders

in a Digital Library

• For the Researcher

– Provide serendipity in a Browse / Search / Retrieve portal

• Broaden scope of search to cognate but otherwise disparate domains

• For the Library

– Increase customer loyalty by creating dynamic, adaptive,

customized services

• Alerts & notifications based on usage and collaborative filtering rather

than stored queries

• For Authors

– Given a draft article (with citations), find additional citations

• For Publishers & Journal Reviewers

– Given a submitted article, recommend peer reviewers

Recommender Systems in

Digital Libraries

– TechLens (University of Minnesota) (2002)

• Uses ACM DL, full text Mixed Hybrid

– BibTip (University of Karlsruhe) (2003)

• Uses OPAC (Library Catalog) usage data for collaborative filtering

– IngentaConnect (2007)

• Uses Baynote (SaaS) customer tracking

– DSpace (2008)

• Content-based recommender based on user-bookmarks

– CiteULike (academic experiment 2008)

• Collaborative filtering on user bookmarks from CiteULike

– “bX” system from Ex Libris (2009)

• Uses SFX resolver logs

– NextBio (to be announced in March 2009)

• Life sciences search engine that uses collaborative filtering + ontologies to suggest new content (trials / abstracts / data)

http://techlens.cs.umn.edu/tl3/

http://www.bibtip.org/bibtip_en.html

http://allmyeye.blogspot.com/2007/11/social-media-news-release.html

http://www.hpl.hp.com/techreports/2008/HPL-2008-21.html?mtxs=rss-hpl-tr

http://ilk.uvt.nl/~toine/publications/bogers.2008.recsys2008-paper.pdf

http://www.exlibrisgroup.com/category/bXOverview




http://www.nextbio.com/

TechLens

“bX”

Recommender (Jan ’09)

Features

• Uses log data from SFX resolvers

• Applies Collaborative Filtering

• Uses lots of aggregated data

• Developed w/ the Los Alamos National Laboratory.

Possible issues

• Infers identity of users only through IP address

• May not be accurate when http proxies are used

• Same IP address can have several “IR objectives”

• Identical resolved objects may not be recognized

Typical Problems with CF

Recommenders in General

• Data Sparsity

– Ratio of Users / Items is low (~ 1:10)

– Number of Ratings per User is low

– Ratings matrix sparsity ~ 95%

• Cold Start Problem

– First-time users get poor or no recommendations because CF matrix has no entries

• Rating Items

– A CF recommender must be trained (explicitly or implicitly) with users' ratings of items

• Principle of Induction

– People who exhibited similar behaviour in the past will tend to exhibit similar behaviour in the future.
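The data sparsity figure above is simply the fraction of empty cells in the users-by-items ratings matrix. A small sketch of the arithmetic, using made-up counts chosen only to match the 1:10 user-to-item ratio mentioned on the slide:

```python
def sparsity(n_users, n_items, n_ratings):
    """Fraction of empty cells in a users x items ratings matrix."""
    return 1.0 - n_ratings / (n_users * n_items)

# Hypothetical counts: 100 users, 1,000 items (ratio 1:10),
# and 50 ratings per user on average.
example = sparsity(n_users=100, n_items=1000, n_ratings=100 * 50)
```

With these assumed counts only 5,000 of 100,000 cells are filled, so `example` comes out at about 0.95.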

Specific Problems for

Collaborative Filtering in

Science Digital Libraries

• Data Sparsity

– Many More Articles & Far Fewer Users (10x)

– Fewer Item / Ratings (~ 99% sparsity)

• Rating Articles

– Explicit ratings are more difficult to obtain

• DL users have less need to “express themselves” by explicitly rating items than movie watchers

– Implicit ratings depend on UI features of DL

• No reliable method for inferring ratings from browsing and query behaviour

• Principle of Induction (that past is a good predictor of the future) not necessarily true in digital libraries

– Interest drift

– Context shifts

Recommender Research

Strategy @ CISTI

• Follow in footsteps of TechLens+

– Collaborative Filtering (CF) among users

– Seed CF recommender with citation matrix

– Extended with

• PageRank on Citations

• User Contexts

– Future Extensions

• Add Content-Based Filtering (“Fusion Mixed Hybrid” model)

• Distributed Multi-Dimensional Recommender

• Explanation-based interface

A. Vellino and D. Zeber. (2007) “A Hybrid, Multi-dimensional Recommender for Journal

Articles in a Scientific Digital Library.” Conference Proceedings on Web Intelligence and

Intelligent Agent Technology

Making a Reference Rating

Recommender Citation

Seeding

• Articles either cite or don’t cite other articles

• Some articles that are cited are not in collection

• Users’ “article collection profile” citations

TechLens approach to Cold Start / Data Sparsity problem
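One way to picture citation seeding is to treat every article as a pseudo-user whose "ratings" are the articles it cites, so the CF matrix has entries before any real user has rated anything. The sketch below assumes that reading; the function name `seed_from_citations` and the toy identifiers are hypothetical.

```python
def seed_from_citations(citations, collection):
    """Build Boolean pseudo-ratings: each citing article 'rates' with 1
    every article it cites, ignoring cited articles outside the collection."""
    return {
        citing: {cited: 1 for cited in refs if cited in collection}
        for citing, refs in citations.items()
    }

# Hypothetical toy data: "x9" is cited but is not held in the collection.
citations = {"p1": ["p2", "p3", "x9"], "p2": ["p3"]}
collection = {"p1", "p2", "p3"}
seed = seed_from_citations(citations, collection)
```

The resulting `seed` dictionary has the same shape as a user-to-item ratings table, so it can be fed to a CF algorithm alongside real user profiles.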

Synthese Recommender

on CISTI Lab

Query Index

Add Important Articles to “Basket” (1 to 4)

Query Again

Add More Articles to “Basket” (1 to 2)

Recommend Based on Current “Basket”

View Recommendations

Evaluate Recommender

Search and Basket History

Multiple Profiles

Synthese Performance

[Bar chart: Ratings of Recommendations. x-axis: Ratings (1 to 5); y-axis: Percentage (0 to 35).]

Recommender Citation

Seeding

Can we improve on 0 / 1 (Boolean) citation seeding?

Apply PageRank to

Citation Matrix

Aurel Constantinescu “Ranking Full-Text Articles using Citation Based Methods”

Master’s Thesis, University of Ottawa

PageRank algorithm applied to citations

PageRank-weighted

Citation matrix

• Apply PageRank on Citations

– Use citation data (as in TechLens+)

– Apply PageRank to weight the citation-based “ratings”

• Done before but only at the Journal level (http://www.eigenfactor.org/)
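A sketch of how PageRank could replace the Boolean citation "ratings" is given below. It uses a textbook power-iteration PageRank with a damping factor of 0.85; the exact weighting scheme used in the CISTI experiments is not stated on the slide, so `weighted_seed` is an assumption, and all names and toy data are hypothetical.

```python
def pagerank(cites, damping=0.85, iterations=50):
    """Power-iteration PageRank over a citation graph
    given as article -> list of cited articles."""
    nodes = set(cites) | {c for refs in cites.values() for c in refs}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            refs = cites.get(p, [])
            if refs:
                share = damping * rank[p] / len(refs)
                for c in refs:
                    new[c] += share
            else:
                # Dangling article (cites nothing): spread its rank evenly.
                for q in nodes:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

def weighted_seed(cites):
    """Assumed weighting: replace each 0/1 citation 'rating'
    with the PageRank of the cited article."""
    rank = pagerank(cites)
    return {p: {c: rank[c] for c in refs} for p, refs in cites.items()}

# Hypothetical toy graph: p1 and p2 both cite p3.
toy = {"p1": ["p3"], "p2": ["p3"], "p3": []}
```

In the toy graph the twice-cited article p3 ends up with the highest rank, so citations to it carry more weight in the seeded matrix than a citation to a rarely cited article would.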

[Figure: PageRank-weighted citation matrix. Labels: users (u1, u2), articles (p1 to p4), citations (p1 to p8); cell weights range from 0.2 to 0.7; legend entry “= constant”.]

PageRank Experimental

Results

A. Vellino “The Effect of PageRank on the Collaborative Filtering of Journal Articles”

NRC Research Report, 2008.

What is a Holographic

Memory System?

• A Holographic Memory System (HMS) stores information in

a manner analogous to the storage of an image on a

holographic plate.

• HMS is composed of units called items

– Each item represents some content

• e.g., a concept, a word, a bibliographic item

– Items are analogous to points on the surface of

holographic film (or, plate)

– Each item stores information about the associations it

has with other items

T. A. Plate, 2003 Holographic Reduced Representations: Distributed Representations for

Cognitive Structures (Stanford, CA: CSLI Publications)

Holographic Memory

System (HMS)

Apple

Red

Spherical

Fruit

Each item stores information about

many other items in the system

HMS

Each point on the Holographic plate

stores information about many parts

of the image

Holography

HMS Recommender for

Journal Articles

• We compared DSHM (Dynamically Structured Holographic Memory) and user-based CF on journal article

recommendation on 2 small collections

• 90% - 10% Cross Validation

• systematically removed one reference at a time

• tested whether recommender predicts the reference.

• compared DSHM and user-based CF
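The "remove one reference at a time" test can be sketched as a leave-one-out hit rate. The sketch below covers only that part of the protocol, not the 90% / 10% cross validation split, and it uses a trivial most-cited baseline as a stand-in recommender. All names and data here are hypothetical and are not the DSHM or CF systems from the study.

```python
from collections import Counter

def leave_one_out_hit_rate(profiles, recommend, top_n=10):
    """Hide one reference at a time and check whether `recommend`
    returns it among its top_n suggestions for that article."""
    hits = trials = 0
    for article, refs in profiles.items():
        for hidden in refs:
            reduced = dict(profiles)
            reduced[article] = [r for r in refs if r != hidden]
            if hidden in recommend(reduced, article)[:top_n]:
                hits += 1
            trials += 1
    return hits / trials if trials else 0.0

def most_cited(profiles, article):
    """Stand-in baseline: suggest the references most cited by other articles."""
    counts = Counter(r for a, refs in profiles.items() if a != article for r in refs)
    known = set(profiles[article])
    return [r for r, _ in counts.most_common() if r not in known]

# Hypothetical toy collection: article -> list of references.
profiles = {"a": ["x", "y"], "b": ["x", "y"], "c": ["x"], "d": ["z"]}
rate = leave_one_out_hit_rate(profiles, most_cited)
```

In this toy collection the only miss is the reference that no other article cites, which is the kind of case where sparse bibliographic data makes prediction hard for any co-occurrence method.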

Collection    Articles    References per article

Medicine      7,495       0.55

Biology       38,667      1.15

M. F. Rutledge-Taylor, A. Vellino and R. L. West. “A Holographic Associative Memory

Recommender System” 3rd Int. Conference on Digital Information Management, London, 2008.

Experimental Results

• Advantages

– Holographic System outperformed standard user-based

CF on very sparse bibliographic datasets

– DSHM is better able to exploit the available information

– The uniformly consistent model of DSHM gives it good

potential for success on multi-dimensional datasets

• Disadvantages

– Requires a lot of computational resources

– Unclear how well it works on a large scale.

Holographic Recommender:

Discussion

Multi-Dimensional Ratings

Matrix

G. Adomavicius, R. Sankaranarayanan, S. Sen, A. Tuzhilin. Incorporating Contextual Information in Recommender Systems

Using a Multidimensional Approach. ACM Transactions on Information Systems, 2005.

Scaling Strategy:

Distributed

Recommenders

• Multiple ratings matrices decomposed by subject area

• Merge separate recommendations by subject

• Reduces matrix sparsity

• Improves accuracy of recommendations

S. Berkovsky, T. Kuflik, and F. Ricci. Distributed Collaborative Filtering with

Domain Specialization. Proceedings of Recommender Systems 2007.

[Bar chart: What predicts overall usefulness of a system? y-axis: Correlation (0 to 0.6); bars: Good Rec., Useful Rec., Trust Generating Rec., Adequate Item Description, Ease of Use.]

Importance of Quality and

Trust

Rashmi Sinha & Kirsten Swearingen – UC Berkeley

UI for Navigating

Recommendations

• Explanation-based

Recommendations

– Provide transparency and

increase user trust

– Allow users to cluster by

type of reason

– Filter out unwanted

recommendations

P. Pu and L. Chen. Trust Building with Explanation Interfaces. In IUI ’06: Proceedings of

the 11th International Conference On Intelligent User Interfaces, pages 93–100

Conclusions

• Recommender technology is only 12 years old, but mature

enough for widespread commercial use.

• Digital Libraries / Web 2.0 Bibliographic applications are

beginning to use recommenders.

• Digital Libraries create new problems for recommenders

(“context drift” / “data sparsity” / “multiple dimensions”)

• Recommenders insufficiently understood in Digital Libraries.

• Recommender as mechanism for enhancing the process of

scientific discovery promising but still uncertain.

Thank You!

Questions?

http://lab.cisti-icist.nrc-cnrc.gc.ca/synthese/

http://lab.cisti-icist.nrc-cnrc.gc.ca/





