Recommender Systems Evaluation Beyond Accuracy
ACM Latin American School on Recommender Systems (LARS 2019)
Fortaleza, Brazil, October 10, 2019

Pablo Castells
Universidad Autónoma de Madrid, IR Group @ UAM
http://ir.ii.uam.es/castells
Slides: ir.ii.uam.es/castells/lars2019.pdf


Outline

1. Motivation: beyond relevance

2. Measuring novelty and diversity

3. Enhancing novelty and diversity

4. Biases in recommendation

1. Motivation: beyond relevance

Motivation

What is the purpose of recommendation?

Satisfying users… by making suggestions they like

If we recommend things that a user likes, then the user will be satisfied

Ergo…


Motivation

Do you like this?

[Images: a movie, a book, a tourist attraction, a music album]


Motivation

Would you find it useful to recommend it?

Probably not. Everybody knows those already.

[Images: a movie, a book, a tourist attraction, a music album]


Motivation

Would you find it useful to recommend this?

Maybe, provided they are liked…

[Images: a movie, a book, a tourist attraction, a music album]


Motivation

Would you find it useful to recommend this?

Not obvious or widely known

…but too much of the same genre?

[Images: five items, all labeled Sci-fi]


Motivation

Would you find it useful to recommend this?

[Images: five items labeled Sci-fi, Animation, Comedy, Adventure, Documentary]

Seems better?


Beyond accuracy…

How to improve?

Define

Understand

Measure

…then try to improve

Novelty, Diversity


Definition: Novelty

How different recommendations are from “something else”

E.g. user knowledge or experience

(Vargas & Castells RecSys 2011, Castells et al. Handbook 2015)


Definition: Diversity

How different recommendations are from each other

How novel each item is to the other recommended items

(Vargas & Castells RecSys 2011, Castells et al. Handbook 2015)


Why diverse and novel recommendations

For the sake of it: direct user satisfaction

Natural variety-seeking drive in human behavior

– Within a recommendation and over time

– Desire for the unfamiliar, alternation among the familiar

– Ideal level of stimulation

Broaden the user’s horizon / avoid bubbles

The task is often explicitly about discovery

(Castells et al. Handbook 2015, Kaminskas & Bridge ACM TIIS 2017)


Why diverse and novel recommendations

For enhanced business performance

Sales diversity: mitigate risk, expand the business

Long tail: draw revenues from market niches

– “Sell less of more”

– Higher profit margin on cheaper long-tail products

Fairness! Give all stakeholders a fair chance

(Castells et al. Handbook 2015, Kaminskas & Bridge ACM TIIS 2017)


Why diverse and novel recommendations

For better system effectiveness (“a safer bet”)

Uncertainty about user preferences

– System observations are ambiguous, very incomplete

– User preferences are multiple, dynamic, contextual…

Increase chances of at least some relevant item

(Castells et al. Handbook 2015, Kaminskas & Bridge ACM TIIS 2017)

2. Measuring novelty and diversity

Measuring accuracy

The “rating” matrix: abstraction of user-item interaction

[Figure: users × items rating matrix holding a few known ratings (values 1 to 5); the cells to be predicted are marked “?”]

Rating matrix with some available cell values, most cells empty

Rank items by predicting missing ratings

Evaluation: see if predictions match reality

Offline evaluation: just hide a few cell values and use them as test
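The offline protocol above can be sketched in a few lines. This is only an illustration: the dictionary layout, the `holdout_split` and `rmse` helpers, the global-mean predictor and the choice of RMSE as the error measure are all assumptions made for the example, not something the slides prescribe.

```python
import random

def holdout_split(ratings, test_fraction=0.2, seed=0):
    """Hide a fraction of the known ratings; return (train, test) dicts."""
    rng = random.Random(seed)
    keys = sorted(ratings)              # fixed order before shuffling
    rng.shuffle(keys)
    n_test = max(1, int(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = {k: v for k, v in ratings.items() if k not in test_keys}
    test = {k: ratings[k] for k in test_keys}
    return train, test

def rmse(predict, test):
    """Root mean squared error of predict(user, item) on the hidden cells."""
    errors = [(predict(u, i) - r) ** 2 for (u, i), r in test.items()]
    return (sum(errors) / len(errors)) ** 0.5

# Hypothetical toy data: (user, item) -> rating.
ratings = {("u1", "i1"): 4, ("u1", "i2"): 2, ("u2", "i1"): 4,
           ("u2", "i3"): 1, ("u3", "i2"): 3, ("u3", "i3"): 5}
train, test = holdout_split(ratings, test_fraction=0.33)

# Trivial baseline: predict the mean of the visible ratings everywhere.
global_mean = sum(train.values()) / len(train)
error = rmse(lambda u, i: global_mean, test)
```

Any real recommender would replace the global-mean lambda; the split and the comparison against hidden values stay the same.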


Measuring diversity and novelty

Different metrics for different notions

Common notions

– Unpopularity

– Unexpectedness

– Serendipity

– Intra-list dissimilarity

– Sales diversity

– Aspect-based diversity

Many other, more particular metrics

[Figure: a recommendation $R = (i_1, i_2, i_3, i_4, i_5, \ldots)$ for a user $u$, annotated with “Novelty” and “Diversity”]


Measuring novelty

Novelty: never seen vs. not familiar

I have not seen this movie

But I have seen these movies…

(Vargas & Castells RecSys 2011, Castells et al. Handbook 2015)


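Measuring novelty – never seen

Long-tail novelty

What is the chance the user has never seen the items

How “not popular” are the recommended items

E.g. mean self-information

$$\mathrm{MSI} = -\frac{1}{|R|} \sum_{i \in R} \log_2 p_i$$

where $p_i$ is the popularity of $i$:

$$p_i = \frac{\#\,\text{users who have interacted with } i}{\text{total } \#\,\text{users}}$$

[Figure: popularity $p_i$ plotted over items $i$; the short head is not novel, the long tail is novel]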

(Zhou et al. PNAS 2010, Vargas & Castells RecSys 2011, Zhang et al. WSDM 2012, etc.)
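A minimal sketch of mean self-information as defined on this slide. The function names and the toy interaction log are made up for the example, and items that nobody has interacted with ($p_i = 0$) are simply assumed not to occur.

```python
import math

def popularity(interactions, n_users):
    """p_i = (#users who have interacted with i) / (total #users)."""
    users_per_item = {}
    for user, item in interactions:
        users_per_item.setdefault(item, set()).add(user)
    return {item: len(users) / n_users
            for item, users in users_per_item.items()}

def mean_self_information(recommendation, p):
    """MSI = -(1/|R|) * sum over i in R of log2(p_i)."""
    return -sum(math.log2(p[i]) for i in recommendation) / len(recommendation)

# Hypothetical log: item "a" is known to all 4 users, "b" to 2, "c" to 1.
interactions = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "a"),
                ("u1", "b"), ("u2", "b"), ("u3", "c")]
p = popularity(interactions, n_users=4)

msi_popular = mean_self_information(["a"], p)       # short-head item only
msi_mixed = mean_self_information(["b", "c"], p)    # long-tail items
```

Recommending only the item everybody knows gives an MSI of 0 bits; the two less popular items average 1.5 bits.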


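Measuring novelty – not familiar

Unexpectedness

User-specific

How unfamiliar the items are to the user experience

E.g. average distance to items in user profile

$$\mathrm{Unexp} = \frac{1}{|R|\,|\mathbf{u}|} \sum_{i \in R} \sum_{j \in \mathbf{u}} d(i, j)$$

where $\mathbf{u}$ denotes the items “rated” by $u$

[Figure: distances $d(i, j)$ between the items in $R$ and the items in the profile of $u$]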

(Adamopoulos & Tuzhilin ACM TIST 2014, Hurley & Zhang ACM TOIT 2011, Zhang et al. WSDM 2012, etc.)
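As an illustration of the average-distance form of unexpectedness, the sketch below uses a Jaccard distance over item feature sets. The distance function, the feature data and the function names are assumptions for the example; the slide leaves $d$ unspecified at this point.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| over two feature sets (assumed distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def unexpectedness(recommendation, profile, features, dist=jaccard_distance):
    """Average distance between recommended items and the user's profile items."""
    total = sum(dist(features[i], features[j])
                for i in recommendation for j in profile)
    return total / (len(recommendation) * len(profile))

# Hypothetical item features and user profile.
features = {"m1": {"sci-fi"}, "m2": {"sci-fi", "action"},
            "m3": {"documentary"}, "m4": {"sci-fi"}}
profile = ["m1", "m2"]

familiar = unexpectedness(["m4"], profile, features)
unfamiliar = unexpectedness(["m3"], profile, features)
```

Here the sci-fi item scores 0.25 against a sci-fi profile, while the documentary scores the maximum of 1.0.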


Measuring novelty – serendipity

Serendipity: novelty + relevance

[Figure: “Novel” and “Relevant” item sets]

E.g. compute a novelty metric counting only relevant items

(Iaquinta HIS 2008, Ge et al. RecSys 2010, Zhang et al. WSDM 2012)
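One possible reading of “a novelty metric counting only relevant items” is sketched below, using self-information as the novelty measure. Averaging over the whole list (so that non-relevant items contribute zero) is an assumption of this sketch; other normalizations are equally defensible.

```python
import math

def serendipity_msi(recommendation, p, relevant):
    """Self-information of the relevant items only, averaged over |R|."""
    gain = sum(-math.log2(p[i]) for i in recommendation if i in relevant)
    return gain / len(recommendation)

# Hypothetical popularities and relevance judgments.
p = {"a": 0.5, "b": 0.25, "c": 0.125}
score = serendipity_msi(["a", "b", "c"], p, relevant={"b"})
nothing_relevant = serendipity_msi(["a", "b", "c"], p, relevant=set())
```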


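Measuring diversity

Intra-list dissimilarity: average pairwise distance

$$\mathrm{ILD} = \frac{2}{|R|\,(|R| - 1)} \sum_{\substack{i, j \in R \\ i \neq j}} d(i, j)$$

(each unordered pair counted once)

$$d(i, j) = 1 - sim(i, j) \quad \text{(based on item features)}$$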

(Smyth & McClave ICCBR 2001, Ziegler et al. WWW 2005, etc.)
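A small sketch of intra-list dissimilarity, reading the sum as running over unordered pairs so that the result is the average pairwise distance. The Jaccard distance stands in for $1 - sim(i, j)$ and, like the toy features, is an assumption of the example.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| over two feature sets (assumed 1 - sim)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def intra_list_dissimilarity(recommendation, features, dist=jaccard_distance):
    """Average distance over all unordered pairs of recommended items."""
    n = len(recommendation)
    if n < 2:
        return 0.0
    total = sum(dist(features[i], features[j])
                for i, j in combinations(recommendation, 2))
    return 2.0 * total / (n * (n - 1))

# Hypothetical genre features.
features = {"m1": {"sci-fi"}, "m2": {"sci-fi"},
            "m3": {"comedy"}, "m4": {"documentary"}}

same_genre = intra_list_dissimilarity(["m1", "m2"], features)
all_different = intra_list_dissimilarity(["m1", "m3", "m4"], features)
mixed = intra_list_dissimilarity(["m1", "m2", "m3"], features)
```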


Aspect-based diversity

With respect to a space of user “subtastes”: genres, categories, etc.

Inspired by intent-oriented search diversity

(Vargas et al. SIGIR 2011, Wasilewski & Hurley UMAP 2018, Kaya & Bridge UMUAI 2019, etc.)


Diversity in search


“Avoid redundancy of possible user intents (aspects) as a means to cope with the uncertainty in the query”

(Carbonell & Goldstein SIGIR 1998, Clarke et al. SIGIR 2008, Agrawal et al. WSDM 2009, Santos et al. WWW 2010, Santos et al. Found. & Trends in IR 2015)


Search result diversity

[Figure: two panels, uniform results vs. diverse results. In each, the relevant document rank is plotted against the query senses / aspects (query ambiguity / incompleteness), showing the added utility]


Search diversity evaluation

Metrics

Aspect recall

$$\text{Aspect recall} = \frac{1}{|\mathcal{A}_q|}\, \#\{\, a \in \mathcal{A}_q \mid \exists\, d \in R \text{ that covers } a \,\}$$

Intent-aware metrics

$$\text{ERR-IA} = \sum_{d_k \in R} \frac{1}{k} \sum_{a \in \mathcal{A}_q} rel(d_k \mid a) \prod_{j < k} \bigl(1 - \alpha\, rel(d_j \mid a)\bigr)$$

The components of ERR-IA account for ranking, relevance, novelty and diversity

Aspects?

Query aspects: manually defined (e.g. TREC), Wikipedia disambiguation, suggested query reformulations…

Document aspects: categories, clusters…

(Clarke et al. SIGIR 2008, Agrawal et al. WSDM 2009, Chapelle et al. Inf. Ret. 2011, Santos et al. WWW 2010)
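The two metrics can be sketched as follows. The ERR-IA function follows the formula as it can be read from the slide, with no aspect weighting and with `alpha` scaling the relevance of earlier documents; that reading, the graded relevance value of 0.5 and the toy documents are assumptions of this example, not the reference definition from the cited papers.

```python
def aspect_recall(ranking, aspects_of, query_aspects):
    """Fraction of query aspects covered by at least one ranked document."""
    covered = {a for d in ranking for a in aspects_of.get(d, ())
               if a in query_aspects}
    return len(covered) / len(query_aspects)

def err_ia(ranking, rel, query_aspects, alpha=1.0):
    """Sum over ranks k and aspects a of rel(d_k|a)/k, discounted by the
    chance that an earlier document already satisfied aspect a."""
    score = 0.0
    for k, d_k in enumerate(ranking, start=1):
        for a in query_aspects:
            not_yet_satisfied = 1.0
            for d_j in ranking[:k - 1]:
                not_yet_satisfied *= 1.0 - alpha * rel(d_j, a)
            score += rel(d_k, a) * not_yet_satisfied / k
    return score

# Hypothetical ambiguous query with three senses.
query_aspects = ["animal", "car", "team"]
aspects_of = {"d1": {"animal"}, "d2": {"animal"}, "d3": {"car"}}

def rel(d, a):
    """Assumed graded relevance: 0.5 if document d covers aspect a."""
    return 0.5 if a in aspects_of[d] else 0.0

uniform = ["d1", "d2"]    # both results about the same sense
diverse = ["d1", "d3"]    # results cover two senses

recall_uniform = aspect_recall(uniform, aspects_of, query_aspects)
recall_diverse = aspect_recall(diverse, aspects_of, query_aspects)
err_uniform = err_ia(uniform, rel, query_aspects)
err_diverse = err_ia(diverse, rel, query_aspects)
```

Under these assumptions the diverse ranking scores higher on both metrics, since the second document on an already covered sense is discounted.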


Aspect-based diversity in recommendation

“Avoid redundancy of possible user intents (aspects) as a means to cope with the uncertainty in the observed evidence of user interests”

(Vargas et al. SIGIR 2011, Wasilewski & Hurley UMAP 2018, Kaya & Bridge UMUAI 2019, etc.)


Aspect-based diversity in recommendation

[Figure: users × items rating matrix in which the row of the target user $u$ is the user profile; an items × item features matrix maps the profile of $u$ to “user aspects”]

Aspects from item features, using a “meaningful” item feature space

Derive user aspect distributions

IR diversity metrics and algorithms can now be applied

Other approaches to user interest subdivision have been considered

(Vargas et al. SIGIR 2011, Wasilewski & Hurley UMAP 2018, Kaya & Bridge UMUAI 2019, Vargas et al. OAIR 2013)
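One simple way to derive a user aspect distribution from item features is to count how often each feature occurs in the user profile and normalize. This counting estimator is an assumption made for illustration; the cited papers define their own estimators.

```python
from collections import Counter

def user_aspect_distribution(profile, item_aspects):
    """p(a|u): share of feature occurrences in the user's profile items."""
    counts = Counter(a for i in profile for a in item_aspects.get(i, ()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {a: c / total for a, c in counts.items()}

# Hypothetical item features (genres) and user profile.
item_aspects = {"m1": ["sci-fi"], "m2": ["sci-fi", "comedy"],
                "m3": ["documentary"]}
distribution = user_aspect_distribution(["m1", "m2", "m3"], item_aspects)
```

The resulting distribution plays the role that query aspects play in search, so aspect-based metrics can be computed per user.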


Sales diversity

Seller perspective

How spread are recommendations over the item inventory

Catalog exposure to sales

[Figure: number of users to whom each item is recommended, plotted over items, for Recommender A and Recommender B]

(Adomavicius & Kwon TKDE 2012, Li & Murata WI 2012, Vargas & Castells RecSys 2014, Jannach et al. UMUAI 2015, etc.)


Sales diversity

[Figure: ecology analogy. The set of all recommendations is the “ecosystem”; each recommendation “slot” holds one “individual of some species”; each item in the set of all items is one “species”]

Metrics: function over set of recommendations

Metrics adapted from ecology and other fields

Aggregate diversity

Total number of different items recommended in top $n$

Equivalent to “species richness”

$$\mathrm{Aggdiv} = \Bigl| \bigcup_u R_u \Bigr|$$

Gini-Simpson index

$$\mathrm{GSI} = 1 - \sum_i p_i^2$$

Entropy

$$H = -\sum_i p_i \log_2 p_i$$

Gini coefficient

$$G = \frac{1}{|\mathcal{I}| - 1} \sum_{k=1}^{|\mathcal{I}|} \bigl(2k - |\mathcal{I}| - 1\bigr)\, p_{i_k}$$

$p_i$ = ratio of users to whom $i$ is recommended


Sales diversity

Aggregate diversity: A as good as B

Gini, Gini-Simpson, Entropy: B better than A

[Figure: number of users to whom each item is recommended, plotted over items, for Recommender A and Recommender B, with $n$ marked on both panels]
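The sales diversity metrics can be sketched together as below. Two choices here are assumptions of the sketch rather than statements from the slides: $p_i$ is normalized as the share of all recommendation slots given to item $i$ (so that the values sum to 1), and the Gini coefficient takes the items sorted by increasing $p_i$. The two toy recommenders are hypothetical.

```python
import math
from collections import Counter

def sales_diversity(recommendations, catalog):
    """Aggregate diversity, Gini-Simpson, entropy and Gini coefficient.

    recommendations: dict user -> list of recommended items.
    catalog: all items (at least two), recommended or not.
    """
    counts = Counter(i for rec in recommendations.values() for i in rec)
    total = sum(counts.values())
    # Assumed normalization: share of recommendation slots, sorted ascending.
    p = sorted(counts.get(i, 0) / total for i in catalog)
    n = len(p)
    return {
        "aggdiv": len(counts),
        "gsi": 1.0 - sum(x * x for x in p),
        "entropy": -sum(x * math.log2(x) for x in p if x > 0),
        "gini": sum((2 * k - n - 1) * x
                    for k, x in enumerate(p, start=1)) / (n - 1),
    }

catalog = ["a", "b", "c", "d"]
# Recommender A pushes item "a" to everyone; B spreads slots evenly.
rec_a = {"u1": ["a", "b"], "u2": ["a", "b"], "u3": ["a", "c"], "u4": ["a", "d"]}
rec_b = {"u1": ["a", "b"], "u2": ["c", "d"], "u3": ["a", "c"], "u4": ["b", "d"]}

metrics_a = sales_diversity(rec_a, catalog)
metrics_b = sales_diversity(rec_b, catalog)
```

Both recommenders reach all four items, so aggregate diversity cannot tell them apart, while Gini-Simpson and entropy are higher and the Gini coefficient is lower (more even) for B.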


Common underlying principle to different diversity notions

Item novelty model: the recommended item is compared, by distance or identity, against a context

Context and resulting notion:

– Target user’s experience: Unexpectedness

– Other items in the same recommendation: Intra-list diversity

– Everyone else’s experience: Long-tail novelty

– Everyone else’s recommendations: Sales diversity

(Vargas & Castells RecSys 2011, Castells et al. Handbook 2015)


Relation between metrics

Pearson correlation (on MF baseline recommender)

Metric                         MSI    Unexp.       ILD Asp. rec.    ERR-IA      nDCG
ERR-IA                                                                          0.64
Aspect recall                                                        -0.02      0.03
ILD                                                         0.71     -0.09      0.03
Unexpectedness                                    0.85      0.62     -0.06      0.07
Long-tail novelty (MSI)            -0.19     -0.21     -0.19      0.10      0.02
Sales diversity (IUD)         0.87     -0.23     -0.27     -0.20      0.14      0.06

Unexp. = unexpectedness, Asp. rec. = aspect recall. ERR-IA and aspect recall are the aspect-based diversity metrics.

(Castells et al. Handbook 2015)

3. Enhancing novelty and diversity

Novelty and diversity enhancement

Diversity and accuracy viewed as opposing objectives

• Enhancing diversity (or novelty) is expected to involve some sacrifice in accuracy

• The goal is to achieve an optimal trade-off: a multiobjective optimization problem

• Results assessed by two metrics: relevance vs. diversity/novelty

[Figure: accuracy (vertical axis) plotted against diversity (horizontal axis)]


Greedy reranking for novelty / diversity

[Figure: pipeline. Input data (observations) → accuracy algorithm → initial ranking $R$ → greedy reranking → diversified ranking $S$ → end user]

Greedy version of target metric

$$\phi(i \mid S, u) = (1 - \lambda)\, rel(u, i) + \lambda\, div(i \mid S, u)$$

(Ziegler et al. WWW 2005, Carbonell SIGIR 1998, Agrawal et al. WSDM 2009, Santos et al. WWW 2010, etc.)


Greedy reranking for novelty / diversity

For instance…

IL Diversity (ILD)

$$div(i \mid S, u) = \sum_{j \in S} d(i, j)$$

Unexpectedness

$$div(i \mid S, u) = \sum_{j \in \mathbf{u}} d(i, j)$$

Long-tail novelty (MSI)

$$div(i \mid S, u) = -\log_2 p_i$$

$$\phi(i \mid S, u) = (1 - \lambda)\, rel(u, i) + \lambda\, div(i \mid S, u)$$

(Ziegler et al. WWW 2005, Carbonell SIGIR 1998, Agrawal et al. WSDM 2009, Santos et al. WWW 2010, etc.)
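A minimal sketch of the greedy reranking loop with the ILD-style diversity term (sum of distances to the items already selected). The Jaccard distance, the relevance scores and the item features are hypothetical, and no score normalization is applied, which a practical implementation would likely need since relevance and distance live on different scales.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| over two feature sets (assumed distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def greedy_rerank(candidates, rel, features, n, lam=0.5, dist=jaccard_distance):
    """Build S item by item, each time picking the candidate that maximizes
    phi(i|S,u) = (1 - lam) * rel(u,i) + lam * sum of d(i,j) for j in S."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n:
        def phi(i):
            div = sum(dist(features[i], features[j]) for j in selected)
            return (1 - lam) * rel[i] + lam * div
        best = max(remaining, key=phi)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical initial ranking with relevance scores for one user.
rel = {"m1": 0.9, "m2": 0.8, "m3": 0.7}
features = {"m1": {"sci-fi"}, "m2": {"sci-fi"}, "m3": {"comedy"}}

by_relevance = greedy_rerank(["m1", "m2", "m3"], rel, features, n=2, lam=0.0)
diversified = greedy_rerank(["m1", "m2", "m3"], rel, features, n=2, lam=0.5)
```

With `lam=0.0` the original order is kept; with `lam=0.5` the comedy overtakes the second sci-fi item. Swapping the `div` term for the unexpectedness or MSI variant only changes the body of `phi`.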


Novelty and diversity enhancement approaches

Many other specific approaches…

More sophisticated multiobjective optimization

User subprofiles, latent factors

Graph-based, clustering, portfolio theory…

Progressive transition towards the long tail

External vs. internal to algorithm

(Smyth & McClave ICCBR 2001, Ziegler et al. WWW 2005, Celma & Herrera RecSys 2008, Zhou et al. PNAS 2010, Hurley & Zhang TOIT 2011, Zhang et al. WSDM 2012, Shi et al. SIGIR 2012, Adomavicius & Kwon IEEE TKDE 2012, etc.)


Novelty and diversity by weighted recommender ensembles

Multi-objective maximization of accuracy, novelty (MSI) and diversity (ILD) with an evolutionary algorithm

Find the Pareto frontier on tradeoffs between the 3 metrics

[Figure: Pareto frontiers over accuracy (recall), MSI and ILD; panels: MovieLens, Last.fm]

Novelty and diversity enhancement

(Ribeiro et al. RecSys 2012, Veloso et al. ACM TIST 2014)
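To make the notion of a Pareto frontier concrete, the sketch below filters a set of candidate ensemble weightings down to the non-dominated ones. It does not reproduce the evolutionary search of the cited work; the configuration names and metric values are invented for illustration.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (accuracy, MSI, ILD), all to be maximized


def dominates(a: Scores, b: Scores) -> bool:
    """a dominates b if it is no worse on every metric and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(configs: Dict[str, Scores]) -> List[str]:
    """Return the names of the configurations that no other configuration dominates."""
    return [
        name
        for name, s in configs.items()
        if not any(dominates(t, s) for other, t in configs.items() if other != name)
    ]


# Hypothetical metric values for four ensemble weightings (not from the cited papers).
configs = {
    "w1": (0.30, 9.0, 0.60),
    "w2": (0.25, 11.0, 0.70),
    "w3": (0.20, 10.0, 0.65),  # worse than w2 on all three metrics
    "w4": (0.10, 12.0, 0.80),
}
frontier = pareto_frontier(configs)
```

Every configuration on the frontier is a defensible trade-off; choosing among them requires an external preference over the three metrics.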




Sales diversity enhancement

Recommend users to items by taking inverse kNN neighborhoods

(Vargas & Castells RecSys 2014)
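One way to read "inverse kNN neighborhoods" is the following: user $v$ belongs to the inverse neighborhood of $u$ when $u$ is among the $k$ nearest neighbors of $v$. The sketch below builds both kinds of neighborhood from a similarity table. It is an assumption-laden illustration rather than the cited paper's exact formulation, and the similarity values are made up.

```python
from typing import Dict, List, Set

Sim = Dict[str, Dict[str, float]]  # sim[u][v]: similarity of v to u, for v != u


def knn(sim: Sim, k: int) -> Dict[str, List[str]]:
    """Direct neighborhoods: the k users most similar to each user."""
    return {
        u: sorted(row, key=row.get, reverse=True)[:k]
        for u, row in sim.items()
    }


def inverse_knn(sim: Sim, k: int) -> Dict[str, Set[str]]:
    """Inverse neighborhoods: v is in the inverse neighborhood of u
    if u is among the k nearest neighbors of v."""
    inverse: Dict[str, Set[str]] = {u: set() for u in sim}
    for v, neighbors in knn(sim, k).items():
        for u in neighbors:
            inverse[u].add(v)
    return inverse


# Hypothetical similarities among three users.
sim: Sim = {
    "u1": {"u2": 0.9, "u3": 0.1},
    "u2": {"u1": 0.8, "u3": 0.3},
    "u3": {"u1": 0.7, "u2": 0.2},
}
direct = knn(sim, k=1)
inverse = inverse_knn(sim, k=1)
```

In this construction every user takes part in exactly $k$ inverse neighborhoods, whereas with direct neighborhoods a few highly similar users can appear in very many of them.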


Outline

1. Motivation: beyond relevance

2. Measuring novelty and diversity

3. Enhancing novelty and diversity

4. Biases in recommendation



Think about novelty again

Would you find it useful to recommend these?

Why not?

[Images: a movie, a book, a tourist attraction, a music album]


We do not always want the same amount of novelty

(Kapoor et al. RecSys 2015, McAlister & Pessemier J. Cons. Res. 1982, etc.)


Degrees of popularity beyond the short head

What is the effect of popularity?

[Images: example items with ~300K, ~80K, ~13K, 25 and 5 ratings]


There is a relation between popularity and accuracy

Ratings are missing not at random (MNAR)

[Figure: user-item matrix (axes: Items, Users), split into popular items (short head) and rest of items (long tail); legend: observed user-item interaction, unobserved preference]

(Marlin & Zemel RecSys 2009, Steck KDD 2010, Steck RecSys 2011, etc.)


Ratings are missing not at random (MNAR)

[Figure: user-item matrix (axes: Items, Users), split into popular items (short head) and rest of items (long tail); legend: test data (relevant items), training data, unobserved preference; annotation: avg P@$k$ ∼ …]

(Marlin & Zemel RecSys 2009, Steck KDD 2010, Steck RecSys 2011, etc.)


[Figure: bar chart of nDCG@10 (scale 0 to 0.3) on MovieLens 1M for four recommenders: Random, Nr. positive ratings, User-based kNN, Matrix factorization]

(Cremonesi et al. RecSys 2010, etc.)


Popularity bias in recommendation algorithms

[Figure: three scatter plots of # times recommended in top 10 (vertical axis) vs. # positive ratings (horizontal axis, 0 to 2000); panels: Matrix factorization, Popularity, User-based kNN]

(Jannach et al. UMUAI 2015, Cañamares & Castells SIGIR 2017)


Can we trust our experiments?

Observed metric value (computed on available user taste observations) ≈? True metric value (computed with full knowledge of user tastes)

[Figure: two user-item matrices (axes: Items, Users), one per metric value; legend: relevant, non relevant, missing ratings]

(Cañamares & Castells SIGIR 2018)


Get rid of the popularity bias

In the data

[Figure: three test set designs. Flat test and Popularity strata (axes: Items, # ratings); Temporal split (axis: Time); legend: test data (relevant items), training data, unobserved preference]

(Bellogín et al. Inf. Ret. 2017)
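Two of these data-side designs can be sketched as split functions. The code below is one plausible reading of the labels "Temporal split" and "Flat test", not the cited paper's procedure: the function names, the `per_item` parameter and the toy ratings are assumptions.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Rating = Tuple[str, str, int]  # (user, item, timestamp)


def temporal_split(ratings: List[Rating], t_split: int):
    """Train on interactions before t_split, test on the ones from t_split on."""
    train = [r for r in ratings if r[2] < t_split]
    test = [r for r in ratings if r[2] >= t_split]
    return train, test


def flat_test_split(ratings: List[Rating], per_item: int, seed: int = 0):
    """Hold out the same number of ratings for every item, so that item
    popularity is flat in the test set. Items with too few ratings
    stay entirely in training."""
    rng = random.Random(seed)
    by_item = defaultdict(list)
    for r in ratings:
        by_item[r[1]].append(r)
    train, test = [], []
    for item_ratings in by_item.values():
        if len(item_ratings) <= per_item:
            train.extend(item_ratings)
            continue
        held_out = set(rng.sample(range(len(item_ratings)), per_item))
        for idx, r in enumerate(item_ratings):
            (test if idx in held_out else train).append(r)
    return train, test


# Hypothetical data: a popular item, a mid-popularity item and a rare item.
ratings: List[Rating] = (
    [("u%d" % n, "pop", n) for n in range(10)]
    + [("u%d" % n, "mid", 10 + n) for n in range(4)]
    + [("u0", "rare", 14)]
)
flat_train, flat_test = flat_test_split(ratings, per_item=2)
early, late = temporal_split(ratings, t_split=10)
```

A flat test set removes the advantage that popular items get from simply having more test ratings; the popularity bias in the training data remains.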


In the metrics

– Stratified recall

– Off-policy evaluation

– Inverse propensity scoring

– ···

Divide the relevance of items by the probability to be discovered:

$P = \frac{1}{|R|} \sum_{i \in R} \mathrm{rel}(i, u)\,\mathrm{obs}(i, u) \;\rightarrow\; \frac{1}{|R|} \sum_{i \in R} \frac{\mathrm{rel}(i, u)\,\mathrm{obs}(i, u)}{p(\mathrm{obs}(i, u))}$

Problem: how to estimate propensity

In the algorithms: unbiased learning

In the data: unbiased datasets, e.g. Yahoo! R3, CM100k

(Steck RecSys 2011, Schnabel et al. ICML 2016, Swaminathan et al. NIPS 2017, Yang et al. RecSys 2018, etc.)
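The inverse propensity correction of precision can be sketched as follows. The propensities are simply given here as a dictionary, which sidesteps the hard part (estimating them); the function names and toy values are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Pair = Tuple[str, str]  # (item, user)


def precision(R: Sequence[str], u: str,
              rel: Dict[Pair, float], obs: Dict[Pair, int]) -> float:
    """Plain precision on observed judgments: (1/|R|) * sum of rel * obs."""
    return sum(rel.get((i, u), 0.0) * obs.get((i, u), 0) for i in R) / len(R)


def precision_ips(R: Sequence[str], u: str,
                  rel: Dict[Pair, float], obs: Dict[Pair, int],
                  propensity: Dict[Pair, float]) -> float:
    """Inverse-propensity-scored precision: each observed relevant item
    is weighted by 1 / p(obs(i, u))."""
    total = 0.0
    for i in R:
        if obs.get((i, u), 0):
            total += rel.get((i, u), 0.0) / propensity[(i, u)]
    return total / len(R)


# Hypothetical example: a popular hit is more likely to be observed than a niche item.
R = ["hit", "niche", "other"]
rel = {("hit", "u"): 1.0, ("niche", "u"): 1.0}
obs = {("hit", "u"): 1, ("niche", "u"): 1}          # "other" was never observed
propensity = {("hit", "u"): 0.5, ("niche", "u"): 0.25}

plain = precision(R, "u", rel, obs)
corrected = precision_ips(R, "u", rel, obs, propensity)
```

The niche item counts twice as much as the hit because it was half as likely to be observed. A single corrected value can exceed 1; the correction is meant to hold on average over observation patterns, and self-normalized variants exist to bound it.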


Should we really get rid of the popularity bias?

[Figure: # interactions per item (axes: Items, # interactions), with two items $a$ and $b$ marked]

What made $a$ be so much more popular than $b$?

(Cañamares & Castells SIGIR 2018)


The popularity bias may not be “bad”

– If item discovery and user rating are aligned with relevance, popularity is a relevance signal

– Rational herd behavior

But it can distort evaluation

– If popularity is generated independently from relevance

– E.g. marketing, conformity, manipulation, randomness

Implications on state of the art algorithms

Can have unfair implications if tied to sensitive features

(Cañamares & Castells SIGIR 2018)


The recommendation feedback loop

Recommender systems bias themselves

– Self-reinforced (popularity) concentration

– Increasingly poor sales diversity

Biases in offline evaluation with the logged observations

External sources: search, browsing, questionnaires, etc.

[Diagram: feedback loop. Input data (observations) → Recommendation algorithm → Recommendation → Feedback → Input data (observations). Recommendation serves both learning (exploration) and satisfaction (exploitation)]

(Fleder & Hosanagar Mgt. Sci. 2009, Chaney et al. RecSys 2018, etc.)


Breaking the feedback loop: multi-armed bandits

Multi-armed bandit problem: choose an arm iteratively and maximize total payoff without knowing reward distributions in advance

1. Select arm

2. Get reward

3. Update estimated reward model of arm

[Diagram: a bandit policy interacts with a set of arms; each arm has a true (unobserved) reward distribution with mean $\mu$ and an estimated reward model]

(Sutton & Barto RL book 2018, Chapelle & Li NIPS 2011, etc.)
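The three-step loop can be illustrated with Thompson sampling on Bernoulli rewards, one of the policies studied in the cited literature. This is a minimal sketch: the Beta(1, 1) priors, the arm means and the number of rounds are arbitrary choices for the example.

```python
import random
from typing import List


def thompson_sampling(true_means: List[float], rounds: int, seed: int = 0) -> List[int]:
    """Bernoulli bandit with Beta(1, 1) priors; returns the pull count per arm."""
    rng = random.Random(seed)
    alpha = [1.0] * len(true_means)  # 1 + observed successes per arm
    beta = [1.0] * len(true_means)   # 1 + observed failures per arm
    pulls = [0] * len(true_means)
    for _ in range(rounds):
        # 1. Select arm: sample a plausible mean for each arm, play the best sample.
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        arm = max(range(len(samples)), key=samples.__getitem__)
        # 2. Get reward, drawn from the true (unobserved) distribution.
        reward = 1 if rng.random() < true_means[arm] else 0
        # 3. Update the estimated reward model of the selected arm.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Hypothetical arms with true mean rewards 0.1, 0.5 and 0.9.
pulls = thompson_sampling([0.1, 0.5, 0.9], rounds=2000)
```

Arms whose estimate is still uncertain keep being sampled occasionally, so the policy concentrates on the best arm without ever fully discarding the others.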


Recommendation keeps an ingredient of randomness (exploration) in its actions

– Aware (explicit model) of uncertainty in present knowledge about the user

– Gives apparently suboptimal options a chance to be reconsidered

Actions can be items, latent factors, clusters, neighbors, algorithms…

Do much better in the mid/long run!!

(Li et al. SIGIR 2016, Lacerda Neurocomputing 2017, McInerney et al. RecSys 2018, etc.)
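The effect of the "ingredient of randomness" can be seen by comparing a purely greedy policy with an ε-greedy one. ε-greedy is used here only because it is the simplest exploration scheme; it is not claimed to be the method of the cited papers, and the arm means and parameters are made up.

```python
import random
from typing import List


def epsilon_greedy(true_means: List[float], rounds: int,
                   epsilon: float, seed: int = 0) -> List[int]:
    """Play a Bernoulli bandit; returns the pull count per arm."""
    rng = random.Random(seed)
    n = len(true_means)
    pulls = [0] * n
    estimates = [0.0] * n  # running mean reward per arm
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # exploration: pick any arm at random
        else:
            # exploitation: best current estimate, ties go to the lowest index
            arm = max(range(n), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls


# Hypothetical arms; the first one played is the worst one.
greedy = epsilon_greedy([0.2, 0.5, 0.8], rounds=5000, epsilon=0.0)
explore = epsilon_greedy([0.2, 0.5, 0.8], rounds=5000, epsilon=0.1)
```

Without exploration the policy never leaves the first arm it tried, since the other arms' estimates stay at their initial value. With a small ε every arm gets a chance to be reconsidered.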


Conclusions

Novelty, bias and reinforcement learning are related problems

Novelty & diversity are now state of the art

– Different notions and metrics for different angles

Bias: popular items score high in accuracy in offline experiments

– Progress made in understanding it and seeking to avoid it

Reinforcement loop bias: multi-armed bandits and reinforcement learning can greatly help

– And improve sales diversity


Open directions

Large room for research, matters to industry

Better understand the role of novelty and diversity in user needs

Unbiased evaluation

– How to estimate propensity

– Model complex biases e.g. involving user pairs

– Build unbiased datasets

Multi-armed bandits and reinforcement learning

– How to map the task, algorithmic research

– How to evaluate methods and represent different scenarios

(Nguyen et al. WWW 2014, Kapoor et al. RecSys 2015, Karumur et al. CSCW 2016, etc.)


Thank you for your attention!

Questions?


References

Adamopoulos, P., Tuzhilin, A. On Unexpectedness in Recommender Systems: Or How to Expect the Unexpected. ACM TIST 5(4), Special Issue on Novelty and Diversity in Recommender Systems, January 2015.

Adomavicius, G., Kwon, Y. Improving Aggregate Recommendation Diversity Using Ranking-Based Techniques. IEEE Trans. on Knowl. and Data Eng. 24(5), May 2012.

Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S. Diversifying search results. WSDM 2009, Barcelona, Spain, pp. 5-14.

Anderson, C. The Long Tail: Why the Future of Business is Selling Less of More. Hyperion, New York, NY, USA, 2006.

Bellogín, A., Castells, P. and Cantador, I. Statistical Biases in Information Retrieval Metrics for Recommender Systems. Information Retrieval 20(6), July 2017, pp. 606-634.

Brickman, P., D’Amato, B. Exposure Effects in a Free Choice Situation. Journal of Personality and Social Psychology 32(3), 1975, pp. 415-420.

Cañamares, R. and Castells, P. Should I follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems. SIGIR 2018, Ann Arbor, MI, USA, pp. 415-424.

Cañamares, R. and Castells, P. A Probabilistic Reformulation of Memory-Based Collaborative Filtering – Implications on Popularity Biases. SIGIR 2017, Tokyo, Japan, pp. 215-224.

Cañamares, R., Redondo, M., Castells, P. Multi-Armed Recommender System Bandit Ensembles. RecSys 2019, Copenhagen, Denmark, pp. 432-436.


Carbonell, J. G. and Goldstein, J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998, Melbourne, Australia, pp. 335-336.

Castells, P., Hurley, N. J., Vargas, S. Novelty and Diversity in Recommender Systems. In: Recommender Systems Handbook, 2nd edition, F. Ricci, L. Rokach, B. Shapira (Eds.). Springer, New York, NY, USA, 2015, pp. 881-918.

Castells, P., Wang, J., Lara, R., Zhang, D. Workshop on novelty and diversity in recommender systems – DiveRS 2011. RecSys 2011, Chicago, Illinois, USA, pp. 393-394.

Celma, O. and Herrera, P. A New Approach to Evaluating Novel Recommendations. RecSys 2008, Lausanne, Switzerland, pp. 179-186.

Chaney, A. J. B., Stewart, B. M., Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. RecSys 2018, Vancouver, Canada, pp. 224-232.

Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., Wu, S-L. Intent-based diversification of web search results: metrics and algorithms. Information Retrieval 14(6), December 2011, pp. 572-592.

Chapelle, O. and Li, L. An empirical evaluation of Thompson Sampling. NIPS 2011, Granada, Spain, pp. 2249-2257.

Chen, H. and Karger, D. R. Less is More. 29th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2006). Seattle, WA, USA, pp. 429-436.

Clarke, C. L. A., Kolla, M., Cormack, G. V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I. Novelty and diversity in information retrieval evaluation. SIGIR 2008, Singapore, pp. 659-666.


Clarke, C. L. A., Craswell, N., Soboroff, I., Cormack, G. V. Overview of the TREC 2010 Web Track. TREC 2010, Gaithersburg, MD, USA.

Clarke, C. L. A., Craswell, N., Soboroff, I., Ashkan, A. A Comparative Analysis of Cascade Measures for Novelty and Diversity. WSDM 2011, Hong Kong, China, pp. 75-84.

Cremonesi, P., Koren, Y. and Turrin, R. Performance of recommender algorithms on top-n recommendation tasks. RecSys 2010, Barcelona, Spain, pp. 39-46.

Fleder, D. M. and Hosanagar, K. Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity. Management Science 55(5), May 2009, pp. 697-712.

Ge, M., Delgado-Battenfeld, C., Jannach, D. Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys 2010, Barcelona, Spain, pp. 257-260.

Hurley, N., Zhang, M. Novelty and Diversity in Top-N Recommendation – Analysis and Evaluation. ACM TOIT 10(4), March 2011.

Iaquinta, L., de Gemmis, M., Lops, P., Semeraro, G., Filannino, M., Molino, P. Introducing Serendipity in a Content-based Recommender System. HIS 2008, Barcelona, Spain, September 2008.

Jalili, M., Javari, A. Accurate and novel recommendations: An algorithm based on popularity forecasting. ACM TIST 5(4), Special Issue on Novelty and Diversity in Recommender Systems, Jan. 2015.

Jannach, D., Lerche, L., Kamehkhosh, I., Jugovac, M. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. UMUAI 25(5), Dec. 2015, pp. 427-491.


Kahn, B. E. Consumer variety-seeking among goods and services: An integrative review. Journal of Retailing and Consumer Services 2(3), July 1995, pp. 139-148.

Kaminskas, M., Bridge, D. Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems. ACM TIIS 7(1), March 2017.

Kapoor, K., Kumar, V., Terveen, L. G., Konstan, J. A., Schrater, P. R. “I like to explore sometimes”: Adapting to Dynamic User Novelty Preferences. RecSys 2015, Vienna, Austria, pp. 19-26.

Karumur, R. P., Nguyen, T. T., Konstan, J. A. Early Activity Diversity: Assessing Newcomer Retention from First-Session Activity. CSCW 2016, San Francisco, CA, USA, pp. 594-607.

Lacerda, A. Multi-Objective Ranked Bandits for Recommender Systems. Neurocomputing 246, July 2017, pp. 12-24.

Lathia, N., Hailes, S., Capra, L., Amatriain, X. Temporal Diversity in Recommender Systems. SIGIR 2010, Geneva, Switzerland, pp. 210-217.

Li, S., Karatzoglou, A. and Gentile, C. Collaborative Filtering Bandits. SIGIR 2016, Pisa, pp. 539-548.

Maddi, S. R. The Pursuit of Consistency and Variety. In Abelson, R. P. et al. (Eds.), Theories of Cognitive Consistency: A Sourcebook, Rand McNally, Chicago, 1968, pp. 61-85.

Marlin, B. M., Zemel, R. S. Collaborative prediction and ranking with non-random missing data. RecSys 2009, New York, NY, USA, pp. 5-12.

McAlister, L. Choosing Multiple Items from a Product Class. Journal of Consumer Research 6, December 1979, pp. 213-224.


McAlister, L., Pessemier, E. A. Variety seeking behavior: an interdisciplinary review. Journal of Consumer Research 9, December 1982.

McNee, S. M., Riedl, J., Konstan, J. A. Being Accurate is Not Enough: How Accuracy Metrics have hurt Recommender Systems. CHI 2006, Montréal, Canada, pp. 1097-1101.

McInerney, J., Lacker, B., Hansen, S., Higley, K., Bouchard, H., Gruson, A. and Mehrotra, R. Explore, exploit, and explain: personalizing explainable recommendations with bandits. RecSys 2018, Vancouver, Canada, pp. 31-39.

Mourão, F., Fonseca, C., Araújo, C., Meira Jr., W. The Oblivion Problem: Exploiting Forgotten Items to Improve Recommendation Diversity. Workshop on Novelty and Diversity in Recommender Systems (DiveRS 2011) at RecSys 2011, Chicago, Illinois, October 2011, pp. 27-34.

Murakami, T., Mori, K., Orihara, R. Metrics for Evaluating the Serendipity of Recommendation Lists. JSAI 2007, Miyazaki, Japan, June 2007. Also in Springer Verlag LNCS Vol. 4914, 2008, pp. 40-46.

Nguyen T. T., Hui, P-M., Harper, F. M., Terveen, L. G., Konstan, J. A. Exploring the filter bubble: the effect of using recommender systems on content diversity. WWW 2014, Seoul, Korea, pp. 677-686.

Onuma, K., Tong, H., Faloutsos, C. TANGENT: a novel, ‘Surprise me’, recommendation algorithm. KDD 2009, pp. 657-666.

Park, Y-J., Tuzhilin, A. The long tail of recommender systems and how to leverage it. RecSys 2008, Lausanne, Switzerland, pp. 11-18.

Patil, G. P., Taillie, C. Diversity as a Concept and its Measurement. Journal of the American Statistical Association 77(379), September 1982, pp. 548-561.


Raju, P. S. Optimum Stimulation Level: Its Relationship to Personality, Demographics and Exploratory Behavior. Journal of Consumer Research 7(3), December 1980, pp. 272-282.

Ribeiro, M. T., Lacerda, A., Veloso, A. and Ziviani, N. Pareto-efficient hybridization for multi-objective recommender systems. RecSys 2012, Dublin, Ireland, September 2012, pp. 19-26.

Salganik, M. J., Dodds, P. S. and Watts, D. J. Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market. Science 311(5762), February 2006, pp. 854-856.

Santos, R. L. T., Macdonald, C., Ounis, I. Exploiting query reformulations for web search result diversification. WWW 2010, Raleigh, NC, USA, April 2010, pp. 881-890.

Santos, R. L. T., Macdonald, C., Ounis, I. Search Result Diversification. Foundations and Trends in Information Retrieval 9(1), 2015.

Sanz-Cruzado, J., Castells, P. Enhancing Structural Diversity in Social Networks by Recommending Weak Ties. RecSys 2018, Vancouver, Canada, pp. 233-241.

Sanz-Cruzado, J., Castells, P., López, E. A Simple Multi-Armed Nearest-Neighbor Bandit for Interactive Recommendation. RecSys 2019, Copenhagen, Denmark, pp. 358-362.

Schnabel, T., Swaminathan, A., Singh, A., Chandak, N. and Joachims, T. Recommendations as Treatments: Debiasing Learning and Evaluation. ICML 2016, New York, NY, USA, pp. 1670-1679.

Shi, Y., Zhao, X., Wang, J., Larson, M., Hanjalic, A. Adaptive diversification of recommendation results via latent factor portfolio. SIGIR 2012, Portland, OR, USA, pp. 175-184.


Sinha, A., Gleich, D. F., Ramani, K. Deconvolving Feedback Loops in Recommender Systems. NIPS 2016, Barcelona, Spain, December 2016, pp. 3243-3251.

Smyth, B., McClave, P. Similarity vs. diversity. ICCBR 2001, London, UK, pp. 347-361.

Steck, H. Training and Testing of Recommender Systems on Data Missing not at Random. KDD 2010, Washington D. C., USA, pp. 713-722.

Steck, H. Item popularity and recommendation accuracy. RecSys 2011, Chicago, IL, pp. 125-132.

Sutton R. and Barto, A. Reinforcement Learning: An Introduction (2nd ed.). MIT Press, Cambridge, MA, USA, 2018.

Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D. and Zitouni, I. Off-policy Evaluation for Slate Recommendation. NIPS 2017, Long Beach, CA, USA, pp. 3635-3645.

Vallet, D. and Castells, P. Personalized Diversification of Search Results. SIGIR 2012, Portland, OR, USA, pp. 841-850.

Varadarajan, P. Product Diversity and Firm Performance: An Empirical Investigation. Journal of Marketing 50(3), July 1986, pp. 43-57.

Vargas, S., Castells, P. and Vallet, D. Intent-Oriented Diversity in Recommender Systems. SIGIR 2011, Beijing, China, pp. 1211-1212.

Vargas, S. and Castells, P. Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems. RecSys 2011. Chicago, Illinois, pp. 109-116.


Vargas, S. and Castells, P. Exploiting the Diversity of User Preferences for Recommendation. OAIR 2013, Lisbon, Portugal, May 2013.

Veloso, A., Ribeiro, M., Lacerda, A., Moura, E., Hata, I. and Ziviani, N. Multi-Objective Pareto-Efficient Approaches for Recommender Systems. ACM TIST 5(4), Special Issue on Novelty and Diversity in Recommender Systems, January 2015.

Yang, L., Cui, Y., Xuan, Y., Wang, C., Belongie, S. and Estrin, D. Unbiased Offline Recommender Evaluation for Missing-Not-At-Random Implicit Feedback. RecSys 2018, Vancouver, Canada, pp. 279-287.

Zhang, M. and Hurley, N. Avoiding Monotony: Improving the Diversity of Recommendation Lists. RecSys 2008, Lausanne, Switzerland, pp. 123-130.

Zhang, M., Hurley, N. Novel Item Recommendation by User Profile Partitioning. Web Intelligence 2009, pp. 508-515.

Zhang, Y. C., Ó Séaghdha, D., Quercia, D., Jambor, T. Auralist: introducing serendipity into music recommendation. WSDM 2012, Seattle, WA, USA, pp. 13-22.

Zhou, T., Kuscsik, Z., Liu, J-G., Medo, M., Wakeling, J. R., Zhang, Y-C. Solving the apparent diversity-accuracy dilemma of recommender systems. PNAS 107(10), March 2010, pp. 4511-4515.

Ziegler, C-N., McNee, S. M., Konstan, J. A., Lausen, G. Improving recommendation lists through topic diversification. WWW 2005, Chiba, Japan, pp. 22-32.