A talk I gave at the Netflix offices on July 2nd, 2012. Please do not use any of the slides or their contents without my explicit permission ([email protected] for inquiries).
The User Experience of Recommender Systems
Introduction: What I want to do today
INFORMATION AND COMPUTER SCIENCES
Introduction: Bart Knijnenburg
- Current:
  - Informatics PhD candidate, UC Irvine
  - Research intern, Samsung R&D
- Past:
  - Researcher & teacher, TU Eindhoven
  - MSc Human Technology Interaction, TU Eindhoven
  - M Human-Computer Interaction, Carnegie Mellon
Introduction: Bart Knijnenburg
- UMUAI paper: Explaining the User Experience of Recommender Systems (first UX framework in RecSys)
- Founder of UCERSTI: workshop on user-centric evaluation of recommender systems and their interfaces
  - this year: tutorial on user experiments
Introduction: My goal
Introduce my framework for user-centric evaluation of recommender systems
Explain how the framework can help in conducting user experiments

My approach:
- Start with “beyond the five stars”
- Explain how to improve upon this
- Introduce a 21st century approach to user experiments
Introduction
“What is a user experiment?”

“A user experiment is a scientific method to investigate how and why system aspects influence the users’ experience and behavior.”
Introduction: What I want to do today
Beyond the 5 stars: The Netflix approach
Towards user experience: How to improve upon the Netflix approach
Evaluation framework: A framework for user-centric evaluation
Measurement: Measuring subjective valuations
Analysis: Statistical evaluation of results
Some examples: Findings based on the framework
Beyond the 5 stars: The Netflix approach
Beyond the 5 stars
Offline evaluations may not give the same outcome as online evaluations
Cosley et al., 2002; McNee et al., 2002
Solution: Test with real users
Beyond the 5 stars
[Diagram: System (algorithm) -> Interaction (rating)]
Beyond the 5 stars
Higher accuracy does not always mean higher satisfaction
McNee et al., 2006
Solution: Consider other behaviors
“Beyond the five stars”
Beyond the 5 stars
[Diagram: System (algorithm) -> Interaction (rating, consumption, retention)]
Beyond the 5 stars
The algorithm accounts for only 5% of the relevance of a recommender system
Francisco Martin - RecSys 2009 keynote
Solution: test those other aspects
“Beyond the five stars”
Beyond the 5 stars
[Diagram: System (algorithm, interaction, presentation) -> Interaction (rating, consumption, retention)]
Beyond the 5 stars
Start with a hypothesis: Algorithm/feature/design X will increase member engagement and ultimately member retention
Design a test: Develop a solution or prototype; think about dependent & independent variables, control, significance…
Execute the test
Beyond the 5 stars
Let data speak for itself: many different metrics
We ultimately trust member engagement (e.g. hours of play) and retention
Is this good enough? I think it is not
Towards user experience: How to improve upon the Netflix approach
User experience
“Testing a recommender against a static videoclip system, the number of clicked clips and total viewing time went down!”
[Path model: personalized recommendations (OSA) increase perceived recommendation quality (SSA), which increases perceived system effectiveness and choice satisfaction (EXP); yet the recommender reduced the number of clips clicked and the total viewing time, while increasing the number of clips watched from beginning to end]
User experience
Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010
User experience
Behavior is hard to interpret: the relationship between behavior and satisfaction is not always trivial
User experience is a better predictor of long-term retention: with behavior only, you will need to run for a long time
Questionnaire data is more robust: fewer participants are needed
User experience
Measure subjective valuations with questionnaires: perception and experience
Triangulate these data with behavior: find out how they mediate the effect on behavior
Measure every step in your theory: create a chain of mediating variables
User experience
Perception: do users notice the manipulation?
- Understandability, perceived control
- Perceived recommendation diversity and quality
Experience: self-relevant evaluations of the system
- Process-related (e.g. choice difficulty)
- System-related (e.g. system effectiveness)
- Outcome-related (e.g. choice satisfaction)
User experience
[Framework diagram: System (algorithm, interaction, presentation) -> Perception (usability, quality, appeal) -> Experience (system, process, outcome) -> Interaction (rating, consumption, retention)]
User experience
Personal and situational characteristics may have an important impact
Adomavicius et al., 2005; Knijnenburg et al., 2012
Solution: measure those as well
User experience
[Framework diagram: System -> Perception -> Experience -> Interaction, now with Personal Characteristics (gender, privacy, expertise) and Situational Characteristics (routine, system trust, choice goal) added]
Evaluation framework: A framework for user-centric evaluation
Framework
Framework for user-centric evaluation of recommenders
Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012
[Framework diagram: System (algorithm, interaction, presentation) -> Perception (usability, quality, appeal) -> Experience (system, process, outcome) -> Interaction (rating, consumption, retention); Personal Characteristics (gender, privacy, expertise) and Situational Characteristics (routine, system trust, choice goal) moderate these relations]
Framework
Objective System Aspects (OSA)
- visual / interaction design
- recommender algorithm
- presentation of recommendations
- additional features
Framework
User Experience (EXP)
Different aspects may influence different things:
- interface -> system evaluations
- preference elicitation method -> choice process
- algorithm -> quality of the final choice
Framework
Perception or Subjective System Aspects (SSA)
Link OSA to EXP (mediation)
Increase the robustness of the effects of OSAs on EXP
How and why OSAs affect EXP
Framework
Personal Characteristics (PC)
Users’ general disposition
Beyond the influence of the system
Typically stable across tasks
Framework
Situational Characteristics (SC)
How the task influences the experience
Also beyond the influence of the system
Here used for evaluation, not for augmenting algorithms!
Framework
Interaction (INT)
Observable behavior:
- browsing, viewing, log-ins
Final step of evaluation, but not the goal of the evaluation
Triangulate with EXP
Framework
Has suggestions for measurement scales
e.g. How can we measure something like “satisfaction”?
Provides a good starting point for causal relations
e.g. How and why do certain system aspects influence the user experience?
Useful for integrating existing work
e.g. How do recommendation list length, diversity and presentation each influence the user experience?
Framework
“Statisticians, like artists, have the bad habit of falling in love with their models.”
George Box
Measurement: Measuring subjective valuations
Measurement
“To measure satisfaction, we asked users whether they liked the system (on a 5-point rating scale).”
Measurement
Does the question mean the same to everyone?
- John likes the system because it is convenient
- Mary likes the system because it is easy to use
- Dave likes it because the recommendations are good
A single question is not enough to establish content validity
We need a multi-item measurement scale
Measurement
Measure at least 5 items per concept: at least 2-3 should remain after removing bad items
It is most convenient to use statements to which the participant can agree/disagree on a 5- or 7-point scale:
“I am satisfied with this system.”
Completely disagree, somewhat disagree, neutral, somewhat agree, completely agree
Measurement
Use both positively and negatively phrased items
- They make the questionnaire less “leading”
- They help filter out bad participants
- They explore the “flip-side” of the scale

The word “not” is easily overlooked
Bad: “The recommendations were not very novel”
Better: “The recommendations were NOT very novel”
Best: “The recommendations felt outdated”
Measurement
Choose simple over specialized words: participants may have no idea they are using a “recommender system”
Use as few words as possible, but always use complete sentences
Measurement
Use design techniques that help improve recall and provide appropriate time references
Bad: “I liked the signup process” (one week ago)
Good: Provide a screenshot
Best: Ask the question immediately afterwards

Avoid double-barreled items
Bad: “The recommendations were relevant and fun”
Measurement
“We asked users ten 5-point scale questions and summed the answers.”
Measurement
Is the scale really measuring a single thing?
- 5 items measure satisfaction, the other 5 convenience
- The items are not related enough to make a reliable scale
Are two scales really measuring different things?
- They are so closely related that they actually measure the same thing
We need to establish convergent and discriminant validity
This makes sure the scales are unidimensional
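Before running a full factor analysis, a quick first diagnostic of whether a multi-item scale hangs together is Cronbach's alpha. A minimal sketch in plain numpy, on simulated questionnaire responses (all numbers hypothetical):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability of a multi-item scale.

    items: (n_participants, n_items) matrix of responses.
    alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score)
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Simulated 5-item satisfaction scale: every item reflects the same
# underlying attitude plus noise, so the scale should come out reliable.
rng = np.random.default_rng(0)
attitude = rng.normal(size=500)
responses = attitude[:, None] + 0.5 * rng.normal(size=(500, 5))
print(round(cronbach_alpha(responses), 2))
```

A low alpha would signal exactly the failure modes above: items that are not related enough to form one reliable scale.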
Measurement
Solution: factor analysis
- Define latent factors, specify how items “load” on them
- Factor analysis will determine how well the items “fit”
- It will give you suggestions for improvement
Benefits of factor analysis:
- Establishes convergent and discriminant validity
- Outcome is a normally distributed measurement scale
- The scale captures the “shared essence” of the items
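The factor-analysis step can be sketched with scikit-learn on simulated questionnaire data (the constructs, loadings, and sample sizes here are hypothetical; a confirmatory factor analysis in a dedicated package would be the more rigorous choice):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate questionnaires: 500 participants, two latent constructs
# ("satisfaction", "convenience"), five items loading on each.
rng = np.random.default_rng(1)
sat = rng.normal(size=(500, 1))
conv = rng.normal(size=(500, 1))
items = np.hstack([
    sat + 0.6 * rng.normal(size=(500, 5)),   # items 0-4: satisfaction
    conv + 0.6 * rng.normal(size=(500, 5)),  # items 5-9: convenience
])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
loadings = fa.components_.T  # (10 items x 2 factors)

# Convergent/discriminant validity: each item should load strongly on
# its own factor and only weakly on the other.
print(np.round(np.abs(loadings), 2))
```

Items with weak or cross-loadings are the candidates for removal.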
Measurement
[Factor analysis diagram: latent factors General privacy concerns, Control concerns, and Collection concerns, each measured by multiple questionnaire items]
Measurement
[The same factor model after removing poorly fitting items: fewer, better-fitting items remain per factor]
Analysis: Statistical evaluation of results
Analysis
Manipulation -> perception: Do these two algorithms lead to a different level of perceived quality?
T-test
[Bar chart: perceived quality for algorithms A and B]
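Assuming each participant's perceived-quality score has already been computed (e.g. as a factor score), the comparison boils down to one line of scipy; the data below are simulated:

```python
import numpy as np
from scipy import stats

# Hypothetical perceived-quality scores for participants who saw
# algorithm A vs. algorithm B (simulated effect of 0.5 SD).
rng = np.random.default_rng(2)
quality_a = rng.normal(loc=0.0, scale=1.0, size=200)
quality_b = rng.normal(loc=0.5, scale=1.0, size=200)

t, p = stats.ttest_ind(quality_a, quality_b)
# With this simulated effect, p should fall below .05.
print(f"t = {t:.2f}, p = {p:.4f}")
```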
Analysis
Perception -> experience: Does perceived quality influence system effectiveness?
Linear regression
[Scatterplot: system effectiveness as a function of recommendation quality]
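The regression step, again on hypothetical factor scores (simulated data, made-up effect size):

```python
import numpy as np
from scipy import stats

# Does perceived recommendation quality predict perceived system
# effectiveness? Simulated scores with a true slope of 0.6.
rng = np.random.default_rng(3)
quality = rng.normal(size=200)
effectiveness = 0.6 * quality + rng.normal(scale=0.8, size=200)

fit = stats.linregress(quality, effectiveness)
print(f"slope = {fit.slope:.2f}, r^2 = {fit.rvalue**2:.2f}, p = {fit.pvalue:.4f}")
```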
[Path model: personalized recommendations (OSA) -> perceived recommendation quality (SSA) -> perceived system effectiveness (EXP), both paths positive]
Analysis
Manipulation -> perception -> experience
Does the algorithm influence effectiveness via perceived quality?
Path model
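Setting the full SEM machinery aside, the logic of such a mediation test can be sketched with plain least squares on simulated data (the effect sizes are made up; a real analysis would also test the indirect effect's significance, e.g. by bootstrapping):

```python
import numpy as np

# Simulated experiment: the condition (personalized vs. random
# recommendations) affects effectiveness only THROUGH perceived
# quality, i.e. full mediation.
rng = np.random.default_rng(4)
n = 2000
condition = rng.integers(0, 2, size=n)          # OSA manipulation (0/1)
quality = 0.8 * condition + rng.normal(size=n)  # SSA
effect = 0.7 * quality + rng.normal(size=n)     # EXP

def ols(y, X):
    """Least-squares slope coefficients, with an intercept column added."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

a = ols(quality, condition)[0]                  # condition -> quality
b, c_prime = ols(effect, np.column_stack([quality, condition]))
print(f"a = {a:.2f}, b = {b:.2f}, direct c' = {c_prime:.2f}, indirect a*b = {a * b:.2f}")
```

A sizable indirect effect a*b with a near-zero direct effect c' is the mediation signature: the manipulation works via perceived quality.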
Analysis
Two manipulations -> perception:
What is the combined effect of list diversity and list length on perceived recommendation quality?
Factorial ANOVA
[Interaction plot: perceived quality for 5, 10, and 20 items, at low vs. high diversification]
Willemsen et al.: “Not just more of the same”, submitted to TiiS
Analysis
Manipulation x personal characteristic -> outcome
Do experts and novices rate these two interaction methods differently in terms of usefulness?
ANCOVA
[Plot: system usefulness as a function of domain knowledge, for case-based vs. attribute-based PE]
Knijnenburg & Willemsen: “Understanding the Effect of Adaptive Preference Elicitation Methods”, RecSys 2009
Analysis
Use only one method: structural equation models
- A combination of Factor Analysis and Path Models
- The statistical method of the 21st century
- Statistical evaluation of causal effects
- Report as causal paths
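A full SEM package (e.g. lavaan in R, or semopy in Python) estimates the measurement and structural parts simultaneously; purely as an illustration of the "factor analysis + path model" idea, here is a two-step stand-in on simulated data, not a substitute for real SEM:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Two latent variables, each measured by five noisy questionnaire items;
# the true structural path quality -> effectiveness is 0.7.
rng = np.random.default_rng(7)
n = 1000
quality = rng.normal(size=n)                                   # latent SSA
effectiveness = 0.7 * quality + rng.normal(scale=0.7, size=n)  # latent EXP
q_items = quality[:, None] + 0.5 * rng.normal(size=(n, 5))
e_items = effectiveness[:, None] + 0.5 * rng.normal(size=(n, 5))

def scores(items):
    """Step 1: one-factor scores; sign-align with the first item
    (factor orientation is arbitrary)."""
    s = FactorAnalysis(n_components=1).fit_transform(items).ravel()
    return s if np.corrcoef(s, items[:, 0])[0, 1] > 0 else -s

# Step 2: path regression between the factor scores.
q, e = scores(q_items), scores(e_items)
path = np.polyfit(q, e, 1)[0]
print(f"estimated path coefficient: {path:.2f}")
```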
Analysis
We compared three recommender systems with three different algorithms
The MF-I and MF-E algorithms are more effective. Why?
[Bar chart comparing the three algorithms; labels unrecoverable]
Analysis
The mediating variables show the entire story
Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012
[Structural equation model: the MF-E and MF-I matrix factorization recommenders (OSA, each vs. the generally most popular baseline, GMP) increase perceived recommendation variety and quality (SSA), which increase perceived system effectiveness (EXP); participant's age (PC) and behaviors such as number of items rated, number of items rated higher than the predicted rating, number of items clicked in the player, and rating of clicked items are also linked; coefficients and p-values omitted]
Analysis
Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012
Analysis
“All models are wrong, but some are useful.”
George Box
Some examples: Findings based on the framework
Some examples
Giving users many options:
- increases the chance they find something nice
- also increases the difficulty of making a decision
Result: choice overload
Solution:
- Provide short, diversified lists of recommendations
Some examples
[Structural equation model: Top-20 vs. Top-5 and Lin-20 vs. Top-5 recommendations (OSA) affect perceived recommendation variety and quality (SSA), which in turn affect choice difficulty and choice satisfaction (EXP); movie expertise (PC) also plays a role; coefficients and p-values omitted]
Some examples
[Plots: choice difficulty, perceived quality, and choice satisfaction for 5, 10, and 20 items, at low vs. high diversification]
Some examples
Experts and novices differ in their decision-making process
- Experts want to leverage their domain knowledge
- Novices make decisions by choosing from examples
Result: they want different preference elicitation methods
Solution: Tailor the preference elicitation method to the user
A needs-based approach seems to work well for everyone
Some examples
[Plot: system usefulness as a function of domain knowledge, for case-based vs. attribute-based PE]
Some examples
[Scatterplots: control, understandability, user interface satisfaction, perceived system effectiveness, and choice satisfaction, each as a function of domain knowledge, for five preference elicitation methods (TopN, Sort, Explicit, Implicit, Hybrid); significance markers omitted]
Some examples
Social recommenders limit the neighbors to the users’ friends
Result:
- More control: friends can be rated as well
- Better inspectability: recommendations can be traced
We show that each of these effects exists; they both increase overall satisfaction
Some examples
[Screenshots: rate likes; weigh friends]
Some examples
[Structural equation model: Control (item/friend vs. no control) and Inspectability (full graph vs. list only) (OSA) increase understandability, perceived control, and perceived recommendation quality (SSA), which increase satisfaction with the system (EXP), which in turn influences average rating and inspection time (INT); music expertise, familiarity with recommenders, and trusting propensity (PC) also matter; coefficients and R² values omitted]
Introduction: I discussed how to do user experiments with my framework
Beyond the 5 stars: The Netflix approach is a step in the right direction
Towards user experience: We need to measure subjective valuations as well
Evaluation framework: With the framework you can put this all together
Measurement: Measure subjective valuations with questionnaires and factor analysis
Analysis: Use structural equation models to test comprehensive causal models
Some examples: I demonstrated the findings of some studies that used this approach
See you in Dublin!
www.usabart.nl
@usabart
Resources: User-centric evaluation
Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the User Experience of Recommender Systems. UMUAI 2012.
Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys 2011.
Choice overload
Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not just more of the same: Preventing Choice Overload in Recommender Systems by Offering Small Diversified Sets. Submitted to TiiS.
Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding Choice Overload in Recommender Systems. RecSys 2010.
Resources: Preference elicitation methods
Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems. RecSys 2011.
Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation Methods on the User Experience of a Recommender System. CHI 2010.
Knijnenburg, B.P., Willemsen, M.C.: Understanding the effect of adaptive preference elicitation methods on user satisfaction of a recommender system. RecSys 2009.
Social recommenders
Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and Control in Social Recommender Systems. RecSys 2012.
Resources: User feedback and privacy
Knijnenburg, B.P., Kobsa, A: Making Decisions about Privacy: Information Disclosure in Context-Aware Recommender Systems. Submitted to TiiS.
Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and Providing Feedback: The User-Experience of a Recommender System. EC-Web 2010.