
A Journey into Evaluation: from Retrieval Effectiveness to User Engagement



Page 1: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
Mounia Lalmas, Yahoo Labs London, [email protected]

SPIRE 2015 – King’s College London

Page 2: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

This talk

§ Introduction to user engagement

§ Evaluation in information retrieval (retrieval effectiveness)

§ From retrieval effectiveness to user engagement
› from intra-session to inter-session evaluation
› from small- to large-scale evaluation

Page 3: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

This talk

beyond the click

beyond relevance

towards user engagement

Page 4: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

User engagement

Page 5: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

What is user engagement?

“User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently” (Attfield et al., 2011)

The emotional, cognitive and behavioural connection that exists, at any point in time and over time, between a user and a technological resource

Signals used to capture it:
• self-report: happy, sad, enjoyment, …
• analytics: click, upload, read, comment, share, …
• physiology: gaze, body heat, mouse movement, …


Page 6: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Why is it important to engage users?

§ In today’s wired world, users have enhanced expectations about their interactions with technology … resulting in increased competition amongst the purveyors and designers of interactive systems.

§ In addition to utilitarian factors, such as usability, we must consider the hedonic and experiential factors of interacting with technology, such as fun, fulfillment, play, and user engagement.

(O’Brien, Lalmas & Yom-Tov, 2014)

Page 7: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Online sites differ with respect to their engagement pattern

Games: users spend much time per visit

Search: users come frequently and do not stay long

Social media: users come frequently and stay long

Niche: users come on average once a week, e.g. for a weekly post

News: users come periodically, e.g. morning and evening

Service: users visit the site when needed, e.g. to renew a subscription

(Lehmann et al., 2012)

Page 8: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Characteristics of user engagement

Novelty (Webster & Ho, 1997; O’Brien, 2008)

Richness and control (Jacques et al., 1995; Webster & Ho, 1997)

Aesthetics (Jacques et al., 1995; O’Brien, 2008)

Endurability (Read, MacFarlane & Casey, 2002; O’Brien, 2008)

Focused attention (Webster & Ho, 1997; O’Brien, 2008)

Reputation, trust and expectation (Attfield et al., 2011)

Positive affect (O’Brien & Toms, 2008)

Motivation, interests, incentives, and benefits (Jacques et al., 1995; O’Brien & Toms, 2008)

(O’Brien, Lalmas & Yom-Tov, 2014)

Page 9: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Measuring user engagement

Measures and their attributes:

• Self-report (questionnaire, interview, think-aloud and think-after protocols): subjective; short- and long-term; lab and field; small scale

• Physiology (EEG, SCL, fMRI, eye tracking, mouse tracking): objective; short-term; lab and field; small and large scale

• Analytics (within- and across-session metrics, data science): objective; short- and long-term; field; large scale

Page 10: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Attributes of user engagement

§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (short- versus long-term)

We focus on:
1. Temporality: from intra- to inter-session
2. Scalability: from small- to large-scale

Page 11: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Evaluation in information retrieval

Page 12: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

How to evaluate a search engine

§ Coverage
§ Speed
§ Query language
§ User interface

§ User happiness
› Users find what they want and return to the search engine
› Users complete the search task, where search is a means, not an end

(Manning, Raghavan & Schütze, 2008, Sec. 8.6; Baeza-Yates & Ribeiro-Neto, 2011)

Page 13: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Within an online session

› July 2012
› 2.5M users
› 785M page views

› Categorization of the most frequently accessed sites
• 11 categories (e.g. news), 33 subcategories (e.g. news finance, news society)
• 760 sites from 70 countries/regions

short sessions: on average 3.01 distinct sites visited, with a revisitation rate of 10%
long sessions: on average 9.62 distinct sites visited, with a revisitation rate of 22%

(Lehmann et al., 2013)

Page 14: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Measuring user happiness

Most common proxy: relevance of search results (Sec. 8.1)

(Venn diagram: retrieved vs. relevant items within all items; precision = fraction of retrieved items that are relevant, recall = fraction of relevant items that are retrieved)

§ User information need translated into a query
§ Relevance assessed relative to the information need, not the query
§ Example:
› Information need: I am looking for a tennis holiday in a country with no rain
› Query: tennis academy good weather

Evaluation measures:
• precision, recall, R-precision, precision@n, mean average precision, F-measure, …
• bpref, cumulative gain, …
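As a concrete illustration of the set- and rank-based measures above, a minimal Python sketch (illustrative code, not from the talk; the document IDs are made up):

```python
# Illustrative sketch: set-based precision/recall and average precision.
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one result list."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def average_precision(ranked, relevant):
    """Average precision over a single ranked result list."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i        # precision@i at each relevant hit
    return score / len(relevant) if relevant else 0.0

# Hypothetical document IDs: d2 and d4 are the relevant ones
print(precision_recall(["d1", "d2", "d3"], ["d2", "d4"]))        # (0.33, 0.5)
print(average_precision(["d1", "d2", "d3", "d4"], ["d2", "d4"]))  # 0.5
```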

Page 15: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Measuring user happiness

Most common proxy: relevance of search results (Sec. 8.1)

Explicit signals: test collection methodology (TREC, CLEF, …), human-labeled corpora

Implicit signals: user behavior in online settings (clicks, skips, …)

Page 16: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Examples of implicit signals in web search

§ Number of clicks
§ Click at a given position
§ Time to first click
§ Skipping
§ Abandonment rate
§ Number of query reformulations
§ Dwell time

A sketch of deriving some of these signals from a click log follows.
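This is a minimal sketch only; the log schema and field names are assumptions, not from the talk:

```python
# Illustrative only: a simplified per-query log record with click
# timestamps and result positions (hypothetical schema).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QueryEvent:
    issued_at: float                                    # seconds since epoch
    click_times: List[float] = field(default_factory=list)
    click_positions: List[int] = field(default_factory=list)

def time_to_first_click(q: QueryEvent) -> Optional[float]:
    return min(q.click_times) - q.issued_at if q.click_times else None

def abandonment_rate(queries: List[QueryEvent]) -> float:
    """Fraction of queries with no click at all."""
    return sum(1 for q in queries if not q.click_times) / len(queries)

def skips(q: QueryEvent) -> int:
    """Results ranked above the deepest click but never clicked."""
    if not q.click_positions:
        return 0
    return max(q.click_positions) - len(set(q.click_positions))
```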

Page 17: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

What is a happy user in web search?
1. The user’s information need is satisfied
2. The user has learned about a topic, and even about other topics
3. The system was inviting and even fun to use

In-the-moment engagement: users are active on a site or stay long
Long-term engagement: users come back frequently and over a long period

USER ENGAGEMENT

Page 18: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Interpreting the signals

Page 19: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Click-through rate (CTR)

• new ranking algorithm
• new design of the search result page
• …
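For concreteness, a tiny sketch of comparing CTR between a control and a new ranking in an A/B test (the counts are invented, not from the talk):

```python
# Tiny sketch: CTR comparison between two ranking variants.
def ctr(clicks: int, impressions: int) -> float:
    return clicks / impressions if impressions else 0.0

control = ctr(clicks=4_200, impressions=100_000)   # current ranking
variant = ctr(clicks=4_550, impressions=100_000)   # new ranking algorithm
print(f"CTR {control:.2%} -> {variant:.2%}, "
      f"relative lift {(variant - control) / control:+.1%}")
```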

Page 20: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

I just wanted the phone number … I am totally happy ☺

No clicks

Page 21: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Dwell time

DWELL TIME used as a proxy of user experience
Setting: clicks on ads on mobile devices (publisher side)

Dwell time on non-mobile-optimized landing pages is comparable to, and even higher than, dwell time on mobile-optimized ones
… when mobile-optimized, users realize quickly whether they “like” the ad or not?

(Lalmas et al., 2015)

Page 22: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Relevance in multimedia search

Multimedia search activities are often driven by entertainment needs, not by information needs

(Slaney, 2011)

Page 23: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Explorative or serendipitous search

(Miliaraki, Blanco & Lalmas, 2015)

Page 24: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Objectivity versus subjectivity

Conditions compared: top most popular tweets vs. top most popular tweets + geographically diverse

Being from a central or peripheral location makes a difference: peripheral users did not perceive the timeline as being diverse

It should never be just about the algorithm, but also about how users respond to what the algorithm returns to them → USER ENGAGEMENT

(Eduardo Graells, 2015)

Page 25: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Let us revisit

Page 26: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Interactive Information Retrieval

(Ingwersen, Human Aspects in IR, ESSIR 2011)

USER ENGAGEMENT

Page 27: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Beyond clicks and relevance towards user engagement

§ From intra- to inter-session evaluation
› dwell time and absence time
› linking strategy
› mobile advertising

§ From small- to large-scale evaluation
› eye tracking and user engagement questionnaire
› mouse tracking and user engagement questionnaire

happy users come back
we need to properly identify the happy users

Page 28: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

From intra- to inter-session evaluation

Page 29: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

From short- to long-term engagement: From intra- to inter-session engagement

intra-session metric(s): how do users engage within a session? This is what we monitor.

inter-session metric(s): how do users engage across sessions? This is future engagement, what we want the monitored metrics to mean.

intra-session metrics serve as a proxy for future, inter-session engagement

Page 30: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

User engagement metrics

Page 31: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

User engagement metrics: intra-session and inter-session

intra-session metrics:
• Dwell time
• Session duration
• Bounce rate
• Play time (video)
• Mouse movement
• Click-through rate (CTR)
• Number of pages viewed (click depth)
• Conversion rate
• Amount of UGC (comments)
• …

Dwell time as a proxy of user interest, of relevance, of conversion, of post-click ad quality, …

Page 32: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Dwell time

§ Definition: the contiguous time spent on a site or web page

§ Similar measures: play time (for video sites)

§ Cons: not clear that the user was actually looking at the site while there → blur/focus

(figure: distribution of dwell times on 50 websites)

(O’Brien, Lalmas & Yom-Tov, 2014)
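A minimal sketch of computing dwell time from page visibility events, addressing the blur/focus caveat above (the event schema is an assumption, not from the talk):

```python
# Minimal sketch: dwell time as time actually in focus, derived from
# hypothetical (timestamp, kind) page events.
def dwell_time(events):
    """events: sorted (timestamp_sec, kind) pairs with kind in
    {'load', 'focus', 'blur', 'unload'}; returns focused seconds."""
    total, focused_since = 0.0, None
    for ts, kind in events:
        if kind in ("load", "focus"):
            focused_since = ts
        elif kind in ("blur", "unload") and focused_since is not None:
            total += ts - focused_since
            focused_since = None
    return total

events = [(0, "load"), (40, "blur"), (90, "focus"), (120, "unload")]
print(dwell_time(events))  # 70.0 seconds actually in focus
```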

Page 33: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Dwell time

Dwell time varies by site type:
• leisure sites tend to have longer dwell times than news, e-commerce, etc.

Dwell time has a relatively large variance, even for the same site (tourist, VIP, active … users)

(figure: dwell time on 50 websites)

(O’Brien, Lalmas & Yom-Tov, 2014)

Page 34: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Dwell time across sessions or absence time

Page 35: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

The context – search experience


Page 37: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Absence time and survival analysis

(figure: one survival curve per story, stories 1–9; x-axis: hours since the visit, 0–20; y-axis: proportion of users who have not yet returned, 0.0–1.0)

SURVIVE: users (%) who read story 2 but did not come back after 10 hours
DIE: users (%) who did come back

DIE = RETURN TO SITE → SHORT ABSENCE TIME
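The survival-analysis framing can be sketched with the lifelines library (illustrative durations, not the study’s data; here “death” means the user returned, so dying quickly is good):

```python
# Illustrative Kaplan-Meier sketch of absence time (not the talk's code).
from lifelines import KaplanMeierFitter

# absence durations in hours; returned=False marks users who had not
# yet come back when the observation window closed (censored)
durations = [1.5, 3.0, 4.2, 10.0, 12.5, 20.0, 20.0]
returned  = [True, True, True, True, True, False, False]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=returned, label="story 2 readers")
print(kmf.survival_function_)     # P(still absent) at each time point
print(kmf.median_survival_time_)  # typical absence time
```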

Page 38: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Absence time applied to search

Ranking functions on Yahoo Answers Japan search:
› two weeks of click data
› one million users
› six ranking functions
› 30-minute session boundary

A sketch of the sessionization and absence-time computation follows.
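This is a minimal sketch under an assumed input format (a per-user list of event timestamps), not the study’s pipeline:

```python
# Minimal sketch: split a user's events into sessions on a 30-minute
# gap, then read off absence times between consecutive sessions.
SESSION_GAP = 30 * 60  # 30-minute session boundary, in seconds

def sessions(timestamps):
    """Split sorted timestamps into sessions on gaps > SESSION_GAP."""
    out, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > SESSION_GAP:
            out.append(current)
            current = []
        current.append(ts)
    out.append(current)
    return out

def absence_times(timestamps):
    """Gaps between the end of one session and the start of the next."""
    s = sessions(sorted(timestamps))
    return [nxt[0] - cur[-1] for cur, nxt in zip(s, s[1:])]

ts = [0, 600, 7200, 7500, 90000]  # seconds
print(absence_times(ts))           # [6600, 82500]
```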

Page 39: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Absence time and number of clicks on the search result page

survival analysis: high hazard rate (die quickly) = short absence

(figure: survival curves for control = no click, 3 clicks, 5 clicks)

Page 40: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Absence time – search experience

1. No click means a bad user experience
2. Clicking on 3–5 results leads to the same user experience
3. Clicking on more than 5 results reflects a poorer user experience; users cannot find what they are looking for
4. Clicking lower in the ranking (2nd, 3rd) suggests a more careful choice by the user (compared to 1st)
5. Clicking at the bottom is a sign of a low-quality overall ranking
6. Users who find their answers quickly (time to 1st click) return sooner to the search application
7. Returning to the same search result page is a worse user experience than reformulating the query

search session metrics → absence time

(Dupret & Lalmas, 2013)

Page 41: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Others

Page 42: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

The context – linking strategy in online news

News provider offering related off-site content

(figure: p(absence 12h) for No Click vs. Off-site click)

Off-site link → absence time
Providing links to related off-site content has a positive long-term effect

(Lehmann et al., in progress)

Page 43: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

The context – mobile advertising

(figure: relative difference in subsequent ad clicks, 0%–600%, for short vs. long ad clicks)

Dwell time → ad click
Positive post-click experience (“long” clicks) has an effect on users clicking on ads again

(Lalmas et al., 2015)

Page 44: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Beyond clicks and relevance towards user engagement

§ From intra- to inter-session evaluation
› dwell time and absence time
› linking strategy
› mobile advertising

happy users come back

Page 45: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

From small- to large-scale evaluation

Page 46: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Small-scale measurement – focused attention questionnaire
5-point scale (strongly disagree to strongly agree)

1. I lost myself in this news task experience
2. I was so involved in my news tasks that I lost track of time
3. I blocked out things around me when I was completing the news tasks
4. When I was performing these news tasks, I lost track of the world around me
5. The time I spent performing these news tasks just slipped away
6. I was absorbed in my news tasks
7. During the news tasks experience I let myself go

(O'Brien & Toms, 2010)

Page 47: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Small-scale measurement – PANAS questionnaire (10 positive items and 10 negative items)

§ Indicate to what extent you feel this way right now, that is, at the present moment
[1 = very slightly or not at all; 2 = a little; 3 = moderately; 4 = quite a bit; 5 = extremely] [randomize items]

negative items: distressed, upset, guilty, scared, hostile, irritable, ashamed, nervous, jittery, afraid
positive items: interested, excited, strong, enthusiastic, proud, alert, inspired, determined, attentive, active

(Watson, Clark & Tellegen, 1988)

Page 48: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Small-scale measurement – gaze and self-reporting

News interest study: 57 users, reading tasks (114)
• questionnaire (qualitative data)
• eye-tracking recording (quantitative data)

Three metrics: gaze, focused attention and positive affect

All three metrics align: interesting content promotes all engagement metrics

(Arapakis et al., 2014)

Page 49: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

From small- to large-scale measurement – mouse tracking

§ Navigation and interaction with a digital environment usually involve the use of a mouse (selecting, positioning, clicking)

§ Several works show the mouse cursor to be a weak proxy of gaze (attention)

§ Low-cost, scalable alternative

§ Can be performed in a non-invasive manner, without removing users from their natural setting

Page 50: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Relevance, dwell time & cursor

“reading” a relevant long document vs “scanning” a long non-relevant document

(Guo & Agichtein, 2012)

Page 51: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

“Ugly” vs “Normal” interface

(screenshots: BBC News and Wikipedia variants)

Page 52: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Mouse tracking and self-reporting

§ 324 users from Amazon Mechanical Turk (between-subject design)
§ Two tasks (reading and search)
§ “Normal” vs “Ugly” interface

§ Questionnaires (qualitative data)
› focused attention, positive affect
› interest, aesthetics

§ Mouse tracking (quantitative data)
› movement speed, movement rate, click rate, pause length, percentage of time still (a sketch of such features follows)

(Warnock & Lalmas, 2015)
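A minimal sketch of computing cursor-trace features of this kind; the trace format and the stillness threshold are assumptions, not from the study:

```python
# Minimal sketch: movement speed and percentage of time still from a
# cursor trace of (x_px, y_px, t_sec) samples sorted by time.
import math

def cursor_features(trace, still_eps=2.0):
    dist = moving_time = still_time = 0.0
    for (x0, y0, t0), (x1, y1, t1) in zip(trace, trace[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        dist += step
        if step < still_eps:               # below threshold: cursor at rest
            still_time += t1 - t0
        else:
            moving_time += t1 - t0
    duration = trace[-1][2] - trace[0][2]
    return {
        "movement_speed_px_per_sec": dist / moving_time if moving_time else 0.0,
        "pct_time_still": still_time / duration if duration else 0.0,
    }

trace = [(0, 0, 0.0), (0, 1, 0.5), (120, 80, 1.0), (121, 80, 2.0)]
print(cursor_features(trace))
```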

Page 53: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Mouse tracking could not tell much about:

• focused attention and positive affect
• user interest in the task/topic
• aesthetics

BUT
› the “ugly” variant did not result in lower user aesthetics scores
› although BBC > Wikipedia

BUT – the comments left …
› Wikipedia: “The website was simply awful. Ads flashing everywhere, poor text colors on a dark blue background.”; “The webpage was entirely blue. I don't know if it was supposed to be like that, but it definitely detracted from the browsing experience.”
› BBC News: “The website's layout and color scheme were a bitch to navigate and read.”; “Comic sans is a horrible font.”

Page 54: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Flawed methodology? Non-existent signal? Wrong metric? Wrong measure?

§ Hawthorne effect

§ Design
› usability versus engagement
› within- versus between-subject

§ Mouse movement features were not sophisticated enough

Page 55: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Mouse gestures → features

(figure: a cursor trajectory sampled as points (x0,y0) … (x8,y8) over time t, with resting-cursor segments (500 ms, 1000 ms, 1500 ms) and clicks marked; axes in pixels)

(Arapakis, Lalmas & Valkanas, 2014)

22 users reading two articles
176,550 cursor positions
2,913 mouse gestures

Page 56: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Towards a taxonomy of mouse gestures for user engagement measurement

§ The top-ranked clustering configuration is spectral clustering on the original dataset, with a hyperbolic tangent kernel, for k = 38 (an illustrative sketch follows)

• certain types of mouse gestures occur more or less often, depending on user interest in the article
• significant correlations between certain types of mouse gestures and self-report measures
• cursor behaviour goes beyond measuring frustration: it informs about both positive and negative interaction
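An illustrative sketch of such a clustering setup with scikit-learn (not the paper’s code; the features are random stand-ins, and clipping the tanh kernel to non-negative values is our assumption, since spectral clustering expects a non-negative affinity):

```python
# Illustrative sketch: spectral clustering of gesture feature vectors
# with a hyperbolic tangent (sigmoid) kernel, k = 38.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import sigmoid_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))      # stand-in gesture feature vectors

# tanh(gamma * <x, y> + coef0); negatives clipped to zero (assumption)
affinity = np.clip(sigmoid_kernel(X, gamma=0.1, coef0=0.0), 0.0, None)

labels = SpectralClustering(
    n_clusters=38, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(np.bincount(labels))          # gestures per cluster
```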

Page 57: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Beyond clicks and relevance towards user engagement

§ From small- to large-scale evaluation
› eye tracking and user engagement questionnaire
› mouse tracking and user engagement questionnaire

we need to properly identify the happy users

Page 58: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Towards user engagement

Page 59: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Towards User Engagement

happy users come back

we need to properly identify the happy users

Page 60: A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

§  “If you cannot measure it, you cannot improve it” William Thomson (Lord Kelvin)

§  “You cannot control what you cannot measure” DeMarco

§  “The way you measure is more important than what you measure” Art Gust

Thank you