A Journey into Evaluation: from Retrieval Effectiveness to User Engagement

Mounia Lalmas, Yahoo Labs London

SPIRE 2015 – King’s College London

This talk

§ Introduction to user engagement

§ Evaluation in information retrieval (retrieval effectiveness)

§ From retrieval effectiveness to user engagement
› from intra-session to inter-session evaluation
› from small- to large-scale evaluation

This talk

beyond the click

beyond relevance

towards user engagement

User engagement

What is user engagement?

“User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently” (Attfield et al, 2011)

self-report: happy, sad, enjoyment, …

emotional, cognitive and behavioural connection that exists, at any point in time and over time, between a user and a technological resource

analytics: click, upload, read, comment, share …

physiology: gaze, body heat, mouse movement, …


Why is it important to engage users?

§ In today’s wired world, users have enhanced expectations about their interactions with technology … resulting in increased competition amongst the purveyors and designers of interactive systems.
§ In addition to utilitarian factors, such as usability, we must consider the hedonic and experiential factors of interacting with technology, such as fun, fulfillment, play, and user engagement.

(O’Brien, Lalmas & Yom-Tov, 2014)

Online sites differ with respect to their engagement pattern

Games Users spend much time per visit

Search Users come frequently and do not stay long

Social media Users come frequently and stay long

Niche Users come on average once a week e.g. weekly post

News Users come periodically, e.g. morning and evening

Service Users visit site, when needed, e.g. to renew subscription

(Lehmann et al, 2012)

Characteristics of user engagement

Novelty (Webster & Ho, 1997; O’Brien, 2008)
Richness and control (Jacques et al, 1995; Webster & Ho, 1997)
Aesthetics (Jacques et al, 1995; O’Brien, 2008)
Endurability (Read, MacFarlane, & Casey, 2002; O’Brien, 2008)
Focused attention (Webster & Ho, 1997; O’Brien, 2008)
Reputation, trust and expectation (Attfield et al, 2011)
Positive affect (O’Brien & Toms, 2008)
Motivation, interests, incentives, and benefits (Jacques et al, 1995; O’Brien & Toms, 2008)


Measuring user engagement

Measures | Examples | Attributes
Self-report | Questionnaire, interview, think-aloud and think-after protocols | Subjective; short- and long-term; lab and field; small scale
Physiology | EEG, SCL, fMRI, eye tracking, mouse tracking | Objective; short-term; lab and field; small and large scale
Analytics | Within- and across-session metrics, data science | Objective; short- and long-term; field; large scale

Attributes of user engagement

§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (short- versus long-term)

We focus on
1. Temporality: from intra- to inter-session
2. Scalability: from small- to large-scale

Evaluation in information retrieval

How to evaluate a search engine

§ Coverage
§ Speed
§ Query language
§ User interface
§ User happiness
› Users find what they want and return to the search engine
› Users complete the search task, where search is a means, not an end

Sec. 8.6

(Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)

Within an online session

› July 2012
› 2.5M users
› 785M page views
› Categorization of the most frequently accessed sites
• 11 categories (e.g. news), 33 subcategories (e.g. news finance, news society)
• 760 sites from 70 countries/regions

Short sessions: on average 3.01 distinct sites visited, with a revisitation rate of 10%
Long sessions: on average 9.62 distinct sites visited, with a revisitation rate of 22%

(Lehmann et al, 2013)

Measuring user happiness

Most common proxy: relevance of search results

Sec. 8.1

[Figure: Venn diagram of the relevant and retrieved sets within all items]

§ User information need translated into a query
§ Relevance assessed relative to the information need, not the query
§ Example:
› Information need: I am looking for a tennis holiday in a country with no rain
› Query: tennis academy good weather

Evaluation measures:
• precision, recall, R-precision; precision@n; mean average precision; F-measure; …
• bpref; cumulative gains, …
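As an illustration of some of the measures listed above, here is a minimal sketch that computes precision@n, recall, F-measure and average precision for one ranked result list. The document identifiers and the relevance judgments are made up for the example; mean average precision would be the mean of `average_precision` over a set of queries.

```python
from typing import Sequence, Set


def precision_at(ranking: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n retrieved items that are relevant."""
    if n <= 0:
        return 0.0
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def recall(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of all relevant items that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranking and relevance judgments, for illustration only.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d5"}

p5 = precision_at(ranking, relevant, 5)
r = recall(ranking, relevant)
print(p5, r, f_measure(p5, r), average_precision(ranking, relevant))
```

Note that relevance here is judged against the information need, not the query, as stated above; the code only consumes the resulting judgments.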


Measuring user happiness

Most common proxy: relevance of search results

Sec. 8.1

Explicit signals
› Test collection methodology (TREC, CLEF, …)
› Human-labeled corpora

Implicit signals
› User behavior in online settings (clicks, skips, …)

Examples of implicit signals in web search

§ Number of clicks
§ Click at given position
§ Time to first click
§ Skipping
§ Abandonment rate
§ Number of query reformulations
§ Dwell time
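A few of these signals can be derived from a search log along the following lines. This is only a sketch: the `Impression` record and its fields are hypothetical, and abandonment is taken here to mean a result page with no click, which, as the next slides point out, is not always a bad outcome.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Impression:
    """One search result page shown to a user (hypothetical log record)."""
    query: str
    # Each click is (seconds since the page was shown, rank of the clicked result).
    clicks: List[Tuple[float, int]] = field(default_factory=list)


def abandonment_rate(log: List[Impression]) -> float:
    """Share of result pages that received no click at all."""
    return sum(1 for imp in log if not imp.clicks) / len(log)


def mean_clicks(log: List[Impression]) -> float:
    """Average number of clicks per result page."""
    return sum(len(imp.clicks) for imp in log) / len(log)


def mean_time_to_first_click(log: List[Impression]) -> Optional[float]:
    """Average delay before the first click, over pages with at least one click."""
    firsts = [min(t for t, _ in imp.clicks) for imp in log if imp.clicks]
    return sum(firsts) / len(firsts) if firsts else None


# Made-up log for illustration.
log = [
    Impression("tennis academy good weather", [(4.0, 1)]),
    Impression("restaurant phone number"),
    Impression("user engagement", [(10.0, 3), (2.0, 2)]),
    Impression("spire 2015"),
]

print(abandonment_rate(log), mean_clicks(log), mean_time_to_first_click(log))
```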

What is a happy user in web search?
1. The user’s information need is satisfied
2. The user has learned about a topic and even about other topics
3. The system was inviting and even fun to use

In-the-moment engagement: users are active on a site or stay long
Long-term engagement: users come back frequently and over a long period

USER ENGAGEMENT

Interpreting the signals

Click-through rate (CTR)

› new ranking algorithm
› new design of search result page
› …

No clicks: “I just wanted the phone number … I am totally happy :)”

Dwell time

Dwell time used as a proxy of user experience

Publisher click on an ad on mobile device

Dwell time on non-optimized landing pages is comparable to, and even higher than, dwell time on mobile-optimized ones

… when the landing page is mobile-optimized, do users realize quickly whether they “like” the ad or not?

[Figure: non-mobile-optimized versus mobile-optimized landing pages]

(Lalmas et al, 2015)

Relevance in multimedia search

Multimedia search activities are often driven by entertainment needs, not by information needs

(Slaney, 2011)

Explorative or serendipitous search

(Miliaraki, Blanco & Lalmas, 2015)

most popular tweets versus most popular tweets + geographically diverse

Being from a central or peripheral location makes a difference. Peripheral users did not perceive the timeline as being diverse

Objectivity versus subjectivity

It should never be just about the algorithm, but also how users respond to what the algorithm returns to them → USER ENGAGEMENT

(Eduardo Graells, 2015)

Let us revisit

Interactive Information Retrieval

(Ingwersen, Human Aspects in IR, ESSIR 2011)

USER ENGAGEMENT

Beyond clicks and relevance towards user engagement

§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising

§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire

happy users come back

we need to properly identify the happy users

From intra- to inter-session evaluation

From short- to long-term engagement: From intra- to inter-session engagement

intra-session metric(s)

inter-session metric(s)

how do users engage within a session?

how do users engage across sessions?

We monitor

We know what it will mean

future engagement

proxy

User engagement metrics

intra-session metrics
• Dwell time
• Session duration
• Bounce rate
• Play time (video)
• Mouse movement
• Click-through rate (CTR)
• Number of pages viewed (click depth)
• Conversion rate
• Number of UGC (comments)
• …

Dwell time as a proxy of user interest
Dwell time as a proxy of relevance
Dwell time as a proxy of conversion
Dwell time as a proxy of post-click ad quality
…
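Several of the intra-session metrics above can be computed directly from page-view timestamps. The sketch below assumes a session is a time-ordered list of `(timestamp in seconds, url)` pairs and that a bounce is a session with a single page view; both are assumptions for the example. The dwell time on the last page of a session cannot be observed from page views alone, so it is left out.

```python
from typing import List, Tuple

Session = List[Tuple[float, str]]  # time-ordered (timestamp in seconds, url)


def dwell_times(session: Session) -> List[float]:
    """Time spent on each page, taken as the gap until the next page view."""
    times = [t for t, _ in session]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def session_duration(session: Session) -> float:
    """Time between the first and the last page view of the session."""
    return session[-1][0] - session[0][0]


def click_depth(session: Session) -> int:
    """Number of pages viewed in the session."""
    return len(session)


def bounce_rate(sessions: List[Session]) -> float:
    """Share of sessions with a single page view."""
    return sum(1 for s in sessions if len(s) == 1) / len(sessions)


# Made-up sessions for illustration.
sessions = [
    [(0, "/home"), (30, "/news/a"), (150, "/news/b")],
    [(0, "/home")],
]

print(dwell_times(sessions[0]), session_duration(sessions[0]),
      click_depth(sessions[0]), bounce_rate(sessions))
```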

User engagement metrics

intra-session

inter-session

Dwell time

§ Definition: the contiguous time spent on a site or web page
§ Similar measures: play time (for video sites)
§ Cons: not clear that the user was actually looking at the site while there → blur/focus

[Figure: distribution of dwell times on 50 websites]


Dwell time

Dwell time varies by site type:
• leisure sites tend to have longer dwell times than news, e-commerce, etc.

Dwell time has a relatively large variance even for the same site

[Figure: dwell time on 50 websites]

(tourists, VIP, active … users)


Dwell time across sessions or absence time

The context – search experience


Absence time and survival analysis

[Figure: survival curves for story 1 to story 9; x-axis: hours (0 to 20); y-axis: 0.0 to 1.0. Annotations: users (%) who did come back; users (%) who read story 2 but did not come back after 10 hours]

SURVIVE versus DIE: DIE = RETURN TO SITE → SHORT ABSENCE TIME

Absence time applied to search

Ranking function on Yahoo Answer Japan

› Two weeks of click data on Yahoo Answer Japan search
› One million users
› Six ranking functions
› 30-minute session boundary

survival analysis: high hazard rate (die quickly) = short absence
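To make the link between absence time and survival analysis concrete, the following sketch derives absence times from one user's event timestamps using a 30-minute session boundary, then estimates a survival curve with the Kaplan-Meier estimator, where "dying" means returning to the site. The estimator is a standard choice swapped in for illustration; the slides do not say which survival model was used, and the sample data are made up.

```python
from typing import List, Tuple

SESSION_GAP = 30 * 60  # 30-minute session boundary, in seconds


def absence_times(timestamps: List[float], gap: float = SESSION_GAP) -> List[float]:
    """Gaps between consecutive events that are long enough to separate two sessions."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:]) if later - earlier > gap]


def kaplan_meier(durations: List[float], returned: List[bool]) -> List[Tuple[float, float]]:
    """Estimate S(t), the probability that a user is still absent after time t.

    returned[i] is False when user i had not come back by the end of the
    observation period (a censored observation).
    """
    event_times = sorted({d for d, r in zip(durations, returned) if r})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, r in zip(durations, returned) if r and d == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up example: absence times in hours for five users, two of them censored.
curve = kaplan_meier([2, 4, 4, 6, 10], [True, True, False, True, False])
print(curve)
print(absence_times([0, 600, 1200, 8400, 8500]))
```

A curve that drops quickly corresponds to a high hazard rate, that is, users who come back after a short absence.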

Absence time and number of clicks on search result page

[Figure: curves for 3 clicks and 5 clicks, compared with control = no click]

Absence time – search experience

1. No click means a bad user experience
2. Clicking between 3-5 results leads to same user experience
3. Clicking on more than 5 results reflects poorer user experience; users cannot find what they are looking for
4. Clicking lower in the ranking (2nd, 3rd) suggests more careful choice from the user (compared to 1st)
5. Clicking at bottom is a sign of low quality overall ranking
6. Users finding their answers quickly (time to 1st click) return sooner to the search application
7. Returning to the same search result page is a worse user experience than reformulating the query

search session metrics → absence time

(Dupret & Lalmas, 2013)

The context – Linking strategy in online news

[Figure labels: News provider; Others; Related off-site content]

[Figure: p(absence 12h) for No click versus Off-site click]

Off-site link → absence time

Providing links to related off-site content has a positive long-term effect

(Lehmann et al, In Progress)

The context – Mobile advertising

[Figure: ad click difference (0% to 600%) for short ad clicks versus long ad clicks]

Dwell time → ad click

Positive post-click experience (“long” clicks) has an effect on users clicking on ads again

(Lalmas et al, 2015)


§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising


From small- to large-scale evaluation

Small scale measurement – focused attention questionnaire

5-point scale (strongly disagree to strongly agree)

1. I lost myself in this news tasks experience
2. I was so involved in my news tasks that I lost track of time
3. I blocked things out around me when I was completing the news tasks
4. When I was performing these news tasks, I lost track of the world around me
5. The time I spent performing these news tasks just slipped away
6. I was absorbed in my news tasks
7. During the news tasks experience I let myself go

(O'Brien & Toms, 2010)
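One simple way to turn the seven ratings into a single focused attention score is to average them. The slide does not state how the items are aggregated, so the mean used below is an assumption, as is the function name.

```python
from typing import Sequence


def focused_attention_score(ratings: Sequence[int]) -> float:
    """Mean of the seven item ratings, each on a 1-5 agreement scale (assumed scoring)."""
    if len(ratings) != 7:
        raise ValueError("expected one rating per questionnaire item (7 items)")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


# Made-up answers from one participant.
print(focused_attention_score([4, 5, 3, 4, 4, 5, 3]))
```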

Small scale measurement – PANAS questionnaire (10 positive items and 10 negative items)

§ You feel this way right now, that is, at the present moment
[1 = very slightly or not at all; 2 = a little; 3 = moderately; 4 = quite a bit; 5 = extremely]
[randomize items]

Negative items: distressed, upset, guilty, scared, hostile, irritable, ashamed, nervous, jittery, afraid
Positive items: interested, excited, strong, enthusiastic, proud, alert, inspired, determined, attentive, active

(Watson, Clark & Tellegen, 1988)
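PANAS is usually scored by summing the ratings of the 10 positive items and of the 10 negative items separately, giving two scores that each range from 10 to 50. The sketch below follows that scheme and also shuffles the items for presentation; the function names and the sample answers are made up for the example.

```python
import random
from typing import Dict, List, Tuple

POSITIVE = ["interested", "excited", "strong", "enthusiastic", "proud",
            "alert", "inspired", "determined", "attentive", "active"]
NEGATIVE = ["distressed", "upset", "guilty", "scared", "hostile",
            "irritable", "ashamed", "nervous", "jittery", "afraid"]


def presentation_order(seed: int) -> List[str]:
    """Return all 20 items in a random order for one participant."""
    items = POSITIVE + NEGATIVE
    random.Random(seed).shuffle(items)
    return items


def panas_scores(responses: Dict[str, int]) -> Tuple[int, int]:
    """Return (positive affect, negative affect) from ratings on the 1-5 scale."""
    missing = [item for item in POSITIVE + NEGATIVE if item not in responses]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    positive = sum(responses[item] for item in POSITIVE)
    negative = sum(responses[item] for item in NEGATIVE)
    return positive, negative


# Made-up participant: "quite a bit" on every positive item, "a little" on every negative one.
responses = {item: 4 for item in POSITIVE}
responses.update({item: 2 for item in NEGATIVE})
print(panas_scores(responses))
```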

Small scale measurement – gaze and self-reporting

News interest

57 users, reading task (114)

• questionnaire (qualitative data)
• eye-tracking recording (quantitative data)

Three metrics: gaze, focused attention and positive affect

All three metrics align: interesting content promotes all engagement metrics

(Arapakis et al, 2014)

From small- to large-scale measurement – mouse tracking

§ Navigation and interaction with a digital environment usually involves the use of a mouse (selecting, positioning, clicking)
§ Several works show the mouse cursor to be a weak proxy of gaze (attention)
§ Low-cost, scalable alternative
§ Can be performed in a non-invasive manner, without removing users from their natural setting

Relevance, dwell time & cursor

“reading” a relevant long document vs “scanning” a long non-relevant document

(Guo & Agichtein, 2012)

“Ugly” vs “Normal” interface

BBC News

Wikipedia

Mouse tracking and self-reporting

§ 324 users from Amazon Mechanical Turk (between-subject design)
§ Two tasks (reading and search)
§ “Normal vs Ugly” interface
§ Questionnaires (qualitative data)
› focused attention, positive affect
› interest, aesthetics
§ Mouse tracking (quantitative data)
› movement speed, movement rate, click rate, pause length, percentage of time still

(Warnock & Lalmas, 2015)
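Aggregate cursor features of the kind listed above could be computed roughly as follows. The exact definitions used in the study are not given on the slide, so the ones below (speed as distance over total time, stillness as intervals with no displacement, click rate as clicks per second) are assumptions, as is the `(time, x, y)` sample format.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[float, float, float]  # (time in seconds, x in pixels, y in pixels)


def cursor_features(samples: List[Sample], n_clicks: int) -> Dict[str, float]:
    """Aggregate mouse-tracking features for one page visit (assumed definitions)."""
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        raise ValueError("need at least two samples spanning a positive duration")
    distance = 0.0    # total path length travelled by the cursor
    still_time = 0.0  # total time during which the cursor did not move
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        distance += step
        if step == 0:
            still_time += t1 - t0
    return {
        "movement_speed": distance / duration,          # pixels per second
        "percent_time_still": 100 * still_time / duration,
        "click_rate": n_clicks / duration,              # clicks per second
    }


# Made-up cursor trace: two moves of 5 pixels, separated by 3 seconds of rest.
trace = [(0, 0, 0), (1, 3, 4), (2, 3, 4), (4, 3, 4), (5, 6, 8)]
print(cursor_features(trace, n_clicks=1))
```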

Mouse tracking could not tell much about
• focused attention and positive affect
• user interests in the task/topic
• aesthetics

BUT
› “ugly” variant did not result in lower USER aesthetics scores
› although BBC > Wikipedia

BUT – the comments left …
› Wikipedia: “The website was simply awful. Ads flashing everywhere, poor text colors on a dark blue background.”; “The webpage was entirely blue. I don't know if it was supposed to be like that, but it definitely detracted from the browsing experience.”
› BBC News: “The website's layout and color scheme were a bitch to navigate and read.”; “Comic sans is a horrible font.”

Flawed methodology? Non-existing signal? Wrong metric? Wrong measure?

§ Hawthorne Effect

§ Design
› Usability versus engagement
› Within- versus between-subject

§ Mouse movement was not sophisticated enough

Mouse Gestures → Features

[Figure: cursor trajectory through points (x0, y0) to (x8, y8) over time t, with Δt rest intervals marked; plot of cursor x and y positions marking resting cursor (500ms), resting cursor (1000ms), resting cursor (1500ms) and click]

(Arapakis, Lalmas & Valkanas, 2014)

› 22 users reading two articles
› 176,550 cursor positions
› 2,913 mouse gestures
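One plausible way to go from raw cursor positions to gestures is to cut the trace wherever the cursor rests for longer than a threshold. The sketch below assumes that samples are logged only when the cursor moves, so a rest shows up as a time gap between consecutive samples; the 500 ms default echoes the shortest rest marked in the figure, but the actual segmentation rule used in the study is not given here.

```python
from typing import List, Tuple

Sample = Tuple[int, int, int]  # (time in milliseconds, x, y)


def split_gestures(samples: List[Sample], rest_ms: int = 500) -> List[List[Sample]]:
    """Split a cursor trace into gestures at every rest of at least rest_ms."""
    gestures: List[List[Sample]] = []
    current: List[Sample] = []
    for sample in samples:
        # A long enough gap since the previous sample closes the current gesture.
        if current and sample[0] - current[-1][0] >= rest_ms:
            gestures.append(current)
            current = []
        current.append(sample)
    if current:
        gestures.append(current)
    return gestures


# Made-up trace with a 700 ms rest and a 1600 ms rest.
trace = [(0, 0, 0), (100, 5, 5), (200, 10, 10),
         (900, 10, 12), (1000, 12, 15), (2600, 20, 20)]
print(len(split_gestures(trace)), len(split_gestures(trace, rest_ms=1000)))
```

Each resulting gesture can then be described by features (duration, distance, speed and so on) before clustering.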

Towards a taxonomy of mouse gestures for user engagement measurement

§  The top-ranked clustering configuration is the Spectral Clustering for the original dataset, with hyperbolic tangent kernel, for k = 38

•  certain types of mouse gestures occur more or less often, depending on user interest in article

•  significant correlations between certain types of mouse gestures and self-report measures

• cursor behaviour goes beyond measuring frustration
• informs about positive and negative interaction


§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire

we need to properly identify the happy users

Towards user engagement


§  “If you cannot measure it, you cannot improve it” William Thomson (Lord Kelvin)

§  “You cannot control what you cannot measure” DeMarco

§  “The way you measure is more important than what you measure” Art Gust

Thank you
