A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
Mounia Lalmas, Yahoo Labs London
[email protected]
SPIRE 2015 – King’s College London
This talk
§ Introduction to user engagement
§ Evaluation in information retrieval (retrieval effectiveness)
§ From retrieval effectiveness to user engagement (from intra-session to inter-session evaluation)
(from small- to large-scale evaluation)
This talk
beyond the click
beyond relevance
towards user engagement
User engagement
What is user engagement?
“User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently” (Attfield et al, 2011)
emotional, cognitive and behavioural connection that exists, at any point in time and over time, between a user and a technological resource
self-report: happy, sad, enjoyment, …
analytics: click, upload, read, comment, share …
physiology: gaze, body heat, mouse movement, …
Why is it important to engage users?
§ In today’s wired world, users have enhanced expectations about their interactions with technology … resulting in increased competition amongst the purveyors and designers of interactive systems.
§ In addition to utilitarian factors, such as usability, we must consider the hedonic and experiential factors of interacting with technology, such as fun, fulfillment, play, and user engagement.
(O’Brien, Lalmas & Yom-Tov, 2014)
Online sites differ with respect to their engagement pattern
Games Users spend much time per visit
Search Users come frequently and do not stay long
Social media Users come frequently and stay long
Niche Users come on average once a week e.g. weekly post
News Users come periodically, e.g. morning and evening
Service Users visit site, when needed, e.g. to renew subscription
(Lehmann et al, 2012)
Characteristics of user engagement
Novelty (Webster & Ho, 1997; O’Brien, 2008)
Richness and control (Jacques et al, 1995; Webster & Ho, 1997)
Aesthetics (Jacques et al, 1995; O’Brien, 2008)
Endurability (Read, MacFarlane, & Casey, 2002; O’Brien, 2008)
Focused attention (Webster & Ho, 1997; O’Brien, 2008)
Reputation, trust and expectation (Attfield et al, 2011)
Positive affect (O’Brien & Toms, 2008)
Motivation, interests, incentives, and benefits (Jacques et al, 1995; O’Brien & Toms, 2008)
(O’Brien, Lalmas & Yom-Tov, 2014)
Measuring user engagement
Measures and their attributes:
§ Self-report (questionnaire, interview, think-aloud and think-after protocols): subjective; short- and long-term; lab and field; small scale
§ Physiology (EEG, SCL, fMRI, eye tracking, mouse tracking): objective; short-term; lab and field; small and large scale
§ Analytics (within- and across-session metrics, data science): objective; short- and long-term; field; large scale
Attributes of user engagement
§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (short- versus long-term)
We focus on
1. Temporality: from intra- to inter-session
2. Scalability: from small- to large-scale
Evaluation in information retrieval
How to evaluate a search engine
§ Coverage
§ Speed
§ Query language
§ User interface
§ User happiness
› Users find what they want and return to the search engine
› Users complete the search task, where search is a means, not an end
Sec. 8.6
(Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)
Within an online session
› July 2012 › 2.5M users › 785M page views
› Categorization of the most frequently accessed sites
• 11 categories (e.g. news), 33 subcategories (e.g. news finance, news society)
• 760 sites from 70 countries/regions
short sessions: average 3.01 distinct sites visited with revisitation rate 10%
long sessions: average 9.62 distinct sites visited with revisitation rate 22%
(Lehmann et al, 2013)
Measuring user happiness
Most common proxy: relevance of search results
Sec. 8.1
[Figure: Venn diagram of relevant and retrieved items within all items, illustrating precision and recall]
§ User information need translated into a query
§ Relevance assessed relative to the information need, not the query
§ Example:
› Information need: I am looking for a tennis holiday in a country with no rain
› Query: tennis academy good weather
Evaluation measures:
• precision, recall, R-precision; precision@n; mean average precision; F-measure; …
• bpref; cumulative gains, …
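The set-based and rank-based measures named above can be sketched as follows. This is a minimal illustration, not code from the talk: the function names, the document identifiers and the toy ranking are all assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n ranked documents that are relevant."""
    relevant_set = set(relevant)
    return sum(1 for doc in ranking[:n] if doc in relevant_set) / n


def average_precision(ranking, relevant):
    """Mean of precision values taken at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant_set:
            hits += 1
            total += hits / rank
    return total / len(relevant_set)


# Toy example (hypothetical document ids): d6 is relevant but not retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p, r = precision_recall(ranking, relevant)
```

Mean average precision is then the mean of `average_precision` over a set of queries.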
Measuring user happiness
Most common proxy: relevance of search results
Sec. 8.1
Explicit signals
› Test collection methodology (TREC, CLEF, …)
› Human labeled corpora
Implicit signals
› User behavior in online settings (clicks, skips, …)
Examples of implicit signals in web search
§ Number of clicks
§ Click at given position
§ Time to first click
§ Skipping
§ Abandonment rate
§ Number of query reformulations
§ Dwell time
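A few of the implicit signals listed above can be derived from a click log along these lines. This is a sketch under stated assumptions: the `Impression` record and its fields are hypothetical, and a real search log would look different.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Impression:
    """One result page shown for a query (hypothetical log record)."""
    query_time: float  # seconds
    # Each click is (rank position, click time in seconds).
    clicks: List[Tuple[int, float]] = field(default_factory=list)


def implicit_signals(impressions: List[Impression]) -> dict:
    """Aggregate abandonment rate, clicks per query and time to first click."""
    n = len(impressions)
    clicked = [imp for imp in impressions if imp.clicks]
    abandonment_rate = (n - len(clicked)) / n if n else 0.0
    mean_clicks = sum(len(imp.clicks) for imp in impressions) / n if n else 0.0
    first_click_delays = [
        min(t for _, t in imp.clicks) - imp.query_time for imp in clicked
    ]
    mean_delay: Optional[float] = (
        sum(first_click_delays) / len(first_click_delays)
        if first_click_delays else None
    )
    return {
        "abandonment_rate": abandonment_rate,
        "mean_clicks_per_query": mean_clicks,
        "mean_time_to_first_click": mean_delay,
    }


log = [
    Impression(0.0, [(1, 4.0), (3, 20.0)]),
    Impression(100.0),                 # no click: abandoned
    Impression(200.0, [(2, 210.0)]),
    Impression(300.0),                 # no click: abandoned
]
signals = implicit_signals(log)
```

An abandoned query is not necessarily a failure: the answer may already be visible on the result page, so these numbers need interpreting rather than reading at face value.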
What is a happy user in web search
1. The user information need is satisfied
2. The user has learned about a topic and even about other topics
3. The system was inviting and even fun to use
USER ENGAGEMENT
In-the-moment engagement: users active on a site or stayed long
Long-term engagement: users come back frequently and over a long-term period
Interpreting the signals
Click-through rates (CTR)
› new ranking algorithm
› new design of search result page
› …
No clicks
› “I just wanted the phone number … I am totally happy :)”
Dwell time
Dwell time used as a proxy of user experience
Publisher click on an ad on mobile device
Dwell time on non-optimized landing pages comparable and even higher than on mobile-optimized ones
… when mobile optimized, users realize quickly whether they “like” the ad or not?
(Lalmas et al, 2015)
[Figure: dwell time on non-mobile-optimized versus mobile-optimized landing pages]
Multimedia search activities often driven by entertainment needs, not by information needs
Relevance in multimedia search
(Slaney, 2011)
Explorative or serendipitous search
(Miliaraki, Blanco & Lalmas, 2015)
top most popular tweets versus top most popular tweets + geographically diverse
Being from a central or peripheral location makes a difference. Peripheral users did not perceive the timeline as being diverse
Objectivity versus subjectivity
It should never be just about the algorithm, but also how users respond to what the algorithm returns to them → USER ENGAGEMENT
(Eduardo Graells, 2015)
Let us revisit
Interactive Information Retrieval
(Ingwersen, Human Aspects in IR, ESSIR 2011)
USER ENGAGEMENT
Beyond clicks and relevance towards user engagement
§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising
§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire
happy users come back
we need to properly identify the happy users
From intra- to inter-session evaluation
From short- to long-term engagement: From intra- to inter-session engagement
intra-session metric(s): how do users engage within a session?
inter-session metric(s): how do users engage across sessions?
[Diagram labels: “We monitor”, “We know what it will mean”, “proxy”, “future engagement”]
User engagement metrics
intra-session metrics
• Dwell time
• Session duration
• Bounce rate
• Play time (video)
• Mouse movement
• Click-through rate (CTR)
• Number of pages viewed (click depth)
• Conversion rate
• Number of UGC (comments)
• …
Dwell time as a proxy of user interest
Dwell time as a proxy of relevance
Dwell time as a proxy of conversion
Dwell time as a proxy of post-click ad quality
…
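Three of the intra-session metrics above (bounce rate, click depth, session duration) can be computed from page-view timestamps roughly as follows. The input format is an assumption made for this sketch: each session is a non-empty, time-ordered list of page-view times in seconds.

```python
def intra_session_metrics(sessions):
    """Compute simple intra-session metrics.

    sessions: list of sessions; each session is a non-empty, time-ordered
    list of page-view timestamps in seconds (assumed format).
    """
    n = len(sessions)
    if n == 0:
        return {"bounce_rate": 0.0, "click_depth": 0.0, "session_duration": 0.0}
    # A bounce is a session with a single page view.
    bounces = sum(1 for s in sessions if len(s) == 1)
    # Click depth: number of pages viewed per session.
    depth = sum(len(s) for s in sessions) / n
    # Duration from first to last page view; time spent on the last page
    # is unknown from page views alone, so single-view sessions count as 0.
    duration = sum(s[-1] - s[0] for s in sessions) / n
    return {
        "bounce_rate": bounces / n,
        "click_depth": depth,
        "session_duration": duration,
    }


example_sessions = [[0], [0, 30, 90], [10, 70]]
metrics = intra_session_metrics(example_sessions)
```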
User engagement metrics
intra-session
inter-session
Dwell time
§ Definition
The contiguous time spent on a site or web page
§ Similar measures
Play time (for video sites)
§ Cons
Not clear that the user was actually looking at the site while there → blur/focus
[Figure: distribution of dwell times on 50 websites]
(O’Brien, Lalmas & Yom-Tov, 2014)
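The blur/focus caveat on dwell time can be illustrated with a small sketch that separates contiguous dwell time from the time the page actually had focus. The event names (`load`, `blur`, `focus`, `unload`) and the assumption that a page has focus when it loads are illustrative choices, not details from the talk.

```python
def dwell_times(events):
    """Return (contiguous_dwell, focused_dwell) in seconds.

    events: time-ordered list of (timestamp_seconds, kind) with kind in
    {"load", "blur", "focus", "unload"}; the first event is the page load
    and the last one ends the visit (assumed format).
    """
    start, end = events[0][0], events[-1][0]
    contiguous = end - start
    focused = 0.0
    focus_start = start  # assumption: the page has focus when it loads
    for t, kind in events[1:]:
        if kind == "blur" and focus_start is not None:
            focused += t - focus_start
            focus_start = None
        elif kind == "focus" and focus_start is None:
            focus_start = t
        elif kind == "unload" and focus_start is not None:
            focused += t - focus_start
            focus_start = None
    return contiguous, focused


visit = [(0, "load"), (20, "blur"), (80, "focus"), (100, "unload")]
contiguous, focused = dwell_times(visit)
```

In this toy visit the page was open for 100 seconds but held focus for only 40, which is the gap the slide warns about.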
Dwell time
Dwell time varies by site type:
• leisure sites tend to have longer dwell times than news, e-commerce, etc.
Dwell time has a relatively large variance even for the same site (tourists, VIP, active … users)
[Figure: dwell time on 50 websites]
(O’Brien, Lalmas & Yom-Tov, 2014)
Dwell time across sessions or absence time
The context – search experience
Absence time and survival analysis
[Figure: survival curves for story 1 to story 9; x-axis: hours (0 to 20); y-axis: 0.0 to 1.0. Annotations: users (%) who did come back; users (%) who read story 2 but did not come back after 10 hours]
SURVIVE versus DIE
DIE = RETURN TO SITE → SHORT ABSENCE TIME
Absence time applied to search
Ranking function on Yahoo Answer Japan
§ Two weeks of click data on Yahoo Answer Japan: search
§ One million users
§ Six ranking functions
§ 30-minute session boundary
survival analysis: high hazard rate (die quickly) = short absence
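The two ingredients above, a 30-minute session boundary and a survival view of absence time, can be sketched as follows. The code uses the standard Kaplan-Meier estimator as a stand-in; the talk does not say which estimator or model was used, and the input formats and toy numbers are assumptions.

```python
SESSION_GAP = 30 * 60  # 30-minute session boundary, in seconds


def absence_times(timestamps, gap=SESSION_GAP):
    """Gaps between consecutive sessions of one user.

    timestamps: activity times in seconds. Any pause longer than `gap`
    ends a session; the pause itself is the absence time.
    """
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:]) if b - a > gap]


def kaplan_meier(observations):
    """Kaplan-Meier survival curve for absence times.

    observations: list of (duration, returned). returned=True means the
    user came back after `duration` ("die"); returned=False means the
    user had not come back when observation stopped (censored).
    Returns [(t, S(t))] at each distinct return time, where S(t) is the
    estimated fraction of users still absent just after t.
    """
    obs = sorted(observations)
    at_risk = len(obs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        returns = removed = 0
        while i < len(obs) and obs[i][0] == t:
            returns += 1 if obs[i][1] else 0
            removed += 1
            i += 1
        if returns:
            survival *= 1 - returns / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


gaps = absence_times([0, 600, 1200, 10000, 10300, 50000])
# Toy absence times in hours: three users returned, two were censored.
curve = kaplan_meier([(2, True), (5, True), (5, False), (8, True), (20, False)])
```

A curve that drops quickly corresponds to a high hazard rate, that is, users returning after a short absence.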
Absence time and number of clicks on search result page
[Figure: curves for 3 clicks and 5 clicks, relative to control = no click]
Absence time – search experience
1. No click means a bad user experience
2. Clicking between 3-5 results leads to same user experience
3. Clicking on more than 5 results reflects poorer user experience; users cannot find what they are looking for
4. Clicking lower in the ranking (2nd, 3rd) suggests more careful choice from the user (compared to 1st)
5. Clicking at bottom is a sign of low quality overall ranking
6. Users finding their answers quickly (time to 1st click) return sooner to the search application
7. Returning to the same search result page is a worse user experience than reformulating the query
search session metrics → absence time
(Dupret & Lalmas, 2013)
Others
Related off-site content
The context – Linking strategy in online news
News provider
[Figure: p(absence 12h) for no click versus off-site click]
Off-site link → absence time
Providing links to related off-site content has a positive long-term effect
(Lehmann et al, in progress)
The Context – Mobile advertising
[Figure: ad click difference (0% to 600%) for short ad clicks versus long ad clicks]
Dwell time → ad click
Positive post-click experience (“long” clicks) has an effect on users clicking on ads again
(Lalmas et al, 2015)
Beyond clicks and relevance towards user engagement
§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising
happy users come back
From small- to large-scale evaluation
Small scale measurement – focused attention questionnaire
5-point scale (strongly disagree to strongly agree)
1. I lost myself in this news tasks experience
2. I was so involved in my news tasks that I lost track of time
3. I blocked things out around me when I was completing the news tasks
4. When I was performing these news tasks, I lost track of the world around me
5. The time I spent performing these news tasks just slipped away
6. I was absorbed in my news tasks
7. During the news tasks experience I let myself go
(O'Brien & Toms, 2010)
Small scale measurement – PANAS questionnaire (10 positive items and 10 negative items)
§ You feel this way right now, that is, at the present moment
[1 = very slightly or not at all; 2 = a little; 3 = moderately; 4 = quite a bit; 5 = extremely]
[randomize items]
Negative items: distressed, upset, guilty, scared, hostile, irritable, ashamed, nervous, jittery, afraid
Positive items: interested, excited, strong, enthusiastic, proud, alert, inspired, determined, attentive, active
(Watson, Clark & Tellegen, 1988)
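Scoring the PANAS questionnaire can be sketched as below: each scale is the sum of its ten item ratings, so each score lies between 10 and 50. The item lists come from the slide; the function names and the fixed seed used to randomize presentation order are assumptions for the example.

```python
import random

POSITIVE_ITEMS = [
    "interested", "excited", "strong", "enthusiastic", "proud",
    "alert", "inspired", "determined", "attentive", "active",
]
NEGATIVE_ITEMS = [
    "distressed", "upset", "guilty", "scared", "hostile",
    "irritable", "ashamed", "nervous", "jittery", "afraid",
]


def presentation_order(seed=None):
    """Return all 20 items in random order, as the slide asks."""
    rng = random.Random(seed)
    items = POSITIVE_ITEMS + NEGATIVE_ITEMS
    return rng.sample(items, len(items))


def score_panas(ratings):
    """Sum the ratings per scale.

    ratings: dict mapping each of the 20 items to an integer from 1
    (very slightly or not at all) to 5 (extremely).
    Returns (positive_affect, negative_affect).
    """
    for item in POSITIVE_ITEMS + NEGATIVE_ITEMS:
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"rating for {item!r} must be between 1 and 5")
    positive = sum(ratings[item] for item in POSITIVE_ITEMS)
    negative = sum(ratings[item] for item in NEGATIVE_ITEMS)
    return positive, negative


# Toy participant: "quite a bit" on positive items, "a little" on negative.
answers = {item: 4 for item in POSITIVE_ITEMS}
answers.update({item: 2 for item in NEGATIVE_ITEMS})
positive_affect, negative_affect = score_panas(answers)
```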
Small scale measurement – gaze and self-reporting
News interest
57 users, reading task (114)
• questionnaire (qualitative data)
• recorded eye tracking (quantitative data)
Three metrics: gaze, focused attention and positive affect
All three metrics align: interesting content promotes all engagement metrics
(Arapakis et al, 2014)
From small- to large-scale measurement – mouse tracking
§ Navigation & interaction with a digital environment usually involves the use of a mouse (selecting, positioning, clicking)
§ Several works show the mouse cursor as a weak proxy of gaze (attention)
§ Low-cost, scalable alternative
§ Can be performed in a non-invasive manner, without removing users from their natural setting
Relevance, dwell time & cursor
“reading” a relevant long document vs “scanning” a long non-relevant document
(Guo & Agichtein, 2012)
“Ugly” vs “Normal” Interface
BBC News
Wikipedia
Mouse tracking and self-reporting
§ 324 users from Amazon Mechanical Turk (between-subject design)
§ Two tasks (reading and search)
§ “Normal vs Ugly” interface
§ Questionnaires (qualitative data)
› focused attention, positive affect
› interest, aesthetics
§ Mouse tracking (quantitative data)
› movement speed, movement rate, click rate, pause length, percentage of time still
(Warnock & Lalmas, 2015)
Mouse tracking could not tell much about
• focused attention and positive affect
• user interests in the task/topic
• aesthetics
BUT
› “ugly” variant did not result in lower user aesthetics scores
› although BBC > Wikipedia
BUT – the comments left …
› Wikipedia: “The website was simply awful. Ads flashing everywhere, poor text colors on a dark blue background.”; “The webpage was entirely blue. I don't know if it was supposed to be like that, but it definitely detracted from the browsing experience.”
› BBC News: “The website's layout and color scheme were a bitch to navigate and read.”; “Comic sans is a horrible font.”
Flawed methodology? Non-existing signal? Wrong metric? Wrong measure?
§ Hawthorne Effect
§ Design
› Usability versus engagement
› Within- versus between-subject
§ Mouse movement was not sophisticated enough
Mouse Gestures → Features
[Figure: cursor trajectory sampled as points (x0, y0) to (x8, y8) over time t, split into gestures at resting periods (Δt rest); scatter plot of cursor x and y positions marking resting cursor (500ms, 1000ms, 1500ms) and clicks]
(Arapakis, Lalmas & Valkanas, 2014)
22 users reading two articles 176,550 cursor positions 2,913 mouse gestures
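One plausible way to turn raw cursor positions into mouse gestures is to cut the trajectory wherever the cursor rests for at least a threshold, then describe each gesture with simple features. This is an illustrative sketch, not the authors' procedure: the sample format, the 500 ms default and the feature names are assumptions.

```python
import math


def split_gestures(samples, rest_ms=500):
    """Split a cursor log into gestures at resting periods.

    samples: time-ordered list of (t_ms, x, y), assumed to be recorded
    only while the cursor moves, so a time gap of at least `rest_ms`
    between consecutive samples means the cursor was resting.
    """
    gestures, current = [], []
    for sample in samples:
        if current and sample[0] - current[-1][0] >= rest_ms:
            gestures.append(current)
            current = []
        current.append(sample)
    if current:
        gestures.append(current)
    return gestures


def gesture_features(gesture):
    """Duration, path length and mean speed of one gesture."""
    duration = gesture[-1][0] - gesture[0][0]
    length = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(gesture, gesture[1:])
    )
    speed = length / duration if duration > 0 else 0.0
    return {
        "duration_ms": duration,
        "path_length_px": length,
        "speed_px_per_ms": speed,
    }


cursor_log = [(0, 0, 0), (100, 3, 4), (200, 6, 8), (1000, 6, 8), (1100, 6, 18)]
gestures = split_gestures(cursor_log)
features = [gesture_features(g) for g in gestures]
```

Feature vectors of this kind are what a clustering step would then group into gesture types.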
Towards a taxonomy of mouse gestures for user engagement measurement
§ The top-ranked clustering configuration is the Spectral Clustering for the original dataset, with hyperbolic tangent kernel, for k = 38
• certain types of mouse gestures occur more or less often, depending on user interest in article
• significant correlations between certain types of mouse gestures and self-report measures
• cursor behaviour goes beyond measuring frustration • inform about the positive and negative interaction
Beyond clicks and relevance towards user engagement
§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire
we need to properly identify the happy users
Towards user engagement
happy users come back
we need to properly identify the happy users
§ “If you cannot measure it, you cannot improve it” William Thomson (Lord Kelvin)
§ “You cannot control what you cannot measure” DeMarco
§ “The way you measure is more important than what you measure” Art Gust
Thank you