usage mining techniques with applications to web search and content recommendation Aristides Gionis Yahoo! Research, Barcelona yandex aug 31, 2012

Aris Gionis: Methods for analyzing user behavior and their application in web search and recommendation


Scientific and technical seminar "Smart web search: not only finds, but also recommends", August 31, 2012. Aris Gionis, senior research scientist, Yahoo! Research, Barcelona.


Page 1: Aris Gionis, Methods for analyzing user behavior and their application in web search and recommendation

usage mining techniques with applications to web search and content recommendation

Aristides Gionis

Yahoo! Research, Barcelona

yandex aug 31, 2012

Page 2

yahoo! research, barcelona

web mining

social media and multimedia

large-scale distributed systems

user engagement

semantic web

Page 3

web mining in yahoo! research

themes

usage mining and query-log mining

social network analysis and graph mining

influence propagation

other data mining problems

data sources

- query logs (search) and toolbar (browsing)

- social networks (flickr, messenger, email, ...)

- question-answering (answers)

- micro-blogging (twitter)

Page 5

overview of the talk

query-log mining

- query graphs

- query recommendations

yahoo! tips

news recommendations using real-time web

Page 6

query-log mining

Page 7

query-log mining

search engines collect a large amount of query logs

lots of interesting information

- analyzing users' behavior

- creating user profiles and personalization

- creating knowledge bases and folksonomies

- finding similar concepts

- building systems for query recommendations

- using statistics for improving systems' performance

- . . .

Page 9

the click graph

[Craswell and Szummer, 2007]

Page 10

applications of the click graph

[Craswell and Szummer, 2007]

query-to-document search

query-to-query suggestion

document-to-query annotation

document-to-document relevance feedback

Page 11

the query-flow graph

[Boldi et al., 2008]

take into account temporal information

captures the “flow” of how users submit queries

definition:

- nodes V = Q ∪ {s, t}: the distinct set of queries Q, plus a starting state s and a terminal state t

- edges E ⊆ V × V

- weights w(q, q′) representing the probability that q and q′ are part of the same chain

Page 12

building the query-flow graph

an edge (q, q′) if q and q′ are consecutive in at least one session

weights w(q, q′) learned by machine learning

features used

- textual features: cosine similarity, Jaccard coefficient, size of intersection, etc.

- session features: the number of sessions, the average session length, the average number of clicks in the sessions, the average position of the queries in the sessions, etc.

- time-related features: average time difference, etc.
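The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the method of [Boldi et al., 2008]: the learned chain probabilities are replaced by normalized transition counts, and the names build_query_flow_graph, START and END are hypothetical.

```python
from collections import defaultdict

START, END = "<s>", "<t>"  # hypothetical labels for the states s and t


def build_query_flow_graph(sessions):
    """Return {q: {q2: weight}} with edges between consecutive queries.

    Assumption: weights are normalized transition counts, a stand-in
    for the learned probabilities w(q, q') described on the slide.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        path = [START] + list(session) + [END]
        for q, q_next in zip(path, path[1:]):
            counts[q][q_next] += 1
    graph = {}
    for q, successors in counts.items():
        total = sum(successors.values())
        graph[q] = {q2: c / total for q2, c in successors.items()}
    return graph


# toy sessions, made up for illustration
sessions = [
    ["barcelona", "barcelona hotels", "cheap barcelona hotels"],
    ["barcelona", "barcelona weather"],
    ["barcelona hotels", "luxury barcelona hotels"],
]
qfg = build_query_flow_graph(sessions)
print(qfg["barcelona"])
```

A learned model over the textual, session, and time-related features would replace the count-based weights in a real system.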

Page 13

query-flow graph

[figure: fragment of a query-flow graph around the query barcelona; nodes: barcelona, barcelona fc, barcelona fc website, barcelona fc fixtures, real madrid, barcelona weather, barcelona weather online, barcelona hotels, cheap barcelona hotels, luxury barcelona hotels, and the terminal state <T>; edges are labeled with weights between 0.011 and 0.523]

Page 14

query-flow graph

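[figure: query-flow graph over the queries dog, cat, funny cat, funny dog, picture of a cat, picture of a dog, picture of a funny (label truncated), cat and dog, breed of dog, dog for sale, with start node ^ and end node $]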

Page 15

query recommendations

the general theme:

given an input query q

identify similar queries q′

rank them and present them to the user

most query graphs can be used for both tasks: similarity and ranking

Page 17

recommendations using the query-flow graph

[Boldi et al., 2008]

perform a random walk on the query-flow graph

teleportation to the submitted query

teleportation to previous queries to take into account the user history

normalize the PageRank score to remove the bias toward very popular queries
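The random walk with teleportation can be sketched as a plain power iteration. This is a minimal sketch under stated assumptions: the function name, the damping value alpha = 0.85, and the toy graph are hypothetical, and the popularity normalization mentioned above is not included.

```python
def random_walk_with_restart(graph, restart, alpha=0.85, iters=100):
    """Power iteration for a random walk that teleports to `restart`.

    graph:   {node: {successor: transition probability}}
    restart: {node: probability}, the teleportation distribution
             (e.g. all mass on the submitted query)
    alpha:   probability of following an edge (assumed value)
    """
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    nodes |= set(restart)
    score = {v: restart.get(v, 0.0) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) * restart.get(v, 0.0) for v in nodes}
        for u in nodes:
            succ = graph.get(u)
            if not succ:
                # dangling node: send its mass back to the restart set
                for v, p in restart.items():
                    nxt[v] += alpha * score[u] * p
                continue
            for v, p in succ.items():
                nxt[v] += alpha * score[u] * p
        score = nxt
    return score


# toy graph, made up for illustration
graph = {
    "apple": {"apple ipod": 0.6, "apple store": 0.4},
    "apple ipod": {"apple store": 1.0},
    "apple store": {"apple": 1.0},
}
scores = random_walk_with_restart(graph, {"apple": 1.0})
ranked = sorted((q for q in scores if q != "apple"), key=scores.get, reverse=True)
print(ranked)
```

To account for the user history, the restart distribution can spread its mass over the previous queries of the session instead of putting all of it on the submitted query.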

Page 18

example : apple

Max. weight    | sq             | sq            | sq
t              | t              | apple         | apple
apple ipod     | apple          | apple fruit   | apple ipod
apple store    | apple ipod     | apple ipod    | apple trailers
apple trailers | apple store    | apple belgium | apple store
amazon         | apple trailers | eating apple  | apple mac
apple mac      | google         | apple.nl      | apple fruit
itunes         | amazon         | apple monitor | apple usa
pc world       | argos          | apple usa     | apple ipod nano
argos          | itunes         | apple jobs    | apple.com/ipod...

Page 19

example : banana → apple

banana → apple            | banana
banana                    | banana
apple                     | eating bugs
usb no                    | banana holiday
banana cs                 | opening a banana
giant chocolate bar       | banana shoe
where is the seed in anut | fruit banana
banana shoe               | recipe 22 feb 08
fruit banana              | banana jules oliver
banana cloths             | banana cs
eating bugs               | banana cloths

Page 20

example : beatles → apple

beatles → apple        | beatles
beatles                | beatles
apple                  | scarring
apple ipod             | paul mcartney
scarring               | yarns from ireland
srg peppers artwork    | statutory instrument A55
ill get you            | silver beatles tribute band
bashles                | beatles mp3
dundee folk songs      | GHOST'S
the beatles love album | ill get you
place lyrics beatles   | fugees triger finger remix

Page 21

recommendations as shortcuts to qfg

[Anagnostopoulos et al., 2010]

Page 22

the query-recommendation problem

Page 26

the recommendation problem

model user behavior as a random walk on qfg

a user starts at query q0 and follows a path p of reformulations on qfg before terminating

consider a reward function w(q) on the nodes of qfg

goal: “nudge” users in order to maximize their reward

objectives:

1. collect a large reward along the way

2. end the session at a high-reward node

applications: a general problem formulation for suggesting shortcuts (web graph, social networks, etc.)

Page 27

probabilistic model

we can only suggest, not order the user

we do not know how the user will act

random walk on qfg is modeled by stochastic matrix P

recommendations R modify P to P ′ = P + R

Page 28

utility functions

reward function w(q) on queries

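- quality of search results, user satisfaction, dwell time, monetization, etc.

utility function U(p) on paths $p = \langle q_0, \ldots, q_{k-1}, T \rangle$, with two alternatives:

$U(p) = \sum_{q \in p} w(q)$ (Cavafy, "road to Ithaca"): the reward collected along the whole path

$U(p) = w(q_{k-1})$ (Machiavelli, "the end justifies the means"): the reward of the last query before terminating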
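For a given random walk, the expected value of both utility functions can be computed by fixed-point iteration. The sketch below is illustrative only: the function name, the matrix layout, and the two-query example are assumptions, not taken from the talk.

```python
def expected_utilities(P, term, w, iters=200):
    """Expected utility of a random walk started at each query.

    P[i][j]: probability of reformulating query i into query j
    term[i]: probability of terminating after query i
             (assumed: sum(P[i]) + term[i] == 1)
    w[i]:    reward of query i
    Returns (expected sum of rewards, expected reward of the last query).
    """
    n = len(w)
    total = [0.0] * n
    last = [0.0] * n
    for _ in range(iters):
        # sum-of-rewards utility: collect w[i], then continue from the next query
        total = [w[i] + sum(P[i][j] * total[j] for j in range(n)) for i in range(n)]
        # last-query utility: w[i] counts only if the walk terminates at i
        last = [term[i] * w[i] + sum(P[i][j] * last[j] for j in range(n)) for i in range(n)]
    return total, last


# toy chain: q0 moves to q1 with probability 0.5, q1 always terminates
P = [[0.0, 0.5], [0.0, 0.0]]
term = [0.5, 1.0]
w = [1.0, 3.0]
total, last = expected_utilities(P, term, w)
print(total[0], last[0])
```

Comparing these values before and after changing the transition matrix gives one way to judge whether a set of suggested shortcuts increases the expected reward.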

Page 29

utility

[figure: bar chart; x-axis labels: w, ρ, ρw, 1-step heuristic; y-axis: sum of expected values, from 0.0 to 1.2]

Page 30

qfg projections for diverse recommendations

[Bordino et al., 2010]

Page 31

diverse recommendations

[Bordino et al., 2010]

we want not only relevant and high-quality recommendations, but also a diverse set

we want recommendations that take to different "directions" in the qfg

need notions of distance of queries in the qfg

use spectral embeddings

project a graph in a low dimensional space, so that the embedding minimizes total edge distortion

finding diverse recommendations reduces to a geometric problem
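Once queries are embedded as vectors, one simple way to treat the geometric problem is greedy farthest-point selection. The sketch below is a swapped-in heuristic, not necessarily the method of [Bordino et al., 2010]; the function names and the 2-d coordinates are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_diverse(candidates, k):
    """Greedy farthest-point selection over embedded queries.

    candidates: list of (query, vector), assumed sorted by relevance,
                so the first one is always kept
    """
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        rest = [c for c in candidates if c not in chosen]
        # take the candidate whose most similar chosen query is least similar
        best = min(rest, key=lambda c: max(cosine(c[1], s[1]) for s in chosen))
        chosen.append(best)
    return [q for q, _ in chosen]


# hypothetical 2-d embeddings of recommendations for the query "time"
candidates = [
    ("time magazine", (1.0, 0.1)),
    ("new york times", (1.0, 0.0)),
    ("time zone", (0.1, 1.0)),
    ("world time", (0.2, 1.0)),
    ("time warner", (-1.0, -0.8)),
]
picked = pick_diverse(candidates, 3)
print(picked)
```

With these coordinates the selection keeps one query from each of the three "directions" instead of two near-duplicates.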

Page 32

example: time

Spectral projection on 2-hop neighborhood

query: time

                  | time magazine | new york times | time zone | world time | what time is it | time warner | time warner cable
time magazine     |               | 0.9953         | 0.0162    | 0.1422     | 0.1049          | -0.6071     | -0.6056
new york times    | 0.9953        |                | -0.0051   | 0.1248     | 0.0893          | -0.6478     | -0.6462
time zone         | 0.0162        | -0.0051        |           | 0.9903     | 0.9891          | -0.5234     | -0.5254
world time        | 0.1422        | 0.1248         | 0.9903    |            | 0.9970          | -0.6263     | -0.6282
what time is it   | 0.1049        | 0.0893         | 0.9891    | 0.9970     |                 | -0.6244     | -0.6263
time warner       | -0.6071       | -0.6478        | -0.5234   | -0.6263    | -0.6244         |             | 0.9999
time warner cable | -0.6056       | -0.6462        | -0.5254   | -0.6282    | -0.6263         | 0.9999      |

Page 33

improving recommendation for long-tail queries via templates

[Szpektor et al., 2011]

Page 34

motivation

goal: improve coverage of query-recommendation systems

observation: in a typical query log, 50% of the query volume consists of unique queries [Baeza-Yates et al., 2007]

most query-recommendation systems are based on finding queries that co-occur frequently

inherent limitation on using co-occurrences

need methods that can reason about rare, and even previously unseen, queries

Page 35

overview of the approach

1 generate candidate query-templates for each query

Paris hotels → <city> hotels

Paris hotels → <district> hotels

Moscow hotels → <city> hotels

2 infer transitions between templates

<city> hotels → <city> restaurants

3 infer recommendations for rare queries

Yancheng hotels → Yancheng restaurants
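The three steps above can be sketched end to end on toy data. This is a minimal sketch under stated assumptions: the entity dictionary ENTITY_TYPES, the function names, and the count-based transition scores are hypothetical simplifications of the approach.

```python
from collections import Counter

# hypothetical toy entity dictionary; the talk uses a large taxonomy
ENTITY_TYPES = {"paris": "<city>", "moscow": "<city>", "yancheng": "<city>"}


def to_template(query):
    """Step 1: replace the first known entity by its type.

    Returns (template, entity), or (None, None) if no entity is found.
    """
    tokens = query.lower().split()
    for i, tok in enumerate(tokens):
        if tok in ENTITY_TYPES:
            template = tokens[:i] + [ENTITY_TYPES[tok]] + tokens[i + 1:]
            return " ".join(template), tok
    return None, None


def learn_transitions(query_pairs):
    """Step 2: count template transitions supported by query pairs
    that substitute the same entity."""
    counts = Counter()
    for q1, q2 in query_pairs:
        (t1, e1), (t2, e2) = to_template(q1), to_template(q2)
        if t1 and t2 and e1 == e2:
            counts[(t1, t2)] += 1
    return counts


def recommend(query, transitions):
    """Step 3: instantiate the target templates with the query's entity."""
    template, entity = to_template(query)
    if template is None:
        return []
    ranked = sorted(
        ((c, t2) for (t1, t2), c in transitions.items() if t1 == template),
        reverse=True,
    )
    return [t2.replace(ENTITY_TYPES[entity], entity) for _, t2 in ranked]


# toy query-log pairs, made up for illustration
pairs = [
    ("paris hotels", "paris restaurants"),
    ("moscow hotels", "moscow restaurants"),
    ("paris hotels", "paris map"),
]
transitions = learn_transitions(pairs)
print(recommend("Yancheng hotels", transitions))
```

The query "Yancheng hotels" never occurs in the toy log, yet it receives recommendations through the template it shares with the Paris and Moscow queries.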

Page 41

query templates

defined over a hierarchy of entity types

define a global set of templates over the whole query log

do not restrict to specific domains (such as travel, weather, or movies)

examples:

jaguar spare parts → <car> spare parts

name for salt → name for <compound>

a thousand miles notes → <song> notes

Page 43

candidate templates – example

[figure: n-grams of the query (chocolate, cookie, chocolate cookie, recipe) mapped to entity types in the hierarchy: food, dessert, drink, substance, recipe, instruction]

query: chocolate cookie recipe

candidate templates: <food> cookie recipe

<drink> cookie recipe

<food> recipe

<substance> recipe

chocolate cookie <instruction> . . .

Page 46

ranking candidate templates

ambiguity

Jaguar spare parts → <car> spare parts

Jaguar spare parts → <animal> spare parts

focus

name for salt → name for <compound>

name for salt → <description> for salt

right generalization level

Paris hotels → <capital> hotels

Paris hotels → <city> hotels

Paris hotels → <location> hotels

Page 49

construction of query templates – details

hierarchy used: WordNet 3.0 hierarchy and Wikipedia category hierarchy, connected via the YAGO mapping

queries are tokenized, and n-grams are looked up and mapped to entities in the hierarchy

enriched with heuristic generalizations for <email>, <url>, numbers, and noun-phrases not in the taxonomy

Page 50

query-to-template edges

mapping from a query q to its set of templates T(q), viewed as query-to-template edges

associated edge scores

$s_{qt}(q, t) = \alpha^d$

when t is obtained by generalizing q at distance d in the hierarchy H

parameter α set experimentally to 0.9

set $s_{qt}(q, q') = 1$ if (q, q′) is an edge in the query-flow graph

normalize so that all $s_{qt}(q, \cdot)$ sum to 1
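The scoring rule can be illustrated with a short computation. This is a sketch only: the function name and the generalization distances assigned to the three templates are hypothetical, and the extra unit-score edges to query-flow-graph neighbors are left out.

```python
def query_to_template_scores(distances, alpha=0.9):
    """Normalized query-to-template scores proportional to alpha ** d.

    distances: {template: d}, the generalization distance of each
               template from the query in the hierarchy
    """
    raw = {t: alpha ** d for t, d in distances.items()}
    z = sum(raw.values())
    return {t: s / z for t, s in raw.items()}


# hypothetical distances for the query "paris hotels"
scores = query_to_template_scores(
    {"<capital> hotels": 1, "<city> hotels": 2, "<location> hotels": 3}
)
for template, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(template, round(score, 3))
```

Because α < 1, more specific templates (smaller d) receive higher scores than broad generalizations of the same query.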

Page 51

template-to-templates edges

reasoning about transitions between templates

<food> recipe → healthy <food> recipe

for templates (t1, t2), define the support set Sup(t1, t2) of query pairs (q1, q2) such that

- t1 ∈ T(q1) and t2 ∈ T(q2)

- t1 and t2 substitute the same token in q1 and q2 (e.g., dosa recipe and healthy dosa recipe)

define the template-to-template edge score as

$s_{tt}(t_1, t_2) = \sum_{(q_1, q_2) \in \mathrm{Sup}(t_1, t_2)} s_{qq}(q_1, q_2)$

normalize so that all $s_{tt}(t, \cdot)$ sum to 1
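The aggregation over support sets can be sketched as follows. The function name, the data layout, and the numeric scores are assumptions made for illustration; only the sum-then-normalize rule comes from the slide.

```python
from collections import defaultdict


def template_to_template_scores(support, s_qq):
    """Sum query-to-query scores over each support set, then normalize.

    support: {(t1, t2): [(q1, q2), ...]}, the support sets Sup(t1, t2)
    s_qq:    {(q1, q2): score}, query-to-query edge scores
    Returns {t1: {t2: normalized score}}.
    """
    raw = defaultdict(dict)
    for (t1, t2), pairs in support.items():
        raw[t1][t2] = sum(s_qq.get(pair, 0.0) for pair in pairs)
    scores = {}
    for t1, row in raw.items():
        z = sum(row.values())
        if z > 0:
            scores[t1] = {t2: s / z for t2, s in row.items()}
    return scores


# toy support sets and scores, made up for illustration
support = {
    ("<food> recipe", "healthy <food> recipe"): [
        ("dosa recipe", "healthy dosa recipe"),
        ("pizza recipe", "healthy pizza recipe"),
    ],
    ("<food> recipe", "<food> calories"): [("pizza recipe", "pizza calories")],
}
s_qq = {
    ("dosa recipe", "healthy dosa recipe"): 0.2,
    ("pizza recipe", "healthy pizza recipe"): 0.1,
    ("pizza recipe", "pizza calories"): 0.1,
}
s_tt = template_to_template_scores(support, s_qq)
print(s_tt["<food> recipe"])
```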

Page 52

example – ambiguity

consider query transition: jaguar transmission → jaguar spare parts

template transition <car> transmission → <car> spare parts is supported by

bmw transmission → bmw spare parts

audi transmission → audi spare parts

. . .

template transition <animal> transmission → <animal> spare parts will not be supported by

lion transmission → lion spare parts

tiger transmission → tiger spare parts

. . .

Page 54

the query-template flow graph

extension of the query-flow graph

superposition of all the concepts we have seen so far:

set of nodes consists of queries and templates

set of edges consists of

- query to query edges

- query to template edges

- template to template edges

associated weights

Page 55

generating recommendations

[figure: paths from query q to query q′ through templates t1, ..., t4, with edge scores s1, ..., s7]

$r(q, q') = s_1 s_4 + s_2 s_5 + s_3 s_6 + s_3 s_7$

interpretation: probability of a feasible path

dashed edges in the figure are not stored in the graph; they are discovered on the fly

queries q and q′ may not have been seen before

transitions in the query-flow graph ranked first
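The recommendation score can be read as a sum, over feasible template paths, of products of edge scores. The sketch below assumes each path has one query-to-template edge and one template-to-template edge; the exact structure of the figure is not recoverable, and all names and numbers are hypothetical.

```python
def recommendation_score(s_qt, s_tt, instantiates):
    """Sum of score products over feasible paths q -> t1 -> t2 -> q'.

    s_qt:         {t1: score}, query-to-template scores of q
    s_tt:         {t1: {t2: score}}, template-to-template scores
    instantiates: set of templates t2 that yield q' when filled
                  with the entity of q
    """
    return sum(
        s1 * s2
        for t1, s1 in s_qt.items()
        for t2, s2 in s_tt.get(t1, {}).items()
        if t2 in instantiates
    )


# toy scores chosen to mirror the shape s1*s4 + s2*s5 + s3*s6 + s3*s7
s_qt = {"t1": 0.5, "t2": 0.3, "t3": 0.2}
s_tt = {"t1": {"u1": 0.4}, "t2": {"u1": 0.5}, "t3": {"u1": 0.1, "u2": 0.9}}
r = recommendation_score(s_qt, s_tt, {"u1", "u2"})
print(r)
```

Since both score families are normalized to sum to 1, each product can be interpreted as the probability of following that path.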

Page 56

methodology

methods:

query-template flow graph

query-flow graph

evaluation:

inspection of a sample of the results

editorial evaluation

automated evaluation

Page 57

training dataset

               | queries             | templates
# nodes        | 95 279 132          | 5 382 051 983
# edges        | 83 513 590          | 4 345 497 267
avg degree     | 0.88                | 0.81
max out-degree | 14 145 (craigslist) | 34 249 (<album>)
max in-degree  | 14 317 (youtube)    | 133 874 (<institution>)

Page 58

anecdotal evidence

{“guangzhou flights”, “guangzhou map”}
<capital> flights → <capital> map

{“a thousand miles notes”, “a thousand miles piano notes”}
<single> notes → <single> piano notes

{“8 week old weimaraner”, “8 week old weimaraner puppy”}
8 week old <breed> → 8 week old <breed> puppy

{“aaa office twin falls idaho”, “aaa twin falls idaho”}
aaa office <city> → aaa <city>

{“air force titles”, “air force ranks”}
<military service> titles → <military service> ranks

{“name for salt”, “chemical name for salt”}
name for <compound> → chemical name for <compound>

Page 59

editorial evaluation

set-A: 300 pairs from each configuration, recommendation in the top-10

set-B: 100 pairs, same queries in each configuration, same position

set-C: 100 pairs for which query-flow graph has no recommendation

editors labeled query-recommendation pairs as: relevant, not relevant, cannot tell

two editors, 100 common queries, kappa-statistic 0.37

      | qfg    | qtfg
set-A | 98.48% | 97.84%
set-B | 97.65% | 98.86%
set-C | -      | 94.38%

Page 60

automated evaluation – guiding principle

extract query pairs {qi, qi+1} from a testing dataset, such that the user submitted qi+1 after qi in the same session

measure if qi+1 is predicted by our methods, and in which position

assumption: qi+1 should be relevant and useful for qi
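The guiding principle translates into a small evaluation loop. This is a sketch under stated assumptions: the function name and report keys are hypothetical, coverage is taken to mean that qi+1 appears anywhere in the ranked list, and the mean reciprocal rank is used because it coincides with average precision when each pair has a single relevant query.

```python
def evaluate(test_pairs, recommend, ks=(1, 10, 100)):
    """Score a recommender on (q_i, q_next) pairs from held-out sessions.

    test_pairs: list of (q_i, q_next)
    recommend:  function mapping a query to a ranked list of queries
    """
    n = len(test_pairs)
    covered = 0
    hits = {k: 0 for k in ks}
    reciprocal = 0.0
    for q, q_next in test_pairs:
        ranked = recommend(q)
        if q_next not in ranked:
            continue
        covered += 1
        pos = ranked.index(q_next) + 1  # 1-based position of the true next query
        reciprocal += 1.0 / pos
        for k in ks:
            if pos <= k:
                hits[k] += 1
    report = {"coverage": covered / n, "mean_reciprocal_rank": reciprocal / n}
    for k in ks:
        report["top-%d" % k] = hits[k] / n
    return report


# toy recommender and test pairs, made up for illustration
recs = {"a": ["b", "c"], "x": ["y"]}
pairs = [("a", "c"), ("x", "y"), ("z", "w")]
report = evaluate(pairs, lambda q: recs.get(q, []))
print(report)
```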

Page 61

results

pair occurrences

              | qfg     | qtfg    | relative increase
total pairs   | 3134388 | 3134388 |
coverage      | 22.65 % | 28.17 % | 24.37 %
# in top-100  | 16.97 % | 25.49 % | 50.23 %
# in top-10   | 9.49 %  | 20.74 % | 118.49 %
# in top-1    | 2.86 %  | 10.01 % | 249.5 %
MAP           | 0.050   | 0.137   |
avg. position | 18.35   | 8.3     |

unique pairs

              | qfg     | qtfg    | relative increase
total pairs   | 2755922 | 2755922 |
coverage      | 13.28 % | 19.38 % | 45.87 %
# in top-100  | 12.06 % | 17.25 % | 42.96 %
# in top-10   | 8.41 %  | 13.52 % | 60.68 %
# in top-1    | 2.86 %  | 6.5 %   | 127.32 %
MAP           | 0.047   | 0.089   |
avg. position | 12.33   | 9.43    |

Page 62

results

[figure: percentage of test pairs in the top-10 (y-axis, 0 to 20) as a function of query length in words (x-axis, 2 to 16), for QFG and QTFG]

Page 63

conclusions

improve coverage of query recommendation systems

recommendations for rare or previously unseen queries

well suited for tail queries

complements rather than replaces existing methods

future work: improve quality of extracted templates

Page 64

yahoo! tips

[Weber et al., 2011]

Page 65

motivation

provide answers, not links

identify “how to” queries and provide tips

tip: piece of advice that is

1 short

2 concrete

3 self-contained

4 non-obvious

Page 66

yahoo! tips

Page 70

extract tips from yahoo! answers

tip: To tell if your eggs are fresh : place eggs in a bowl/glass of water ... if it floats it's bad. if it sinks it's good.

Page 71

system diagram

[figure: system diagram with an offline and an online part]

offline: rule-based extraction produces 250k candidate tips; quality labels are obtained for 20k candidate tips using CrowdFlower; machine learning selects 22k high quality tips

online, for an incoming query (e.g., zest lime without zester):

1 does the query have how-to intent? (machine learning) if no, show normal search results

2 are there relevant high quality tips? if no, show normal search results

3 rank the matching tips and display the highest ranking one

TIP: To zest a lime if you don't have a zester : use a cheese grater

Page 72

mining tips from yahoo! answers

consider tips of a specific structure: “X : Y ”

X : goal of the tip

Y : action of the tip

examples

To get the mildew smell out of your towels : try soaking it in a salt water solution, then washing with soap and cold water, that tends to get rid of smells

To style your hair without heat, gel or straighteners : try coconut oil mark k

Page 73

mining tips from yahoo! answers

english

only literal “how to” queries

answer should start with a verb

consider only best answers

replace I, my, me, myself, etc. with you, your, you, yourself, etc.
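The extraction rules above can be sketched with regular expressions. This is a minimal sketch, not the system's actual code: the function name make_tip and the PRONOUNS table are hypothetical, and the checks for English, for a leading verb, and for best answers are omitted.

```python
import re

# first-person to second-person replacements listed on the slide
PRONOUNS = {"i": "you", "my": "your", "me": "you", "myself": "yourself"}


def to_second_person(text):
    """Rewrite first-person pronouns as second-person ones."""
    return re.sub(
        r"\b(i|my|me|myself)\b",
        lambda m: PRONOUNS[m.group(0).lower()],
        text,
        flags=re.IGNORECASE,
    )


def make_tip(question, best_answer):
    """Turn a literal "how to" question and its best answer into "X : Y".

    Returns None when the question is not a literal "how to" question.
    """
    match = re.match(r"how to\s+(.+?)\??$", question.strip(), re.IGNORECASE)
    if match is None:
        return None
    goal = to_second_person(match.group(1))
    action = to_second_person(best_answer.strip())
    return "To %s : %s" % (goal, action)


tip = make_tip("How to zest a lime if I don't have a zester?", "use a cheese grater")
print(tip)
```

A fuller version would also need a part-of-speech check to confirm that the answer starts with a verb.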

Page 74

quality filtering

generated 249 675 tips

manually label 20 000 using CrowdFlower

classes: very good (25%), ok (48%), bad (27%)

algorithms

- svm (rbf)

- decision trees

- k-nn (Euclidean, k = 21 . . . 50)

feature families:

- 18 handcrafted features: e.g., style (Flesch-Kincaid reading level), sentiment, # urls, emoticons, etc.

- content: SVD on the tip×term matrix

Page 77

quality filtering — machine learning results

     | Method        | handcrafted features | content features | both
Hard | SVM           | 0.63/0.13            | 0.60/0.09        | 0.63/0.16
Hard | Decision Tree | 0.67/0.07            | 0.61/0.06        | 0.66/0.13
Hard | k-NN          | 0.62/0.23            | 0.56/0.11        | 0.63/0.11
Soft | SVM           | 0.95/0.11            | 0.93/0.05        | 0.95/0.08
Soft | Decision Tree | 0.95/0.03            | 0.92/0.03        | 0.94/0.06
Soft | k-NN          | 0.94/0.11            | 0.91/0.05        | 0.94/0.05

Page 78

quality filtering — machine learning results

Category               | P,R       | VG   | size
Beauty & Style         | 0.53,0.08 | 0.16 | 0.08
Business & Finance     | 0.57,0.20 | 0.20 | 0.03
Cars & Transportation  | 0.64,0.12 | 0.23 | 0.03
Computers & Internet   | 0.69,0.33 | 0.45 | 0.15
Consumer Electronics   | 0.70,0.23 | 0.38 | 0.06
Entertainment & Music  | 0.60,0.39 | 0.15 | 0.05
Family & Relationships | 0.35,0.05 | 0.06 | 0.14
Games & Recreation     | 0.61,0.31 | 0.24 | 0.04
Health                 | 0.62,0.07 | 0.15 | 0.09
Home & Garden          | 0.43,0.06 | 0.27 | 0.04
Society & Culture      | 0.50,0.19 | 0.09 | 0.03
Sports                 | 0.68,0.24 | 0.19 | 0.03
Yahoo! Products        | 0.73,0.43 | 0.45 | 0.07

Page 79

detecting “how to” queries

how many? 2-3% of volume, 3-4% of distinct queries

start with “how to”, “how do i”, or “how can i”

- example: how do you fix keys on a laptop (P: 96-99%, cover: 1.0%)

queries start with an action verb

- example: play my music on tool bar raido (P: 7-14%, cover: 3.2%)

if exists “how to X” then “X”

- example: craft ideas for boys (P: 87-94%, cover: 1.1%)

incoming queries to “how to” web sites

- example: fixing a wet cell phone (P: 61-75%, cover: 0.08%)
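Two of the rules above, the literal prefix rule and the "if how to X exists then X" rule, can be sketched directly. The names how_to_rule and goals_from_log and the toy log are hypothetical; the action-verb rule and the rule based on traffic to how-to web sites need extra resources (a tagger, click data) and are left out.

```python
HOW_TO_PREFIXES = ("how to ", "how do i ", "how can i ")


def goals_from_log(queries):
    """Collect X from every logged query of the form "how to X"."""
    return frozenset(
        q.lower()[len("how to "):].strip()
        for q in queries
        if q.lower().startswith("how to ")
    )


def how_to_rule(query, known_how_to_goals=frozenset()):
    """Return the name of the first matching rule, or None.

    known_how_to_goals: set of strings X for which the query
    "how to X" was seen in the log (see goals_from_log)
    """
    q = " ".join(query.lower().split())  # lowercase, collapse whitespace
    if q.startswith(HOW_TO_PREFIXES):
        return "prefix"
    if q in known_how_to_goals:
        return "seen-as-how-to-X"
    return None


# toy query log, made up for illustration
goals = goals_from_log(["how to fix a wet cell phone", "barcelona weather"])
print(how_to_rule("How do I fix keys on a laptop", goals))
print(how_to_rule("fix a wet cell phone", goals))
print(how_to_rule("barcelona weather", goals))
```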

Page 83

detecting “how to” queries

how many? 2-3% of volume, 3-4% of distinct queries

start with “how to” “how do i” or “how can i”

how do you fix keys on a laptopP: 96-99%, cover: 1.0%

queries start with an action verb

play my music on tool bar raidoP: 7-14%, cover: 3.2%

if exists “how to X” then “X”

craft ideas for boysP: 87-94%, cover: 1.1%

incoming queries to “how to” web sites

fixing a wet cell phoneP: 61-75%, cover: 0.08%

yandex aug 31, 2012

Page 84: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

matching queries to tips

precision–recall trade-off

index only the “goal” or also “action”
use AND or OR mode for query
require minimum “span” for the goal

ranking

rank by number of query tokens in goal, then tf·idf
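The matching and ranking choices above can be sketched as follows. This is a toy illustration, not the system from the talk: it indexes only the goal, and it assumes "span" means the fraction of goal tokens that are covered by the query (the slide does not define it). The function names and the tf·idf variant are likewise hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank_tips(query, goals, mode="AND", min_span=0.5):
    """Return indices of matching goals, best first.

    Assumption: span = fraction of goal tokens that also occur in the query.
    """
    q = set(tokenize(query))
    docs = [tokenize(g) for g in goals]
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    n = len(docs)
    scored = []
    for i, d in enumerate(docs):
        if not d:
            continue
        hits = q & set(d)
        if mode == "AND" and hits != q:   # every query token must be in the goal
            continue
        if mode == "OR" and not hits:     # at least one query token must be in the goal
            continue
        span = sum(1 for t in d if t in q) / len(d)
        if span < min_span:
            continue
        tf = Counter(d)
        tfidf = sum(tf[t] * math.log(n / df[t]) for t in hits)
        scored.append((len(hits), tfidf, i))
    # rank by number of query tokens in the goal, then by tf-idf
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [i for _, _, i in scored]
```

AND mode with a high minimum span returns few, precise matches; OR mode with a low span returns many loose ones, which is the precision–recall trade-off the slide refers to.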

yandex aug 31, 2012

Page 85: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

matching queries to tips — evaluation

mode  min span  vol.    dist.   P@1         median
AND   .50       8.7%    2.7%    .428/.680   1
AND   .66       6.8%    1.8%    .557/.770   1
AND   1.0       4.4%    0.8%    .625/.835   1
OR    .50       87.4%   88.4%   .048/.110   18
OR    .66       36.8%   36.3%   .092/.200   2
OR    1.0       13.5%   10.3%   .160/.300   1

yandex aug 31, 2012

Page 86: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

future work

mine tips from other resources

twitter
wikitravel

improve quality of existing system

incorporating more features
improving rule extraction
classification

yandex aug 31, 2012

Page 87: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

information dissemination in social networks

yandex aug 31, 2012

Page 88: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

the information dissemination spectrum

news sites
content-provider sites
editorially curated
users browse
no specific info need

web search
url, images, music, ...
clear intent

social media (twitter, facebook)
recommendations (content- or context- or geo-aware)
user-generated content (blogs, images, q/a)

yandex aug 31, 2012

Page 91: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

social media

yandex aug 31, 2012

Page 92: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

the information overload problem

yandex aug 31, 2012

Page 93: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

social media and user-generated content

paradigm shift from a broadcast one-to-many mechanism to a many-to-many model

users in the role of information producers

yandex aug 31, 2012

Page 94: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

benefits and opportunities

wealth of information of extreme volume and diversity

wisdom of crowd phenomena

accurate profiling and personalization (toolbar, search, clicks)

content and context information available

social and geo information available

yandex aug 31, 2012

Page 95: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

challenges

heterogeneous sources

high variability in quality

needle-in-the-haystack problems

we want to:

support users to seek, filter, and disseminate information

build efficient platforms that support social-media functionalities

yandex aug 31, 2012

Page 97: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

personalized news recommendations by harnessing the real-time web

[De Francisci Morales et al., 2012]

yandex aug 31, 2012

Page 98: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

overview

a news recommendation system based on the real-time web, e.g., twitter

suggest news articles to twitter users

infer user preferences from twitter activity

yandex aug 31, 2012

Page 99: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

yahoo! news

yandex aug 31, 2012

Page 102: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

sources characteristics

news stream

+ high coverage

− sparse and noisy data for user profiling

− latency on collecting user feedback

twitter stream

+ much more accurate personalization

+ news spread very fast

yandex aug 31, 2012

Page 103: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

[poster] From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendation

poster sections: Overview, Motivation, Problem, Model, Method, Results

[diagram labels: Entities, News, Tweets; tweets of the user, tweets of the followees, tweets from twitter, and news articles feed the T.Rex user model, which outputs a personalized ranked list of news articles]

Table 5.2: MRR, precision and coverage.

Algorithm    MRR     P@1     P@5     P@10    Coverage
RECENCY      0.020   0.002   0.018   0.036   1.000
CLICKCOUNT   0.059   0.024   0.086   0.135   1.000
SOCIAL       0.017   0.002   0.018   0.036   0.606
CONTENT      0.107   0.029   0.171   0.286   0.158
POPULARITY   0.008   0.003   0.005   0.012   1.000
T.REX        0.107   0.073   0.130   0.168   1.000
T.REX+       0.109   0.062   0.146   0.189   1.000

RECENCY: it ranks news articles by time of publication (most recent first);
CLICKCOUNT: it ranks news articles by click count (highest count first);
SOCIAL: it ranks news articles by using T.REX with β = γ = 0;
CONTENT: it ranks news articles by using T.REX with α = γ = 0;
POPULARITY: it ranks news articles by using T.REX with α = β = 0.
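For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute MRR, P@k, and coverage for click prediction, where each test event has exactly one clicked article. The averaging convention (over all events, with uncovered events scoring zero) and the function names are assumptions; the source does not spell out its exact evaluation protocol.

```python
def reciprocal_rank(ranked, clicked):
    """1/rank of the clicked item (1-based), or 0.0 if it is not in the list."""
    for pos, item in enumerate(ranked, start=1):
        if item == clicked:
            return 1.0 / pos
    return 0.0


def hit_at_k(ranked, clicked, k):
    """1.0 if the clicked item is among the top k, else 0.0."""
    return 1.0 if clicked in ranked[:k] else 0.0


def evaluate(events, k=10):
    """events: list of (ranked_list, clicked_item).

    An empty ranked list means the strategy produced no recommendation
    for that event (it is counted as not covered).
    """
    n = len(events)
    return {
        "MRR": sum(reciprocal_rank(r, c) for r, c in events) / n,
        "P@k": sum(hit_at_k(r, c, k) for r, c in events) / n,
        "coverage": sum(1 for r, _ in events if r) / n,
    }
```

Under this reading, a P@10 of 0.189 means the clicked article was ranked in the top 10 for about 19% of the events.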

5.6.5 Results

We report MRR, precision and coverage results in Table 5.2. The two variants of our system, T.REX and T.REX+, have the best results overall.

T.REX+ has the highest MRR of all the alternatives. This result means that our model has a good overall performance across the dataset. CONTENT also has a very high MRR. Unfortunately, the coverage level achieved by the CONTENT strategy is very low. This issue is mainly caused by the sparsity of the user profiles. It is well known that most twitter users belong to the “silent majority” and do not tweet very much.

The SOCIAL strategy is affected by the same problem, albeit to a much lesser extent. The reason for this difference is that SOCIAL draws from a large social neighborhood of user profiles, instead of just one, so it has more chances to provide a recommendation. The quality of the recommendation is however quite low, probably because the social-based profile alone is not able to capture the specific user interests.

It is worth noting that in almost 20% of the cases T.REX+ was able to rank the clicked news in the top 10 results. Ranking by the CLICKCOUNT ...

Mean Reciprocal Rank, Precision and Coverage.

[figure: Average Discounted Cumulative Gain. x-axis: Rank (1 to 20); y-axis: Average DCG (0 to 14); series: T.Rex+, T.Rex, Popularity, Content, Social, Recency, Click count]

T.Rex
T.Rex builds user profiles from twitter.
Parameters learned from click data in the Yahoo! toolbar log.
Uses support vector machines and learns a ranking function.
Heuristically identified a group of [number garbled] twitter users in the toolbar and used their clicks to train and test the system.

What
T.Rex is a new methodology for recommending interesting news to users by exploiting the information in their twitter persona.

Content Model Γ
Γ = A · M, where Γ(i, j) is the content relevance of news n_j for user u_i.

Social Model Σ
Σ = S∗ · A · M, where Σ(i, j) is the social relevance of news n_j for user u_i.

Popularity Model Π
Π = Z · N, where Π(j) is the popularity of news article n_j.
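The three models are plain matrix products. The sketch below composes them on toy data, reading the poster's formulas as Γ = A · M, Σ = S∗ · A · M, Π = Z · N, with the tweet-to-news matrix M = T · N. All matrix values are hypothetical, and the nested-list `matmul` helper is only for illustration; a real system would use sparse matrices.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Toy data (all values hypothetical): 2 users, 3 tweets, 2 entities, 2 news articles.
A = [[1, 1, 0], [0, 0, 1]]          # A[i][j] = 1 if user i is the author of tweet j
T = [[1, 0], [0, 1], [0, 1]]        # relatedness of tweet i to entity j
N = [[1, 0], [0, 1]]                # relatedness of entity i to news j
S_star = [[0.0, 0.5], [0.0, 0.0]]   # interest of user i in content produced by user j
Z = [[0.2, 0.8]]                    # popularity of each entity (row vector)

M = matmul(T, N)                        # tweet-to-news relatedness
Gamma = matmul(A, M)                    # content relevance of news j for user i
Sigma = matmul(matmul(S_star, A), M)    # social relevance of news j for user i
Pi = matmul(Z, N)[0]                    # popularity of news j
```

In this toy example user 0 has no content signal from user 1's tweet, but receives a social score of 0.5 for news 1 because user 0 is interested in user 1, whose tweet relates to that article.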

[...] in updating the popularity counts is to take into account recency: new entities of interest should dominate the popularity counts of older entities. In this work, we choose to update the popularity counts using an exponential decay rule. We discuss the details in Section 5.3.1. However, note that the popularity update is independent of our recommendation model, and any other decaying function can be used.
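One simple way to realize an exponentially decayed popularity count is sketched below. The exact rule of Section 5.3.1 is not reproduced in this excerpt, so the half-life parameterization, the lazy update scheme, and the class name are assumptions made for illustration.

```python
import math


class DecayedCounter:
    """Per-entity mention counts that decay exponentially with time (a sketch)."""

    def __init__(self, half_life):
        # Assumption: decay is parameterized by a half-life in the same time unit as t.
        self.rate = math.log(2) / half_life
        self.score = {}   # entity -> decayed count at the time of its last update
        self.last = {}    # entity -> time of the last update

    def value(self, entity, t):
        """Decayed count of the entity at time t."""
        if entity not in self.score:
            return 0.0
        return self.score[entity] * math.exp(-self.rate * (t - self.last[entity]))

    def mention(self, entity, t, weight=1.0):
        """Register a mention at time t: decay the old count, then add the new one."""
        self.score[entity] = self.value(entity, t) + weight
        self.last[entity] = t
```

Counts are decayed lazily, only when an entity is read or mentioned, so an update touches a single entry regardless of how many entities are tracked.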

Finally, we propose a ranking function for recommending news articles to users. The ranking function is a linear combination of the scoring components described above. We plan to investigate the effect of non-linear combinations in the future.

Definition 10 (Recommendation ranking Rτ(u, n)). Given the components Στ, Γτ and Πτ, resulting from a stream of news N and a stream of tweets T authored by users U up to time τ, the recommendation score of a news article n ∈ N for a user u ∈ U at time τ is defined as

Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n),

where α, β, γ are coefficients that specify the relative weight of the components.

At any given time, the recommender system produces a set of news recommendations by ranking a set of candidate news, e.g., the most recent ones, according to the ranking function R. To motivate the proposed ranking function we note similarities with popular recommendation techniques. When β = γ = 0, the ranking function R resembles collaborative filtering, where user similarity is computed on the basis of their social circles. When α = γ = 0, the function R implements a content-based recommender system, where a user is profiled by the bag-of-entities occurring in the tweets of the user. Finally, when α = β = 0, the most popular items are recommended, regardless of the user profile.

Note that Σ, Γ, Π and R are all time dependent. At any given time τ the social network and the set of authored tweets vary, thus affecting Σ and Γ. More importantly, some entities may abruptly become popular, hence of interest to many users. This dependency is captured by Π. While the changes in Σ and Γ derive directly from the tweet stream T and the social network S, the update of Π is non-trivial, and plays a fundamental role in the recommendation system that we describe in the next section.
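The ranking function of Definition 10 is a weighted sum, so a direct sketch is short. The dictionary-based score tables and the function names below are hypothetical; only the formula R(u, n) = α · Σ(u, n) + β · Γ(u, n) + γ · Π(n) comes from the text.

```python
def score(u, n, social, content, popularity, alpha, beta, gamma):
    """R(u, n) = alpha * Sigma(u, n) + beta * Gamma(u, n) + gamma * Pi(n)."""
    return (alpha * social.get((u, n), 0.0)
            + beta * content.get((u, n), 0.0)
            + gamma * popularity.get(n, 0.0))


def top_k(u, candidates, social, content, popularity, alpha, beta, gamma, k=10):
    """Rank candidate news for user u by R and keep the k best."""
    ranked = sorted(
        candidates,
        key=lambda n: score(u, n, social, content, popularity, alpha, beta, gamma),
        reverse=True,
    )
    return ranked[:k]
```

Setting two of the three weights to zero recovers the special cases discussed above: only α gives the social strategy, only β the content strategy, and only γ the popularity strategy.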

Recommendation Model R

T.Rex+
System trained with additional features:
- entity hotness (raw number of mentions in news and twitter)
- news click count
- news article age

Given: N = news stream, T = tweet stream, U = set of users
Find the top-k most relevant news for user u at time τ.

Why Twitter?
Timeliness and personalization. News becomes stale very fast and spreads faster on twitter. Twitter is a good predictor of interest.

How
T.Rex uses a mix of signals to model relevance of news articles for users: the profile of the social neighborhood of the users, the content of their tweet stream, and topic popularity in the news and across twitter.

Results
T.Rex is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles.

Data
News: [number garbled] articles from Yahoo! news
Twitter: [number garbled] month of crawled tweets
Clicks: users of twitter in Yahoo! toolbar logs

Evaluation
We evaluate T.Rex as a click prediction system. We train our model using a learning-to-rank approach and support vector machines. The train and test sets are drawn from click logs.

Authors: Claudio Lucchese, Gianmarco De Francisci Morales, Aristides Gionis

Overwhelmed by information overload! Find interesting stories in an ocean of online news articles.

[figure: News-click delay distribution. x-axis: news-click delay in minutes (ticks 1, 10, 100, 1000, 10000); y-axis: number of occurrences (0 to 45)]

[figure: Osama Bin Laden trends. x-axis: time (May-01 h20 to May-03 h08); y-axis: normalized number of occurrences; series: news, twitter, clicks]

[figure: Joplin tornado trends. x-axis: time (May-22 h00 to May-26 h00); y-axis: normalized number of occurrences; series: news, twitter, clicks]

A = authorship matrix (|U| × |T|): A(i, j) = 1 if u_i is the author of tweet t_j

S = social matrix (|U| × |U|): S(i, j) = 1 if u_i is interested in the content produced by u_j

[...] in N according to a user-dependent relevance criterion. We also aim at incorporating time recency into our model, so that our recommendations favor the most recently published news articles.

We now proceed to model the factors that affect the relevance of news for a given user. We first model the social-network aspect. In our case, the social component is induced by the twitter following relationship. We define S to be the social network adjacency matrix, where S(i, j) is equal to 1 divided by the number of users followed by user u_i if u_i follows u_j, and 0 otherwise. We also adopt a functional ranking (Baeza-Yates et al., 2006) that spreads the interests of a user among its neighbors recursively. By limiting the maximum hop distance d, we define the social influence in a network as follows.

Definition 4 (Social influence S∗). Given a set of users U = {u_0, u_1, ...}, organized in a social network where each user may express an interest in the content published by another user, we define the social influence model S∗ as the |U| × |U| matrix where S∗(i, j) measures the interest of user u_i in the content generated by user u_j and it is computed as

\[ S^{*} = \left( \sum_{i=1}^{d} \sigma^{i} S^{i} \right), \]

where S is the row-normalized adjacency matrix of the social network, d is the maximum hop-distance up to which users may influence their neighbors, and σ is a damping factor.
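Definition 4 can be computed by accumulating damped powers of the row-normalized adjacency matrix. The sketch below does this with nested lists for clarity; the function names and the toy graph are hypothetical, and a real deployment would need sparse matrices to handle a large social network.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def row_normalize(adj):
    """S(i, j) = 1 / (number of users followed by i) if i follows j, else 0."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0 for v in row])
    return out


def social_influence(S, d, sigma):
    """S* = sum over i = 1..d of sigma^i * S^i."""
    n = len(S)
    result = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in S]   # current power of S, starting at S^1
    weight = sigma                  # current damping weight, starting at sigma^1
    for _ in range(d):
        for a in range(n):
            for b in range(n):
                result[a][b] += weight * power[a][b]
        power = matmul(power, S)
        weight *= sigma
    return result
```

For example, if user 0 follows user 1 and user 1 follows user 2, then with d = 2 and σ = 0.5 user 0 has influence weight 0.5 toward user 1 and 0.25 toward user 2 (a two-hop path).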

Next we model the profile of a user based on the content that the user has generated. We first define a binary authorship matrix A to capture the relationship between users and the tweets they produce.

Definition 5 (Tweet authorship A). Let A be a |U| × |T| matrix where A(i, j) is 1 if u_i is the author of t_j, and 0 otherwise.

The matrix A can be extended to deal with different types of relationships between users and posts, e.g., to weigh re-tweets or likes differently. In this work, we limit the concept of authorship to the posts actually written by the user.

social interest
S∗(i, j): level of interest of u_i in the content produced by u_j.

Z = entity space onto which T and N are mapped. We use Wikipedia pages as our entity space.

Z = popularity vector: Z(i) = popularity of entity z_i. Updated by tracking mentions in news and twitter with exponential decay.

M = tweet-to-news matrix (|T| × |N|): M(i, j) = relatedness of tweet t_i to news n_j
M = T · N

T = tweet matrix (|T| × |Z|): T(i, j) = relatedness of tweet t_i to entity z_j

N = news matrix (|Z| × |N|): N(i, j) = relatedness of entity z_i to news n_j

yandex aug 31, 2012

Page 106: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

challenges

scale to large volumes of news and tweets

high dynamicity of news and tweets

news have short life-cycle

twitter users use jargon language

find the right degree of personalization

cope with inactive twitter users

yandex aug 31, 2012

Page 107: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

relate users, tweets, and news articles

yandex aug 31, 2012

Page 108: Арис Гионис «Методы анализа поведения пользователей и его применение в веб-поиске и рекомендации

T.rex architecture

Entities

News

Tweets

From Chatter to Headlines:Harnessing the Real-Time Web

for Personalized News Recommendation

Overview Motivation Problem

Model Method Results

tweetsUser

tweetsFollowee

tweetsFollowee

tweetsFollowee

tweetstwitter

articlesnews

T.Rex

User Model

!

"

#

Personalized ranked list of news articles

Table 5.2: MRR, precision and coverage.

Algorithm MRR P@1 P@5 P@10 CoverageRECENCY 0.020 0.002 0.018 0.036 1.000CLICKCOUNT 0.059 0.024 0.086 0.135 1.000SOCIAL 0.017 0.002 0.018 0.036 0.606CONTENT 0.107 0.029 0.171 0.286 0.158POPULARITY 0.008 0.003 0.005 0.012 1.000T.REX 0.107 0.073 0.130 0.168 1.000T.REX+ 0.109 0.062 0.146 0.189 1.000

RECENCY: it ranks news articles by time of publication (most recent first);CLICKCOUNT: it ranks news articles by click count (highest count first);SOCIAL: it ranks news articles by using T.REX with β = γ = 0;CONTENT: it ranks news articles by using T.REX with α = γ = 0;POPULARITY: it ranks news articles by using T.REX with α = β = 0.

5.6.5 Results

We report MRR, precision and coverage results in Table 5.6.3. The twovariants of our system, T.REX and T.REX+, have the best results overall.

T.REX+ has the highest MRR of all the alternatives. This result meansthat our model has a good overall performance across the dataset. CON-TENT has also a very high MRR. Unfortunately, the coverage level achievedby the CONTENT strategy is very low. This issue is mainly caused by thesparsity of the user profiles. It is well know that most of twitter usersbelong to the “silent majority,” and do not tweet very much.

The SOCIAL strategy is affected by the same problem, albeit to a muchlesser extent. The reason for this difference is that SOCIAL draws froma large social neighborhood of user profiles, instead of just one. So ithas more chances to provide a recommendation. The quality of the rec-ommendation is however quite low, probably because the social-basedprofile only is not able to catch the specific user interests.

It is worth noting that in almost 20% of the cases T.REX+ was able torank the clicked news in the top 10 results. Ranking by the CLICKCOUNT

124

!"#$%&"'()*+'#,%&#$-.%/*"'(0(+$%#$1%2+3"*#4"5

0

2

4

6

8

10

12

14

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Ave

rage D

CG

Rank

T.Rex+T.Rex

PopularityContent

SocialRecency

Click count

63"*#4"%7(0'+8$9"1%28:8,#9(3"%;#($5

T.Rex!"#$%%<8(,10%80"*%)*+=,"0%>*+:%9?(99"*5/#*#:"9"*0%,"#*$"1%>*+:%',('-%1#9#%($%9@"%A#@++B%9++,<#*%,+45C0"0%08))+*9%3"'9+*%:#'@($"0%#$1%,"#*$0%#%*#$-($4%>8$'9(+$5D"8*(09('#,,E%(1"$9(="1%#%4*+8)%+>%FGHI%9?(99"*%80"*0%($%9@"%9++,<#*%#$1%80"1%9@"(*%',('-0%9+%9*#($%#$1%9"09%9@"%0E09":5

What!"#$%%(0%#%$"?%:"9@+1+,+4E%>+*%*"'+::"$1($4%($9"*"09($4%$"?0%9+%80"*0%<E%"J),+(9($4%9@"%($>+*:#9(+$%($%9@"(*%9?(99"*%)"*0+$#5

Content Model Γ&'(')'*'+%?@"*"%&,-./0%(0%9@"%'+$9"$9%*","3#$'"%+>%$"?0%1/'>+*%80"*%2-5

Social Model Σ!3'('45'*')'*'+%?@"*"%3,-./0%(0%9@"%0+'(#,%*","3#$'"%+>%$"?0%1/'>+*%80"*%2-5

Popularity Model Π6'('7'*'8%?@"*"'6,/0%(0%9@"%)+)8,#*(9E%+>%$"?0%#*9(',"%1/5

in updating the popularity counts is to take into account recency: newentities of interest should dominate the popularity counts of older enti-ties. In this work, we choose to update the popularity counts using anexponential decay rule. We discuss the details in Section 5.3.1. However,note that the popularity update is independent of our recommendationmodel, and any other decaying function can be used.

Finally, we propose a ranking function for recommending news arti-cles to users. The ranking function is linear combination of the scoringcomponents described above. We plan to investigate the effect of non-linear combinations in the future.

Definition 10 (Recommendation ranking Rτ(u, n)). Given the components Στ, Γτ and Πτ, resulting from a stream of news N and a stream of tweets T authored by users U up to time τ, the recommendation score of a news article n ∈ N for a user u ∈ U at time τ is defined as

Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n),

where α, β, γ are coefficients that specify the relative weight of the components.

At any given time, the recommender system produces a set of news recommendations by ranking a set of candidate news, e.g., the most recent ones, according to the ranking function R. To motivate the proposed ranking function we note similarities with popular recommendation techniques. When β = γ = 0, the ranking function R resembles collaborative filtering, where user similarity is computed on the basis of their social circles. When α = γ = 0, the function R implements a content-based recommender system, where a user is profiled by the bag-of-entities occurring in the tweets of the user. Finally, when α = β = 0, the most popular items are recommended, regardless of the user profile.
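The linear combination and its three special cases can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `rank_news`, the dictionary-based inputs (per-article scores for one fixed user) and the toy values are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_news(
    social: Dict[str, float],
    content: Dict[str, float],
    popularity: Dict[str, float],
    alpha: float,
    beta: float,
    gamma: float,
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score candidates with R = alpha*Sigma + beta*Gamma + gamma*Pi; return the top k."""
    candidates = set(social) | set(content) | set(popularity)
    scored = [
        (
            n,
            alpha * social.get(n, 0.0)
            + beta * content.get(n, 0.0)
            + gamma * popularity.get(n, 0.0),
        )
        for n in candidates
    ]
    # Highest score first; ties are broken by article id to keep the order stable.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical component scores of three candidate articles for one user.
social = {"n1": 0.5, "n2": 0.0}
content = {"n1": 0.0, "n2": 1.0}
popularity = {"n1": 0.2, "n2": 0.8, "n3": 0.9}

social_only = rank_news(social, content, popularity, alpha=1.0, beta=0.0, gamma=0.0)
content_only = rank_news(social, content, popularity, alpha=0.0, beta=1.0, gamma=0.0)
popularity_only = rank_news(social, content, popularity, alpha=0.0, beta=0.0, gamma=1.0)
```

Setting two of the three weights to zero reproduces the SOCIAL, CONTENT and POPULARITY behaviours described in the paragraph above; any other weight vector blends them.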

Note that Σ, Γ, Π and R are all time dependent. At any given time τ the social network and the set of authored tweets vary, thus affecting Σ and Γ. More importantly, some entities may abruptly become popular, hence of interest to many users. This dependency is captured by Π. While the changes in Σ and Γ derive directly from the tweet stream T and the social network S, the update of Π is non-trivial, and plays a fundamental role in the recommendation system that we describe in the next section.

Recommendation Model R

T.Rex+
System trained with additional features:
- entity hotness (raw number of mentions in news and twitter)
- news click count
- news article age

Given: N = news stream, T = tweet stream, U = set of users.
Find the top-k most relevant news for user u at time τ.

Why Twitter?
Timeliness and personalization. News becomes stale very fast and spreads faster on twitter. Twitter is a good predictor of interest.

How
T.Rex uses a mix of signals to model the relevance of news articles for users: the profile of the social neighborhood of the users, the content of their tweet stream, and topic popularity in the news and across twitter.

Results
T.Rex is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles.

Data
- News: …k articles from Yahoo! news
- Twitter: 1 month of crawled tweets
- Clicks: users of twitter in Yahoo! toolbar logs

Evaluation
We evaluate T.Rex as a click prediction system. We train our model using a learning-to-rank approach and support vector machines. The train and test sets are drawn from click logs.

Claudio [email protected]

Gianmarco De Francisci [email protected]

Aristides [email protected]

Overwhelmed by information overload! Find interesting stories in an ocean of online news articles.

[Figure: News-click delay distribution. x-axis: news-click delay in minutes (ticks at 1, 10, 100, 1000, 10000); y-axis: number of occurrences.]

[Figure: Osama Bin Laden trends. x-axis: time, May-01 h20 to May-03 h08; y-axis: normalized number of occurrences; series: news, twitter, clicks.]

[Figure: Joplin Tornado trends. x-axis: time, May-22 h00 to May-26 h00; y-axis: normalized number of occurrences; series: news, twitter, clicks.]

A(i, j) = 1 iff u_i is the author of tweet t_j
A = authorship matrix (rows: users U, columns: tweets T)

S(i, j) = 1 iff u_i is interested in the content produced by u_j
S = social matrix (rows: users U, columns: users U)

in N according to a user-dependent relevance criterion. We also aim at incorporating time recency into our model, so that our recommendations favor the most recently published news articles.

We now proceed to model the factors that affect the relevance of news for a given user. We first model the social-network aspect. In our case, the social component is induced by the twitter following relationship. We define S to be the social network adjacency matrix, where S(i, j) is equal to 1 divided by the number of users followed by user u_i if u_i follows u_j, and 0 otherwise. We also adopt a functional ranking (Baeza-Yates et al., 2006) that spreads the interests of a user among its neighbors recursively. By limiting the maximum hop distance d, we define the social influence in a network as follows.

Definition 4 (Social influence S∗). Given a set of users U = {u_0, u_1, ...}, organized in a social network where each user may express an interest in the content published by another user, we define the social influence model S∗ as the |U| × |U| matrix where S∗(i, j) measures the interest of user u_i in the content generated by user u_j, and it is computed as

\[ S^{*} = \left( \sum_{i=1}^{d} \sigma^{i} S^{i} \right), \]

where S is the row-normalized adjacency matrix of the social network, d is the maximum hop-distance up to which users may influence their neighbors, and σ is a damping factor.
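The damped sum of powers in Definition 4 can be sketched directly. This is a minimal illustration assuming a small dense 0/1 follow matrix; the function names and the toy network are hypothetical, and a production system would more likely work with sparse matrices.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for row in a
    ]


def row_normalize(adj: Matrix) -> Matrix:
    """S(i, j) = 1 / (number of users followed by u_i) if u_i follows u_j, else 0."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def social_influence(adj: Matrix, d: int, sigma: float) -> Matrix:
    """S* = sum over i = 1..d of sigma^i * S^i, with S the row-normalized adjacency."""
    s = row_normalize(adj)
    n = len(s)
    result = [[0.0] * n for _ in range(n)]
    power = s          # holds S^i, starting from S^1
    weight = sigma     # holds sigma^i, starting from sigma^1
    for _ in range(d):
        for i in range(n):
            for j in range(n):
                result[i][j] += weight * power[i][j]
        power = matmul(power, s)
        weight *= sigma
    return result


# Toy chain: user 0 follows user 1, user 1 follows user 2.
follows = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
s_star = social_influence(follows, d=2, sigma=0.5)
```

With d = 2, user 0 picks up a damped interest in user 2, who is two hops away, in addition to the direct interest in user 1.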

Next we model the profile of a user based on the content that the user has generated. We first define a binary authorship matrix A to capture the relationship between users and the tweets they produce.

Definition 5 (Tweet authorship A). Let A be a |U| × |T| matrix where A(i, j) is 1 if u_i is the author of t_j, and 0 otherwise.

The matrix A can be extended to deal with different types of relationships between users and posts, e.g., by weighing re-tweets or likes differently. In this work, we limit the concept of authorship to the posts actually written by the user.

social interest
S∗(i, j) = level of interest of u_i to the content produced by u_j.

Z = entity space onto which T and N are mapped. We use Wikipedia pages as our entity space.

Z(i) = popularity of entity z_i, updated by tracking mentions in news and twitter with exponential decay.
Z = popularity vector (one value per entity)

M(i, j) = relatedness of tweet t_i to news n_j
M = tweet-to-news matrix (rows: tweets, columns: news)
M = T · N

T(i, j) = relatedness of tweet t_i to entity z_j
T = tweet matrix (rows: tweets, columns: entities)

N(i, j) = relatedness of entity z_i to news n_j
N = news matrix (rows: entities, columns: news)

Page 109: Aris Gionis, "Methods for analyzing user behavior and its application in web search and recommendation"

recommendation model

Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n)

social model Σ(i, j): social relevance of news j to user i

content model Γ(i, j): content relevance of news j to user i

popularity model Π(j): popularity of news article j

Page 113: Aris Gionis, "Methods for analyzing user behavior and its application in web search and recommendation"

popularity update rule

From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendation

Overview · Motivation · Problem · Model · Method · Results

[Figure: T.Rex overview. Inputs: the tweets of the user, the tweets of the followees, the twitter tweet stream, and the stream of news articles, all mapped to entities. T.Rex builds a user model and outputs a personalized ranked list of news articles.]

Table 5.2: MRR, precision and coverage.

Algorithm    MRR    P@1    P@5    P@10   Coverage
RECENCY      0.020  0.002  0.018  0.036  1.000
CLICKCOUNT   0.059  0.024  0.086  0.135  1.000
SOCIAL       0.017  0.002  0.018  0.036  0.606
CONTENT      0.107  0.029  0.171  0.286  0.158
POPULARITY   0.008  0.003  0.005  0.012  1.000
T.REX        0.107  0.073  0.130  0.168  1.000
T.REX+       0.109  0.062  0.146  0.189  1.000

- RECENCY: it ranks news articles by time of publication (most recent first);
- CLICKCOUNT: it ranks news articles by click count (highest count first);
- SOCIAL: it ranks news articles by using T.REX with β = γ = 0;
- CONTENT: it ranks news articles by using T.REX with α = γ = 0;
- POPULARITY: it ranks news articles by using T.REX with α = β = 0.

5.6.5 Results

We report MRR, precision and coverage results in Table 5.2. The two variants of our system, T.REX and T.REX+, have the best results overall.

T.REX+ has the highest MRR of all the alternatives. This result means that our model has a good overall performance across the dataset. CONTENT also has a very high MRR. Unfortunately, the coverage level achieved by the CONTENT strategy is very low. This issue is mainly caused by the sparsity of the user profiles. It is well known that most twitter users belong to the “silent majority,” and do not tweet very much.
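For readers unfamiliar with the metrics, MRR and P@k can be sketched as below. The sketch assumes one clicked article per evaluation event and reads P@k as the fraction of events whose clicked article appears in the top k positions; that reading is an assumption, chosen because it matches the P@10 column for T.REX+ and the "almost 20% of the cases" remark. Function names and the toy events are hypothetical.

```python
from typing import List, Sequence, Tuple

Event = Tuple[Sequence[str], str]  # (ranked list of article ids, clicked article id)


def reciprocal_rank(ranking: Sequence[str], clicked: str) -> float:
    """1 / position of the clicked article, or 0 if it was not ranked at all."""
    for position, article in enumerate(ranking, start=1):
        if article == clicked:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(events: List[Event]) -> float:
    """Average reciprocal rank over all click events."""
    return sum(reciprocal_rank(r, c) for r, c in events) / len(events)


def precision_at_k(events: List[Event], k: int) -> float:
    """Fraction of events whose clicked article is within the top k positions."""
    hits = sum(1 for ranking, clicked in events if clicked in ranking[:k])
    return hits / len(events)


# Hypothetical click events for illustration only.
events: List[Event] = [
    (["a", "b", "c"], "a"),  # clicked article ranked first
    (["a", "b", "c"], "c"),  # clicked article ranked third
    (["a", "b", "c"], "z"),  # clicked article not ranked
    (["b", "a"], "a"),       # clicked article ranked second
]
mrr = mean_reciprocal_rank(events)
p_at_1 = precision_at_k(events, 1)
p_at_2 = precision_at_k(events, 2)
```

An event whose clicked article never appears in the ranking contributes zero to both metrics, which is one way a low-coverage strategy can still be penalized.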

news becomes stale after two days

track mentions in news and tweets with exponential decay

\[ Z_{\tau} = \lambda Z_{\tau-1} + w_T H_T + w_N H_N \]
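One step of the update rule, Z at time τ equal to λ times the previous Z plus weighted mention counts, can be sketched as follows. The sketch assumes that H_T and H_N are per-entity mention counts observed in tweets and in news during the latest time interval; the slide does not spell this out, so treat it as an interpretation. The function name, the dictionary representation and the parameter values are hypothetical.

```python
from typing import Dict


def update_popularity(
    z_prev: Dict[str, float],
    mentions_tweets: Dict[str, int],
    mentions_news: Dict[str, int],
    lam: float,
    w_t: float,
    w_n: float,
) -> Dict[str, float]:
    """Z_tau = lam * Z_{tau-1} + w_t * H_T + w_n * H_N, computed entity by entity."""
    entities = set(z_prev) | set(mentions_tweets) | set(mentions_news)
    return {
        e: lam * z_prev.get(e, 0.0)
        + w_t * mentions_tweets.get(e, 0)
        + w_n * mentions_news.get(e, 0)
        for e in entities
    }


# Hypothetical values: decay 0.5 per step, news mentions weighted twice as much as tweets.
z0 = {"Joplin": 4.0}
z1 = update_popularity(z0, {"Joplin": 2}, {"Osama Bin Laden": 1}, lam=0.5, w_t=1.0, w_n=2.0)
z2 = update_popularity(z1, {}, {}, lam=0.5, w_t=1.0, w_n=2.0)
```

With λ below 1, an entity that stops being mentioned loses a constant fraction of its popularity at every step, so newly mentioned entities quickly dominate older ones.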

Page 114: Aris Gionis, "Methods for analyzing user behavior and its application in web search and recommendation"

model learning and evaluation

Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n)

Yahoo! toolbar data

the recommendation model should rank high the news articles that users click

learn the model using SVM

use clicks and twitter profiles of 3K users to train and test the system
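One common way to turn clicks into training data for an SVM-based ranker is the pairwise (RankSVM-style) construction sketched below: the clicked article should score higher than every other candidate shown for the same event. The slides do not say which formulation was used, so this is an assumed variant, and the function name, feature layout and values are hypothetical.

```python
from typing import Dict, List, Tuple

Features = Tuple[float, ...]  # e.g. (social, content, popularity) scores of an article


def pairwise_examples(
    candidates: Dict[str, Features], clicked: str
) -> List[Tuple[Features, int]]:
    """Build difference vectors: clicked minus non-clicked gets label +1, the reverse -1."""
    positive = candidates[clicked]
    examples: List[Tuple[Features, int]] = []
    for article, negative in sorted(candidates.items()):
        if article == clicked:
            continue
        diff = tuple(p - q for p, q in zip(positive, negative))
        examples.append((diff, +1))
        examples.append((tuple(-x for x in diff), -1))
    return examples


# Hypothetical candidates for one click event; the user clicked "n2".
candidates = {
    "n1": (0.5, 0.0, 0.25),
    "n2": (0.0, 1.0, 0.75),
}
pairs = pairwise_examples(candidates, clicked="n2")
```

A linear classifier trained on such difference vectors yields one weight per feature, which for the three-feature layout above would play the role of α, β and γ in the ranking function.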

Page 115: Aris Gionis, "Methods for analyzing user behavior and its application in web search and recommendation"

systems evaluated

T.rex: basic model using only user profiles

Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n)

T.rex+: additional features

entity hotness

news click count

news article age

Page 116: Aris Gionis, "Methods for analyzing user behavior and its application in web search and recommendation"

results

Page 117: Aris Gionis, "Methods for analyzing user behavior and its application in web search and recommendation"

results

Entities

News

Tweets

From Chatter to Headlines:Harnessing the Real-Time Web

for Personalized News Recommendation

Overview Motivation Problem

Model Method Results

tweetsUser

tweetsFollowee

tweetsFollowee

tweetsFollowee

tweetstwitter

articlesnews

T.Rex

User Model

!

"

#

Personalized ranked list of news articles

Table 5.2: MRR, precision and coverage.

Algorithm MRR P@1 P@5 P@10 CoverageRECENCY 0.020 0.002 0.018 0.036 1.000CLICKCOUNT 0.059 0.024 0.086 0.135 1.000SOCIAL 0.017 0.002 0.018 0.036 0.606CONTENT 0.107 0.029 0.171 0.286 0.158POPULARITY 0.008 0.003 0.005 0.012 1.000T.REX 0.107 0.073 0.130 0.168 1.000T.REX+ 0.109 0.062 0.146 0.189 1.000

RECENCY: it ranks news articles by time of publication (most recent first);CLICKCOUNT: it ranks news articles by click count (highest count first);SOCIAL: it ranks news articles by using T.REX with β = γ = 0;CONTENT: it ranks news articles by using T.REX with α = γ = 0;POPULARITY: it ranks news articles by using T.REX with α = β = 0.

5.6.5 Results

We report MRR, precision and coverage results in Table 5.6.3. The twovariants of our system, T.REX and T.REX+, have the best results overall.

T.REX+ has the highest MRR of all the alternatives. This result meansthat our model has a good overall performance across the dataset. CON-TENT has also a very high MRR. Unfortunately, the coverage level achievedby the CONTENT strategy is very low. This issue is mainly caused by thesparsity of the user profiles. It is well know that most of twitter usersbelong to the “silent majority,” and do not tweet very much.

The SOCIAL strategy is affected by the same problem, albeit to a muchlesser extent. The reason for this difference is that SOCIAL draws froma large social neighborhood of user profiles, instead of just one. So ithas more chances to provide a recommendation. The quality of the rec-ommendation is however quite low, probably because the social-basedprofile only is not able to catch the specific user interests.

It is worth noting that in almost 20% of the cases T.REX+ was able torank the clicked news in the top 10 results. Ranking by the CLICKCOUNT

124

!"#$%&"'()*+'#,%&#$-.%/*"'(0(+$%#$1%2+3"*#4"5

0

2

4

6

8

10

12

14

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Ave

rage D

CG

Rank

T.Rex+T.Rex

PopularityContent

SocialRecency

Click count

63"*#4"%7(0'+8$9"1%28:8,#9(3"%;#($5

T.Rex!"#$%%<8(,10%80"*%)*+=,"0%>*+:%9?(99"*5/#*#:"9"*0%,"#*$"1%>*+:%',('-%1#9#%($%9@"%A#@++B%9++,<#*%,+45C0"0%08))+*9%3"'9+*%:#'@($"0%#$1%,"#*$0%#%*#$-($4%>8$'9(+$5D"8*(09('#,,E%(1"$9(="1%#%4*+8)%+>%FGHI%9?(99"*%80"*0%($%9@"%9++,<#*%#$1%80"1%9@"(*%',('-0%9+%9*#($%#$1%9"09%9@"%0E09":5

What!"#$%%(0%#%$"?%:"9@+1+,+4E%>+*%*"'+::"$1($4%($9"*"09($4%$"?0%9+%80"*0%<E%"J),+(9($4%9@"%($>+*:#9(+$%($%9@"(*%9?(99"*%)"*0+$#5

Content Model Γ&'(')'*'+%?@"*"%&,-./0%(0%9@"%'+$9"$9%*","3#$'"%+>%$"?0%1/'>+*%80"*%2-5

Social Model Σ!3'('45'*')'*'+%?@"*"%3,-./0%(0%9@"%0+'(#,%*","3#$'"%+>%$"?0%1/'>+*%80"*%2-5

Popularity Model Π6'('7'*'8%?@"*"'6,/0%(0%9@"%)+)8,#*(9E%+>%$"?0%#*9(',"%1/5

in updating the popularity counts is to take into account recency: newentities of interest should dominate the popularity counts of older enti-ties. In this work, we choose to update the popularity counts using anexponential decay rule. We discuss the details in Section 5.3.1. However,note that the popularity update is independent of our recommendationmodel, and any other decaying function can be used.

Finally, we propose a ranking function for recommending news articles to users. The ranking function is a linear combination of the scoring components described above. We plan to investigate the effect of non-linear combinations in the future.

Definition 10 (Recommendation ranking Rτ (u, n)). Given the components Στ , Γτ and Πτ , resulting from a stream of news N and a stream of tweets T authored by users U up to time τ , the recommendation score of a news article n ∈ N for a user u ∈ U at time τ is defined as

Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n),

where α, β, γ are coefficients that specify the relative weight of the components.

At any given time, the recommender system produces a set of news recommendations by ranking a set of candidate news, e.g., the most recent ones, according to the ranking function R. To motivate the proposed ranking function we note similarities with popular recommendation techniques. When β = γ = 0, the ranking function R resembles collaborative filtering, where user similarity is computed on the basis of their social circles. When α = γ = 0, the function R implements a content-based recommender system, where a user is profiled by the bag-of-entities occurring in the tweets of the user. Finally, when α = β = 0, the most popular items are recommended, regardless of the user profile.
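The linear ranking function and its three special cases can be checked on a toy example. The Python sketch below scores candidate news for a single user; the function name `rank_news` and the toy scores are hypothetical, and the component values are assumed to be already computed.

```python
from typing import List, Sequence


def rank_news(
    social: Sequence[float],
    content: Sequence[float],
    popularity: Sequence[float],
    alpha: float,
    beta: float,
    gamma: float,
    k: int,
) -> List[int]:
    """Return the indices of the top-k news for one user.

    `social[n]` and `content[n]` are the user's social and content
    scores for news n; `popularity[n]` is user independent.
    """
    scores = [
        alpha * s + beta * c + gamma * p
        for s, c, p in zip(social, content, popularity)
    ]
    order = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
    return order[:k]


social_row = [0.2, 0.0, 0.5]
content_row = [0.1, 0.9, 0.0]
popularity = [0.3, 0.1, 0.6]

# beta = gamma = 0: social signal only.
print(rank_news(social_row, content_row, popularity, 1, 0, 0, k=3))
# alpha = gamma = 0: content-based recommender.
print(rank_news(social_row, content_row, popularity, 0, 1, 0, k=3))
# alpha = beta = 0: most popular items, regardless of the user.
print(rank_news(social_row, content_row, popularity, 0, 0, 1, k=3))
```

Setting two of the three coefficients to zero recovers each single-signal recommender, so the coefficients can be read as a dial between the three strategies.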

Note that Σ, Γ, Π and R are all time dependent. At any given time τ the social network and the set of authored tweets vary, thus affecting Σ and Γ. More importantly, some entities may abruptly become popular, hence of interest to many users. This dependency is captured by Π. While the changes in Σ and Γ derive directly from the tweet stream T and the social network S, the update of Π is non-trivial, and plays a fundamental role in the recommendation system that we describe in the next section.


Recommendation Model R

T.Rex+
System trained with additional features:
- entity hotness (raw number of mentions in news and twitter)
- news click count
- news article age

Given: N = news stream, T = tweet stream, U = set of users

Find the top-k most relevant news for user u at time τ.

Why Twitter?
Timeliness and personalization. News become stale very fast and spread faster on twitter. Twitter is a good predictor of interest.

How
T.Rex uses a mix of signals to model relevance of news articles for users: the profile of the social neighborhood of the users, the content of their tweet stream, and topic popularity in the news and across twitter.

Results
T.Rex is able to predict with good accuracy the news articles clicked by the users and rank them higher than other news articles.

Data
News: [count illegible in source]k articles from Yahoo! news
Twitter: 1 month of crawled tweets
Clicks: users of twitter in Yahoo! toolbar logs

Evaluation
We evaluate T.Rex as a click prediction system. We train our model using a learning-to-rank approach and support vector machines. The train and test sets are drawn from click logs.

Claudio Lucchese
Gianmarco De Francisci Morales
Aristides Gionis

Overwhelmed by information overload!
Find interesting stories in an ocean of online news articles.

[Figure: News-click delay distribution. Number of occurrences (y-axis, 0 to 45) versus news-click delay in minutes (x-axis, log scale from 1 to 10000).]

[Figure: Osama Bin Laden trends. Normalized number of occurrences in news, twitter, and clicks, from May-01 h20 to May-03 h08.]

[Figure: Joplin tornado trends. Normalized number of occurrences in news, twitter, and clicks, from May-22 h00 to May-26 h00.]

A = authorship matrix (U × T)
A(i, j) = 1 iff u_i is the author of tweet t_j

S = social matrix (U × U)
S(i, j) = 1 iff u_i is interested in the content produced by u_j

in N according to a user-dependent relevance criterion. We also aim at incorporating time recency into our model, so that our recommendations favor the most recently published news articles.

We now proceed to model the factors that affect the relevance of news for a given user. We first model the social-network aspect. In our case, the social component is induced by the twitter following relationship. We define S to be the social network adjacency matrix, where S(i, j) is equal to 1 divided by the number of users followed by user u_i if u_i follows u_j, and 0 otherwise. We also adopt a functional ranking (Baeza-Yates et al., 2006) that spreads the interests of a user among its neighbors recursively. By limiting the maximum hop distance d, we define the social influence in a network as follows.

Definition 4 (Social influence S∗). Given a set of users U = {u_0, u_1, . . .}, organized in a social network where each user may express an interest in the content published by another user, we define the social influence model S∗ as the |U| × |U| matrix where S∗(i, j) measures the interest of user u_i in the content generated by user u_j, and it is computed as

$$S^{*} = \left( \sum_{i=1}^{d} \sigma^{i} S^{i} \right),$$

where S is the row-normalized adjacency matrix of the social network, d is the maximum hop-distance up to which users may influence their neighbors, and σ is a damping factor.
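The definition of S∗ can be sketched in a few lines of Python. The example reads the flattened "σiSi" in the source as the i-th power of σ times the i-th power of S, which is an assumption; the helper names `row_normalize`, `matmul`, and `social_influence` are hypothetical, and plain lists are used instead of a matrix library to keep the sketch self-contained.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(adjacency: Matrix) -> Matrix:
    """Divide each row by its sum; rows with no followees stay at zero."""
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def social_influence(adjacency: Matrix, sigma: float, d: int) -> Matrix:
    """Compute S* = sum over i = 1..d of sigma^i * S^i."""
    s = row_normalize(adjacency)
    size = len(s)
    result = [[0.0] * size for _ in range(size)]
    power = s  # holds S^i, starting with i = 1
    for i in range(1, d + 1):
        weight = sigma ** i
        for r in range(size):
            for c in range(size):
                result[r][c] += weight * power[r][c]
        power = matmul(power, s)
    return result


# u0 follows u1, u1 follows u2, u2 follows nobody.
follows = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(social_influence(follows, sigma=0.5, d=2))
```

With d = 2, user u0 inherits a damped share of interest in u2 through u1, even though u0 does not follow u2 directly.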

Next we model the profile of a user based on the content that the user has generated. We first define a binary authorship matrix A to capture the relationship between users and the tweets they produce.

Definition 5 (Tweet authorship A). Let A be a |U| × |T| matrix where A(i, j) is 1 if u_i is the author of t_j, and 0 otherwise.

The matrix A can be extended to deal with different types of relationships between users and posts, e.g., by weighting re-tweets or likes differently. In this work, we limit the concept of authorship to the posts actually written by the user.


social interest
S*(i, j) = level of interest of u_i to the content produced by u_j.

Z = entity space onto which T and N are mapped.
We use Wikipedia pages as our entity space.

Z = popularity vector
Z(i) = popularity of entity z_i
Updated by tracking mentions in news and twitter with exponential decay.

M = tweet-to-news matrix (T × N)
M(i, j) = relatedness of tweet t_i to news n_j
M = T · N

T = tweet matrix (T × Z)
T(i, j) = relatedness of tweet t_i to entity z_j

N = news matrix (Z × N)
N(i, j) = relatedness of entity z_i to news n_j
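To show how these matrices compose, here is a minimal Python sketch that builds the three scoring components from toy inputs. It assumes the products as read from the poster, namely M = T · N, Γ = A · M, Σ = S* · A · M, and Π = Z · N, with N shaped entities × news. The helper names and the toy values are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; inner dimensions must agree."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must agree"
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def build_components(
    s_star: Matrix, a: Matrix, t: Matrix, n: Matrix, z: List[float]
) -> Tuple[Matrix, Matrix, List[float]]:
    """Return (gamma, sigma, pi) for the content, social, popularity models."""
    m = matmul(t, n)                # tweet-to-news relatedness
    gamma = matmul(a, m)            # users x news, via authored tweets
    sigma = matmul(s_star, gamma)   # users x news, via social neighborhood
    pi = matmul([z], n)[0]          # one popularity score per news article
    return gamma, sigma, pi


# 2 users, 3 tweets, 2 entities, 2 news articles.
authorship = [[1, 1, 0], [0, 0, 1]]        # u0 wrote t0 and t1, u1 wrote t2
tweet_entities = [[1, 0], [0, 1], [1, 0]]  # tweets x entities
entity_news = [[1, 0], [0, 1]]             # entities x news
influence = [[0, 0.5], [0, 0]]             # u0 is interested in u1's content
entity_popularity = [2, 1]

print(build_components(influence, authorship, tweet_entities,
                       entity_news, entity_popularity))
```

The social component is the content component of each user's neighbors, weighted by S*, so a user with no tweets can still receive a non-zero social score.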


conclusions

real-time web information can be leveraged to deliver relevant information

future directions

LSI analysis on entities

models for different user clusters

geographic information


summary

review concepts on query-log mining

answering queries directly with useful tips

challenges and opportunities in information dissemination

news recommendations using real-time web

many nice problems and research opportunities


thank you!


references I

Anagnostopoulos, A., Becchetti, L., Castillo, C., and Gionis, A. (2010).

An optimization framework for query recommendation.

In WSDM.

Baeza-Yates, R. A., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. (2007).

The impact of caching on search engines.

In SIGIR.

Boldi, P., Bonchi, F., Castillo, C., Donato, D., Gionis, A., and Vigna, S. (2008).

The query-flow graph: model and applications.

In Proceedings of the 17th ACM conference on Information and knowledge management (CIKM).


references II

Bordino, I., Castillo, C., Donato, D., and Gionis, A. (2010).

Query similarity by projecting the query-flow graph.

In SIGIR.

Craswell, N. and Szummer, M. (2007).

Random walks on the click graph.

In Proceedings of the 30th annual international ACM conference on Research and development in information retrieval (SIGIR).

De Francisci Morales, G., Gionis, A., and Lucchese, C. (2012).

From chatter to headlines: Harnessing the real-time web for personalized news recommendation.

In WSDM.

Szpektor, I., Gionis, A., and Maarek, Y. (2011).

Improving recommendation for long-tail queries via templates.

In WWW.


references III

Weber, I., Ukkonen, A., and Gionis, A. (2011).

Answers, not links: Extracting tips from yahoo! answers to address how-to web queries.

In CIKM.
