Identifying Users’ Topical Tasks in Web Search

W. Hua, Y. Song, H. Wang, Z. Zhou, WSDM 2013

2013/06/30 SEXI2013 reading group

@harapon

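What this paper is about

A search task consists of a query and its reformulations

Task identification matters to a search engine because it helps to

• provide valuable information

• predict the user’s search intent

• suggest queries to the user

Previous approaches use

• temporal features of queries

• lexical features (word-level features)

Many query reformulations are topical, so a reformulated query may share no words with the original (Problem 1)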

Suggesting “cheap US flight” for “flight to LA” → easy

Suggesting “hotel in LA” for “flight to LA” → hard

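Moreover, several tasks may be interleaved within the same search session, so the tasks cannot be recovered simply by ordering the queries by timestamp (Problem 2)

To address these problems, the paper

• builds a similarity measure that compares two queries at the semantic level, which addresses Problem 1, and

• proposes the Sequential Cut and Merge (SCM) algorithm, which works across temporally separated queries and addresses Problem 2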

3. Probase

http://research.microsoft.com/en-us/projects/probase/

There is a gap between the bag-of-words representation and human understanding

Semantic network




3.1 Knowledge Representation

node

• entity (“Barack Obama”)

• concept (“President of America”)

• attribute (e.g., “age”, “color”)

edge

• isA relationship (“Barack Obama” isA “President of America”)

• isAttributeOf relationship (“population” isAttributeOf “country”)

Probase edges are weighted with probabilistic information

• P(instance | concept)

• e.g., P(“poodle” | “dogs”) > P(“pugs” | “dogs”)

• P(concept | instance), P(concept | attribute), and P(attribute | concept) can also be defined

Built by extracting from billions of web pages

4. Methodology

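Task identification comes down to two problems

• How do we quantify the similarity or relatedness between two queries?

• How do we efficiently cluster similar queries within one query session?

Definition of a task

Given a session as a sequence of queries, a task is defined in terms of

• the timestamp of each query

• the similarity between two queries

• a similarity threshold

The queries that make up a task are not necessarily consecutive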

4.1 Similarity Calculation

Four kinds of features are built for query similarity

• conceptual features

• lexical features

• template features

• temporal features

Conceptual features are the main ones

• The motivation is to resolve query ambiguity

• e.g., whether “apple” means the fruit or the company (Apple)

4.1.1 Conceptual Features

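Turn the concepts behind a query into features

Step 1: Parsing

• Split the query into words

• Use the longest word sequence that maps to an instance/attribute in Probase

• Among sequences of equal length, prefer the one linked to more concepts

• For the query “truck driving school pay after training”, the longest instances found in Probase are “truck driving”, “school”, “pay”, and “training”; “driving” on its own is not used

• For the query “tiger woods”, use “tiger woods” rather than “tiger” and “woods”

• This gives a query representation that is easier to interpret than BoW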
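
The parsing rule above (longest match first, ties broken by the number of linked concepts) can be sketched as follows. This is a minimal illustration, not the authors’ implementation: `parse_query`, `TOY_LEXICON`, and the concept counts are hypothetical stand-ins for a real Probase lookup.

```python
def parse_query(query, lexicon):
    """Segment a query into the longest phrases found in `lexicon`.

    `lexicon` maps a phrase to the number of concepts it links to
    (a hypothetical stand-in for a Probase lookup).
    """
    words = query.lower().split()
    n = len(words)

    # Collect every word span that exists in the lexicon.
    candidates = []
    for start in range(n):
        for end in range(start + 1, n + 1):
            phrase = " ".join(words[start:end])
            if phrase in lexicon:
                candidates.append((start, end, phrase))

    # Longest spans first; among equal lengths, more linked concepts first.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), -lexicon[c[2]], c[0]))

    used = [False] * n
    chosen = []
    for start, end, phrase in candidates:
        # Accept a span only if none of its words is already covered.
        if not any(used[start:end]):
            chosen.append((start, phrase))
            for k in range(start, end):
                used[k] = True

    # Return phrases in their original left-to-right order.
    return [phrase for _, phrase in sorted(chosen)]


# Toy lexicon; the concept counts are invented for illustration.
TOY_LEXICON = {
    "truck": 7, "driving": 10, "truck driving": 5, "driving school": 3,
    "school": 8, "pay": 4, "training": 6,
    "tiger": 3, "woods": 2, "tiger woods": 1,
}

print(parse_query("truck driving school pay after training", TOY_LEXICON))
print(parse_query("tiger woods", TOY_LEXICON))
```

With this toy lexicon, “truck driving” wins over “driving school” because both have two words and the former is linked to more concepts; words that are not in the lexicon (“after”) are dropped.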


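Step 2: Conceptualization

• Map a query to a set of instances/attributes

• Infer the concepts that best describe these instances/attributes

• First, identify the candidate concepts of each instance/attribute t

• Here the concept vector is c = (P(c1 | t), P(c2 | t), …, P(cM | t))

• M is the total number of concepts in Probase

• Keep only the top K concepts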
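
As a rough sketch of the top-K step, a concept vector can be held as a sparse dictionary from concept to P(concept | term). The function name, the `TOY_PROBASE` table, and its probabilities are assumptions made up for this example; they are not taken from Probase.

```python
def concept_vector(term, p_concept_given_term, k):
    """Return the top-k concepts of `term` as a sparse vector (dict).

    `p_concept_given_term` maps term -> {concept: P(concept | term)} and
    is a hypothetical stand-in for Probase statistics.
    """
    scores = p_concept_given_term.get(term, {})
    # Highest probability first; ties broken alphabetically for determinism.
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)


# Invented probabilities, for illustration only.
TOY_PROBASE = {
    "apple": {"fruit": 0.5, "company": 0.4, "food": 0.08, "tree": 0.02},
}

print(concept_vector("apple", TOY_PROBASE, k=2))
```

Keeping only K entries makes every later similarity computation cheap, since the full vector has one dimension per Probase concept (M in the slide).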


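Clustering

• Next, to find the multiple topics within a query, group the instances/attributes so that each cluster corresponds to one topic

• e.g., “alabama home insurance” gives “alabama” (“state”) and “home insurance” (“insurance” and “benefits”)

• Build a weighted graph

• Each edge weight is the cosine similarity between the concept vectors of its two nodes

• Remove edges whose cosine similarity is below a threshold; the remaining subgraphs are the instance/attribute clusters

• Cluster r: a subgraph made of a node set (nodes corresponding to instances/attributes in T) and an edge set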
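
The clustering step can be sketched as "drop weak edges, then take connected components". The code below assumes sparse concept vectors stored as dictionaries and uses a small union-find; the function names, the toy vectors, and the threshold value are illustrative assumptions, not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_terms(vectors, threshold):
    """Group terms whose concept vectors are connected by edges >= threshold."""
    terms = list(vectors)
    parent = {t: t for t in terms}

    def find(t):
        # Union-find root lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for idx, a in enumerate(terms):
        for b in terms[idx + 1:]:
            # Keep the edge only if the similarity reaches the threshold.
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())


# Invented concept vectors for the "alabama home insurance" example.
TOY_VECTORS = {
    "alabama": {"state": 0.9, "place": 0.1},
    "home insurance": {"insurance": 0.7, "benefits": 0.3},
}

print(cluster_terms(TOY_VECTORS, threshold=0.5))
```

Because the two toy vectors share no concept, their cosine similarity is 0 and each term ends up in its own cluster, matching the two topics in the slide’s example.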


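Resolving query ambiguity

• Conceptualize the instances/attributes in each cluster r into a concept vector cr

• A Naive Bayes function computes the concepts shared by the concept vectors of the instances/attributes in cluster r

• The instances/attributes are assumed to be mutually independent

• Rank the shared concepts within cluster r

P(ck | tl^r): the k-th value of the concept vector of tl^r, the l-th instance/attribute in cluster r

P(ck): the popularity of concept ck in Probase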
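
One way to read the Naive Bayes step, under the stated independence assumption, is P(ck | t1, …, tL) ∝ ∏ P(ck | tl) / P(ck)^(L−1), which follows from Bayes’ rule. The sketch below assumes that form; the slide’s exact formula did not survive, so treat this as an illustration. The function name and all probabilities are invented.

```python
def combine_naive_bayes(term_vectors, prior):
    """Rank the concepts shared by all terms of one cluster.

    Assumed scoring (not confirmed by the slide):
        score(c) = prod_l P(c | t_l) / P(c) ** (L - 1)
    `term_vectors` is a list of {concept: P(concept | term)} dicts and
    `prior` maps concept -> P(concept), its popularity.
    """
    if not term_vectors:
        return []

    # Only concepts present in every term's vector can get a nonzero score.
    shared = set(term_vectors[0])
    for vec in term_vectors[1:]:
        shared &= set(vec)

    n_terms = len(term_vectors)
    scores = {}
    for concept in shared:
        score = 1.0
        for vec in term_vectors:
            score *= vec[concept]
        scores[concept] = score / (prior[concept] ** (n_terms - 1))

    # Normalize so the scores sum to 1, then rank from best to worst.
    total = sum(scores.values())
    if total > 0:
        scores = {c: s / total for c, s in scores.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented numbers: "apple" next to "ipad" should resolve to the company sense.
apple = {"fruit": 0.5, "company": 0.5}
ipad = {"company": 0.6, "device": 0.4}
prior = {"fruit": 0.2, "company": 0.1, "device": 0.1}

print(combine_naive_bayes([apple, ipad], prior))
```

Here “fruit” and “device” each appear in only one vector, so only “company” survives the intersection, which is the disambiguation effect the slide describes.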


Example of query disambiguation

Conceptualizing the whole query

• Conceptualize the query from the disambiguated concept vectors


Step 3: Calculating conceptual similarity

• From the conceptualization result of each query q (concept vector cq), compute the cosine similarity between queries

Overall flow (figure): query qi → word sequences → instance/attribute set T = (t1, t2, …, tL) → one concept vector per element (t1 → c1, t2 → c2, …, tL → cL) → clustering of t1 … tL → one concept vector per cluster (T1 → c1, …, Tr → cr) → concept vector cq of query qi

4.1.2 Lexical Features

Two ways to measure the BoW similarity between queries

• N-word Jaccard

• e.g., “the car james bond drive” with 2-words gives [“the car”, “car james”, “james bond”, “bond drive”]

• N-char Jaccard

• Defined the same way at the character level

vi: the N-word set of query qi

vi^k: the term frequency of the k-th N-word in set vi

m, n: the sizes of sets vi and vj

ki, kj: the indexes of a common N-word in sets vi and vj

vi^ki, vj^kj: the term frequencies of that common N-word in sets vi and vj
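
The N-word and N-char sets can be sketched as below. Note the assumption: this uses the plain set-based Jaccard coefficient, whereas the definitions above refer to term frequencies, so the slide’s exact weighted formula is not reproduced here. The function names are illustrative.

```python
def n_words(query, n):
    """All runs of n consecutive words in the query."""
    words = query.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def n_chars(query, n):
    """All runs of n consecutive characters in the query."""
    return [query[i:i + n] for i in range(len(query) - n + 1)]


def jaccard(a, b):
    """Plain (unweighted) Jaccard coefficient of two collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


print(n_words("the car james bond drive", 2))
print(jaccard(n_words("flight to LA", 1), n_words("cheap US flight", 1)))
print(jaccard(n_chars("flight to LA", 3), n_chars("flights to LA", 3)))
```

On the 1-word level, “flight to LA” shares exactly one word with both “cheap US flight” and “hotel in LA”, so a purely lexical measure cannot tell the two reformulations apart; this is the gap the conceptual features are meant to close.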

4.1.3 Template Features

The method of Huang et al. (2009)

Covers substring/superstring, add/remove words, stemming, spelling correction, acronym and abbreviation, etc.

In short, an edit distance that captures typos and derived word forms

Levenshtein edit distance

ed(qi, qj): the Levenshtein edit distance between queries qi and qj

len(qi): the length of query qi
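
A minimal sketch of an edit-distance feature built from ed(qi, qj) and len(qi) follows. The normalization 1 − ed / max(len) is an assumption chosen for illustration, since the slide’s formula did not survive; `template_similarity` is a hypothetical name.

```python
def levenshtein(a, b):
    """Levenshtein edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                   # deletion
                cur[j - 1] + 1,                # insertion
                prev[j - 1] + (ch_a != ch_b),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def template_similarity(qi, qj):
    """Assumed normalization: 1 - ed(qi, qj) / max(len(qi), len(qj))."""
    longest = max(len(qi), len(qj))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(qi, qj) / longest


print(levenshtein("kitten", "sitting"))
print(template_similarity("flight to la", "flights to la"))
```

Dividing by the longer query keeps the value in [0, 1], so a single typo in a long query counts for less than the same typo in a short one.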

4.1.4 Temporal Features

Time interval between consecutive queries

The closer two queries are in time, the more likely they belong to the same task

t(qi): the time at which query qi is issued

d(qi): the dwell time of query qi (the sum of the dwell times of the clicks after qi)

4.2 Task Identification

Sequential Cut and Merge (SCM)

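Drawbacks of the existing approaches (figure annotations)

• SC: cannot handle interleaved tasks

• GC: computationally expensive

• over-merge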

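• First apply SC and call the resulting tasks sub-tasks

• Merge the BoW of the queries in each sub-task into a new query that represents the sub-task

• Apply GC to the set of sub-tasks, cutting the edges below the threshold

• Advantages of SCM

• Handles the weakness of SC (interleaved tasks)

• Needs less computation than GC (because GC runs on the higher-level sub-tasks)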
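
The three SCM steps can be sketched as below. Several things here are assumptions: SC is read as a cut between consecutive queries whose similarity is below the threshold, GC as connected components after removing weak edges, and a word-level Jaccard stands in for the paper’s learned similarity. The function names, the toy session, and the threshold are invented.

```python
def word_jaccard(a, b):
    """Toy similarity: Jaccard over lowercased word sets (a stand-in)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def sequential_cut(queries, sim, threshold):
    """Step 1 (SC): start a new sub-task whenever neighbors are dissimilar."""
    if not queries:
        return []
    subtasks = [[queries[0]]]
    for prev, cur in zip(queries, queries[1:]):
        if sim(prev, cur) >= threshold:
            subtasks[-1].append(cur)
        else:
            subtasks.append([cur])
    return subtasks


def graph_cut(items, sim, threshold):
    """Step 3 (GC): index groups connected by edges >= threshold."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if sim(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def scm(queries, sim, threshold):
    """Sequential Cut and Merge: SC, merge each sub-task, then GC."""
    subtasks = sequential_cut(queries, sim, threshold)
    # Step 2: one merged bag-of-words pseudo-query per sub-task.
    merged = [" ".join(sub) for sub in subtasks]
    groups = graph_cut(merged, sim, threshold)
    return [[q for idx in group for q in subtasks[idx]] for group in groups]


SESSION = [
    "flight to LA", "cheap flight LA",
    "python tutorial", "python list tutorial",
    "LA flight deals",
]
print(scm(SESSION, word_jaccard, threshold=0.3))
```

In this toy session SC alone yields three sub-tasks because the flight queries are interrupted by the Python queries; the GC pass over the three merged pseudo-queries then rejoins the two flight sub-tasks while comparing only three items instead of five.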

5. Evaluation and Results

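Sessions were extracted from one day of logs from a commercial browser in May 2012

For simplicity, they were filtered to English-language sessions from US users with at least 10 queries per session

This yielded 45,813 sessions, of which 600 samples were labeled by hand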


Effectiveness of Classifiers and Features

Error Rate


Accuracy of Algorithms

Computational Time

• Fast

GC improves on SC by 12.03% in F-measure and 46.21% in Jaccard; SCM improves on GC by a further 1.49% in F-measure and 12.61% in Jaccard

6. Conclusions

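Built conceptual features using Probase

Combining them with the existing features improves classifier accuracy

For task identification, the SCM algorithm, which combines the existing algorithms, improves both computation time and identification accuracy

Future work

• Fix the problem that “celtics members” and “kevin garnette” are put into the same task

• The former refers to an NBA team, the latter to an NBA player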
