Enrich Query Representation by Query Understanding

Gu Xu, Microsoft Research Asia


Mismatching Problem

• Mismatching is a fundamental problem in search
  – Examples: NY ↔ New York, game cheats ↔ game cheatcodes
• Search engine challenges
  – Head (frequent) queries: rich information is available (clicks, query sessions, anchor texts, etc.)
  – Tail (infrequent) queries: information becomes sparse and limited
• Our proposal
  – Enrich both queries and documents, and conduct matching on the enriched representation.

Matching at Different Semantic Levels

Level of semantics: Term, Sense, Topic, Structure

• Term: match exactly the same terms
  – NY ↔ New York
  – disk ↔ disc
• Sense: match terms with the same meanings
  – NY ↔ New York
  – motherboard ↔ mainboard
  – utube ↔ youtube
• Topic: match topics of query and documents
  – Microsoft Office … (Topic: PC Software)
  – … working for Microsoft … my office is in … (Topic: Personal Homepage)
• Structure: match intent with answers (structures of query and document)
  – Microsoft Office home → find homepage of Microsoft Office
  – 21 movie → find movie named 21
  – buy laptop less than 1000 → find online dealers to buy laptop with less than 1000 dollars

Enrich Query Representation

Example query: michael jordan berkele

• Term level
  – Representation: <token>michael</token><token>jordan</token><token>berkele</token>
  – Understanding: tokenization (C# vs. C, 1,000 vs. 1 000, MAX_PATH vs. MAX PATH)
• Sense level
  – Representation: <correction token="berkele">berkeley</correction><similar-queries>michael I. jordan berkeley</similar-queries>
  – Understanding: query refinement and alternative query finding (ill-formed → well-formed); ambiguity: msil or mail; equivalence (or dependency): department or dept, login or sign on
• Topic level
  – Representation: <query-topics>academic</query-topics>
  – Understanding: query classification (definition of classes; accuracy & efficiency)
• Structure level
  – Representation: <person-name>michael jordan</person-name><location>berkeley</location>
  – Understanding: query parsing (named entity segmentation and disambiguation; large-scale knowledge base)
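As a rough illustration of the four-level representation, the following Python sketch assembles the annotations for the example query into one XML tree. The function name `enrich_query`, the wrapper element names (`term-level`, `sense-level`, and so on), and the whitespace tokenizer are assumptions made for this example; only the inner tags come from the slide.

```python
import xml.etree.ElementTree as ET


def enrich_query(query, corrections, similar_queries, topics, entities):
    """Assemble a four-level enriched representation of a query as XML.

    The wrapper elements are hypothetical; the slide only shows the inner tags.
    """
    root = ET.Element("query", text=query)

    # Term level: naive whitespace tokenization (a simplification).
    term = ET.SubElement(root, "term-level")
    for tok in query.split():
        ET.SubElement(term, "token").text = tok

    # Sense level: spelling corrections and similar queries.
    sense = ET.SubElement(root, "sense-level")
    for wrong, right in corrections.items():
        ET.SubElement(sense, "correction", token=wrong).text = right
    ET.SubElement(sense, "similar-queries").text = "; ".join(similar_queries)

    # Topic level: query classes.
    topic = ET.SubElement(root, "topic-level")
    ET.SubElement(topic, "query-topics").text = ", ".join(topics)

    # Structure level: named entities found in the query.
    structure = ET.SubElement(root, "structure-level")
    for tag, value in entities:
        ET.SubElement(structure, tag).text = value

    return root


enriched = enrich_query(
    "michael jordan berkele",
    corrections={"berkele": "berkeley"},
    similar_queries=["michael I. jordan berkeley"],
    topics=["academic"],
    entities=[("person-name", "michael jordan"), ("location", "berkeley")],
)
print(ET.tostring(enriched, encoding="unicode"))
```

In a real system each level would be filled in by its own component (tokenizer, refinement model, classifier, parser); here the values are passed in by hand.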


QUERY REFINEMENT USING CRF-QR (SIGIR’08)

Query Refinement

Example: Papers on Machin Learn → Papers on “Machine Learning”

• Spelling error correction: Machin → Machine
• Inflection: Learn → Learning
• Phrase segmentation: “Machine Learning”

Operations are mutually dependent: spelling error correction, inflection, and phrase segmentation.

Conventional CRF

[Figure: a conventional CRF over the query “papers on machin learn”. Observations X: x0 = papers, x1 = on, x2 = machin, x3 = learn. Labels Y: each y_i can be any word, e.g. y0 ∈ {papers, paper, …}, y1 ∈ {on, in, upon, …}, y2 ∈ {machin, machine, machines, …}, y3 ∈ {learn, learning, learns, …}.]

Intractable: the candidate set for every y_i is unrestricted.

CRF for Query Refinement

[Figure: graphical model with observed words X, refined words Y, and operations O.]

Operation: Description
• Deletion: delete a letter in a word
• Insertion: insert a letter into a word
• Substitution: replace one letter with another
• Exchange: switch two letters in a word
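The four character-level operations can be sketched as a candidate generator: given a query word, it lists every word reachable by one operation and records which operation produced it. This is an illustrative sketch, not the CRF-QR implementation; the function name `candidates` is hypothetical, and "exchange" is assumed here to mean swapping two adjacent letters.

```python
import string


def candidates(word, alphabet=string.ascii_lowercase):
    """Map each string one operation away from `word` to that operation."""
    result = {}
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            # Deletion: drop the first letter of the right part.
            result.setdefault(left + right[1:], "deletion")
        if len(right) > 1:
            # Exchange (assumed adjacent): swap the next two letters.
            result.setdefault(left + right[1] + right[0] + right[2:], "exchange")
        for ch in alphabet:
            # Insertion: add one letter at this position.
            result.setdefault(left + ch + right, "insertion")
            if right and ch != right[0]:
                # Substitution: replace one letter with another.
                result.setdefault(left + ch + right[1:], "substitution")
    # Swapping two identical letters reproduces the word itself; drop it.
    result.pop(word, None)
    return result


ops = candidates("machin")
print(ops["machine"], ops["machi"])
```

The generator produces many non-words (for example "blearn" from "learn"); a model still has to score the candidates, but the search space is now bounded by the operations rather than by the whole vocabulary.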

CRF for Query Refinement

[Figure: for x2 = machin and x3 = learn, the operations o2 and o3 select a small set of candidates for y2 and y3 (machine, machined, machining, machina, macin, machi; learning, learned, lean, lear, blearn, clearn) out of the whole vocabulary (walk, super, soccer, data, the, paper, mp3, book, think, lyrics, new, pc, com, harry, journal, university, net, course, …).]

1. O constrains the mapping from X to Y (reduce space)

CRF for Query Refinement

[Figure: candidates for y2 and y3 grouped by operation. For x2 = machin: Deletion (macin, machi), Insertion (machine, machina), +ed (machined), +ing (machining). For x3 = learn: Deletion (lean, lear), Insertion (blearn, clearn), +ed (learned), +ing (learning). All other words in the vocabulary are excluded.]

1. O constrains the mapping from X to Y (reduce space)
2. O indexes the mapping from X to Y (sharing parameters)


NAMED ENTITY RECOGNITION IN QUERY (SIGIR’09, SIGKDD’09)

Named Entity Recognition in Query

• harry potter
  – harry potter – Movie (0.5)
  – harry potter – Book (0.4)
  – harry potter – Game (0.1)
• harry potter film
  – harry potter – Movie (0.95)
• harry potter author
  – harry potter – Book (0.95)

Challenges

• Named Entity Recognition in Document
• Challenges
  – Queries are short (2-3 words on average)
    • Fewer context features
  – Queries are not well-formed (typos, lower cased, …)
    • Fewer content features
• Knowledge Database
  – Coverage and freshness
  – Ambiguity

Our Approach to NERQ

Example: “Harry Potter Walkthrough” = “Harry Potter” (named entity e) + “# Walkthrough” (context t), with class c = “Game”.

• The goal of NERQ is to find the best triple (e, t, c)* for query q satisfying

$$(e, t, c)^* = \arg\max_{(e,t,c)} p(e, t, c, q) = \arg\max_{(e,t,c) \in G(q)} p(e)\, p(c \mid e)\, p(t \mid c)$$

where G(q) is the set of triples (e, t, c) that can generate q.
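The search for the best triple can be sketched in a few lines of Python. The class probabilities for "harry potter" reuse the numbers from the earlier slide (Movie 0.5, Book 0.4, Game 0.1); every other number, the dictionary names, and the function names `generate_triples` and `best_triple` are made up for illustration and are not taken from the paper.

```python
def generate_triples(query, known_entities, classes):
    """Enumerate candidate (entity, context, class) triples for a query."""
    words = query.lower().split()
    triples = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            entity = " ".join(words[start:end])
            if entity in known_entities:
                # The context is the query with the entity replaced by '#'.
                context = " ".join(words[:start] + ["#"] + words[end:])
                triples.extend((entity, context, c) for c in classes)
    return triples


def best_triple(query, p_e, p_c_given_e, p_t_given_c):
    """Return the triple maximizing p(e) * p(c|e) * p(t|c), or None."""

    def score(triple):
        e, t, c = triple
        return p_e[e] * p_c_given_e[e].get(c, 0.0) * p_t_given_c[c].get(t, 0.0)

    triples = generate_triples(query, p_e, list(p_t_given_c))
    return max(triples, key=score) if triples else None


# Toy probability tables (hypothetical values, except p(c|e) from the slide).
P_E = {"harry potter": 0.01}
P_C_GIVEN_E = {"harry potter": {"Movie": 0.5, "Book": 0.4, "Game": 0.1}}
P_T_GIVEN_C = {
    "Movie": {"#": 0.1, "# film": 0.3, "# walkthrough": 0.01},
    "Book": {"#": 0.1, "# author": 0.3},
    "Game": {"#": 0.1, "# walkthrough": 0.4},
}

print(best_triple("Harry Potter Walkthrough", P_E, P_C_GIVEN_E, P_T_GIVEN_C))
```

With these toy tables the bare query "harry potter" resolves to Movie, while adding the context "# walkthrough" shifts the decision to Game, which mirrors how context words disambiguate the class.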

Training With Topic Model

• Ideal training data: T = {(e_i, t_i, c_i)}

$$\max \prod_i p(e_i, t_i, c_i)$$

• Real training data: T = {(e_i, t_i, *)}
  – Queries are ambiguous (harry potter, harry potter review)
  – Training data are relatively few

$$\max \prod_i \sum_c p(e_i, t_i, c) = \max \prod_i \sum_c p(e_i)\, p(c \mid e_i)\, p(t_i \mid c) = \max \prod_e \prod_{i:\, e_i = e} p(e) \sum_c p(c \mid e)\, p(t_i \mid c)$$
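When the class label is unobserved, the training objective sums over all classes for each (entity, context) pair. A minimal Python sketch of that marginal log-likelihood follows; the function name `log_likelihood` and all probability values except p(c|e) for "harry potter" are hypothetical, and no parameter estimation is shown.

```python
import math


def log_likelihood(pairs, p_e, p_c_given_e, p_t_given_c):
    """Sum over pairs of log( p(e) * sum_c p(c|e) * p(t|c) )."""
    total = 0.0
    for e, t in pairs:
        # Marginalize the hidden class c for this (entity, context) pair.
        marginal = sum(
            p_c * p_t_given_c[c].get(t, 0.0)
            for c, p_c in p_c_given_e[e].items()
        )
        total += math.log(p_e[e] * marginal)
    return total


# Toy tables: every value here is invented for the example.
TOY_P_E = {"harry potter": 0.5}
TOY_P_C_GIVEN_E = {"harry potter": {"Movie": 0.5, "Book": 0.4, "Game": 0.1}}
TOY_P_T_GIVEN_C = {"Movie": {"# film": 0.2}, "Book": {"# film": 0.1}, "Game": {}}

ll = log_likelihood(
    [("harry potter", "# film")], TOY_P_E, TOY_P_C_GIVEN_E, TOY_P_T_GIVEN_C
)
print(ll)
```

Training would adjust the probability tables to increase this quantity; the sketch only evaluates it for fixed tables and assumes every pair has non-zero probability.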

Training With Topic Model (cont.)

$$\max \prod_e \prod_{i:\, e_i = e} p(e) \sum_c p(c \mid e)\, p(t_i \mid c)$$

• Named entities e: harry potter, kung fu panda, iron man, …
• Contexts t: # wallpapers, # movies, # walkthrough, # book price, …
• Classes c (topics): Movie, Game, Book, …

# is a placeholder for the named entity. Here # means “harry potter”.
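The (entity, context) pairs used for training can be sketched as follows: each query containing a known entity is rewritten with the entity replaced by the placeholder #. The function name `extract_contexts`, the seed list, and the sample queries are assumptions for this example.

```python
from collections import Counter


def extract_contexts(queries, seed_entities):
    """Count (entity, context) pairs; the context is the query with the entity as '#'."""
    counts = Counter()
    for query in queries:
        # Normalize case and whitespace, then pad so only whole words match.
        padded = " " + " ".join(query.lower().split()) + " "
        for entity in seed_entities:
            needle = " " + entity + " "
            if needle in padded:
                context = padded.replace(needle, " # ", 1).strip()
                counts[(entity, context)] += 1
    return counts


SEEDS = ["harry potter", "kung fu panda", "iron man"]
QUERIES = [
    "harry potter wallpapers",
    "Harry Potter walkthrough",
    "kung fu panda walkthrough",
    "iron man",
]

for (entity, context), n in sorted(extract_contexts(QUERIES, SEEDS).items()):
    print(entity, "|", context, "|", n)
```

Treating each entity as a document and its contexts as words is what lets a topic model be applied to these pairs.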

Weakly Supervised Topic Model

• Introducing supervisions
  – Supervisions are always better
  – Alignment between implicit topics and explicit classes
• Weak supervisions
  – Label named entities rather than queries (doc. class labels)
  – Multiple class labels (binary indicator)

[Figure: Kung Fu Panda with the candidate classes Movie, Game, and Book; its distribution over classes is unknown (??).]

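WS-LDA

• LDA + soft constraints (w.r.t. supervisions)

$$L(w, y) = \log p(w \mid \Theta) + \lambda\, C(y, \Theta)$$

The first term is the LDA probability and the second term is the soft constraints. Here Θ denotes the model parameters and λ the weight of the constraints; both symbols are reconstructed, since the originals are missing from the source.

• Soft constraints

$$C(y, \Theta) = \sum_i y_i z_i$$

where z_i is the document probability on the i-th class and y_i is the document binary label on the i-th class.

[Figure: two documents with the same binary labels y = (1, 1, 0) and different class probabilities z.]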
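The soft constraint rewards documents whose probability mass falls on their labeled classes. A small Python sketch, assuming the constraint is the sum over classes of the binary label times the document's class probability, and assuming it is added to the LDA log-probability with a weight (called `lam` here); the function names and all numbers except the labels (1, 1, 0) are hypothetical.

```python
def soft_constraint(labels, class_probs):
    """Sum over classes of (binary label) * (document probability on that class)."""
    return sum(y * z for y, z in zip(labels, class_probs))


def ws_lda_objective(log_p_w, labels, class_probs, lam=1.0):
    """LDA log-probability plus the weighted soft constraint (lam is assumed)."""
    return log_p_w + lam * soft_constraint(labels, class_probs)


labels = [1, 1, 0]              # document is labeled with classes 0 and 1
aligned = [0.5, 0.25, 0.25]     # most mass on the labeled classes
misaligned = [0.25, 0.0, 0.75]  # most mass on the unlabeled class

print(soft_constraint(labels, aligned), soft_constraint(labels, misaligned))
```

For the same labels, the aligned distribution gets the larger constraint value, so maximizing the objective pulls the learned topics toward the labeled classes without forcing a hard assignment.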

Extension: Leveraging Clicks

[Figure: classes (Movie, Game, Book) linked to contexts t and to clicked host names t′.]

• Context t: # wallpapers, # movies, # walkthrough, # book price, …
• Clicked host name t′: www.imdb.com, www.wikipedia.com, www.gamespot.com, www.sparknotes.com, cheats.ign.com, …
• URL words, title words, snippet words, content words, other features

Summary

The goal of query understanding is to enrich query representation and essentially solve the problem of term mismatching.


THANKS!