Enrich Query Representation by Query Understanding
Gu Xu, Microsoft Research Asia
Mismatching Problem

• Mismatching is a fundamental problem in search
  – Examples: NY ↔ New York, game cheats ↔ game cheatcodes
• Search engine challenges
  – Head (frequent) queries: rich information available (clicks, query sessions, anchor texts, etc.)
  – Tail (infrequent) queries: information becomes sparse and limited
• Our proposal
  – Enrich both queries and documents, and conduct matching on the enriched representations
Matching at Different Semantic Levels

(Levels of semantics, from lowest to highest: Term → Sense → Topic → Structure)

• Term level: match exactly the same terms
  – NY / New York, disk / disc
• Sense level: match terms with the same meanings
  – NY / New York, motherboard / mainboard, utube / youtube
• Topic level: match topics of query and documents
  – "Microsoft Office" vs. "… working for Microsoft … my office is in …"
  – Topic: PC Software vs. Topic: Personal Homepage
• Structure level: match intent with answers (structures of query and document)
  – "Microsoft Office home" → find the homepage of Microsoft Office
  – "21 movie" → find the movie named 21
  – "buy laptop less than 1000" → find online dealers to buy a laptop for less than 1000 dollars
Enrich Query Representation

Example query: michael jordan berkele

• Term level:
  <token>michael</token><token>jordan</token><token>berkele</token>
• Sense level:
  <correction token="berkele">berkeley</correction><similar-queries>michael I. jordan berkeley</similar-queries>
• Topic level:
  <query-topics>academic</query-topics>
• Structure level:
  <person-name>michael jordan</person-name><location>berkeley</location>
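The layered enrichment above can be sketched as a small data structure. This is an illustrative sketch only: the field names, the toy correction dictionary, and the topic/structure rules are assumptions, not the system described in the talk.

```python
# Hypothetical sketch of a multi-level enriched query representation.
# The correction dictionary and the toy topic/NER rules are illustrative.

def enrich_query(query):
    tokens = query.split()                      # term level
    corrections = {"berkele": "berkeley"}       # toy sense-level dictionary
    corrected = [corrections.get(t, t) for t in tokens]
    return {
        "terms": tokens,
        "sense": {
            "corrections": {t: corrections[t] for t in tokens if t in corrections},
            "similar_queries": [" ".join(corrected)],
        },
        # toy topic rule: a university name hints at an academic query
        "topic": "academic" if "berkeley" in corrected else "unknown",
        # toy structure rule for this specific example query
        "structure": {"person-name": " ".join(corrected[:2]),
                      "location": corrected[-1]},
    }

rep = enrich_query("michael jordan berkele")
print(rep["sense"]["corrections"])   # {'berkele': 'berkeley'}
```

In a real system each level would be produced by its own component (speller, classifier, parser); the point here is only that the levels stack into one representation.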
Query Understanding Tasks (Representation ← Understanding)

• Tokenization
  – C# vs. C, 1,000 vs. 1 000, MAX_PATH vs. MAX PATH
• Query refinement / alternative query finding
  – ill-formed → well-formed
  – Ambiguity: msil or mail; equivalence (or dependency): department or dept, login or sign on
• Query classification
  – Definition of classes; accuracy & efficiency
• Query parsing
  – Named entity segmentation and disambiguation; large-scale knowledge base
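The tokenization examples above can be made concrete with a small sketch contrasting naive punctuation stripping with tokenization that keeps special tokens intact. The whitelist and regex below are assumptions for illustration, not the tokenizer used in the talk.

```python
# Illustrative sketch: naive tokenization breaks queries like "C#",
# "1,000", and "MAX_PATH"; a careful tokenizer keeps them atomic.
# The SPECIAL whitelist and number pattern are toy assumptions.
import re

SPECIAL = {"c#", "c++", ".net"}    # hypothetical whitelist of atomic tokens

def naive_tokenize(query):
    # strips punctuation: "C#" -> "c", "1,000" -> "1 000", "MAX_PATH" -> "max path"
    return re.findall(r"[a-z0-9]+", query.lower())

def careful_tokenize(query):
    out = []
    for piece in query.lower().split():
        if piece in SPECIAL or "_" in piece \
           or re.fullmatch(r"\d{1,3}(?:,\d{3})+", piece):
            out.append(piece)              # keep the token intact
        else:
            out.extend(naive_tokenize(piece))
    return out

print(naive_tokenize("C# MAX_PATH 1,000"))    # ['c', 'max', 'path', '1', '000']
print(careful_tokenize("C# MAX_PATH 1,000"))  # ['c#', 'max_path', '1,000']
```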
QUERY REFINEMENT USING CRF-QR (SIGIR’08)
Query Refinement

Example: papers on machin learn → papers on "machine learning"

• Spelling error correction: machin → machine
• Inflection: learn → learning
• Phrase segmentation: "machine learning"

The operations are mutually dependent: spelling error correction, inflection, and phrase segmentation affect one another.
Conventional CRF

• Input sequence X = (x0, x1, x2, x3) = (papers, on, machin, learn); output sequence Y assigns a refined word to each position (papers/paper, on/in, machin/machine, learn/learning, …)
• If each y_i may be any word in the vocabulary (machine, machines, machined, machining, learned, learning, lear, …), the label space is enormous and inference becomes intractable
CRF for Query Refinement

• Variables: input words X, refined words Y, and refinement operations O

  Operation      Description
  Deletion       Delete a letter in a word
  Insertion      Insert a letter into a word
  Substitution   Replace one letter with another
  Exchange       Switch two letters in a word
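As an illustration of the four letter-level operations (not the CRF-QR model itself, which scores operations rather than enumerating candidates), one can generate all single-edit variants of a word and filter them against a dictionary:

```python
# Illustrative candidate generation with the four edit operations:
# deletion, insertion, substitution, exchange. The dictionary is a toy
# assumption; CRF-QR learns which operation to apply, it does not
# brute-force candidates like this.
import string

def candidates(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}                    # Deletion
    inserts = {a + c + b for a, b in splits for c in letters}        # Insertion
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}  # Substitution
    exchanges = {a + b[1] + b[0] + b[2:]                             # Exchange
                 for a, b in splits if len(b) > 1}
    return deletes | inserts | subs | exchanges

DICT = {"machine", "matching", "machining"}   # toy dictionary
print(sorted(candidates("machin") & DICT))    # ['machine']
```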
CRF for Query Refinement

• Without constraints, each input word (e.g., x2 = machin, x3 = learn) could map to thousands of candidate outputs (lean, machined, machining, machine, learning, learned, clearn, …)
1. O constrains the mapping from X to Y (reduces the search space)
CRF for Query Refinement

1. O constrains the mapping from X to Y (reduces the search space)
2. O indexes the mapping from X to Y (shares parameters): operations such as Deletion, Insertion, +ed, and +ing are defined once and reused across words, so different word pairs share the same operation parameters
NAMED ENTITY RECOGNITION IN QUERY (SIGIR’09, SIGKDD’09)
Named Entity Recognition in Query

• harry potter → harry potter – Movie (0.5), harry potter – Book (0.4), harry potter – Game (0.1)
• harry potter film → harry potter – Movie (0.95)
• harry potter author → harry potter – Book (0.95)
Challenges

• Compared with named entity recognition in documents:
  – Queries are short (2-3 words on average): fewer context features
  – Queries are not well-formed (typos, lowercasing, …): fewer content features
• Knowledge base
  – Coverage and freshness
  – Ambiguity
Our Approach to NERQ

• A query is decomposed into a named entity e, a context t, and a class c:
  – "Harry Potter Walkthrough" → "Harry Potter" (named entity e) + "# Walkthrough" (context t), class c = "Game"
• The goal of NERQ becomes finding the best triple (e, t, c)* for query q:

  (e, t, c)* = argmax_{(e,t,c): q ∈ G(e,t)} p(e, t, c)
             = argmax_{(e,t,c): q ∈ G(e,t)} p(e) p(c|e) p(t|c)
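A toy decoder for this argmax, with hand-set probability tables (all numbers are made-up illustrations; the real model estimates them from search logs):

```python
# Toy NERQ decoder: enumerate (entity, context, class) triples that can
# generate the query and pick the most probable one.
# All probability values below are illustrative assumptions.
ENTITIES = {"harry potter": 0.6, "kung fu panda": 0.4}            # p(e)
CLASS_GIVEN_E = {"harry potter": {"Movie": 0.5, "Book": 0.4,
                                  "Game": 0.1}}                   # p(c|e)
CONTEXT_GIVEN_C = {"Game": {"# walkthrough": 0.7},
                   "Movie": {"# walkthrough": 0.01},
                   "Book": {"# walkthrough": 0.01}}               # p(t|c)

def nerq(query):
    best, best_p = None, 0.0
    for e, pe in ENTITIES.items():
        if e not in query:
            continue
        t = query.replace(e, "#")          # context with entity placeholder
        for c, pce in CLASS_GIVEN_E.get(e, {}).items():
            p = pe * pce * CONTEXT_GIVEN_C.get(c, {}).get(t, 0.0)
            if p > best_p:
                best, best_p = (e, t, c), p
    return best

print(nerq("harry potter walkthrough"))
# ('harry potter', '# walkthrough', 'Game')
```

Note how the context "# walkthrough" overrides the entity prior: even though p(Game | harry potter) is only 0.1, p(# walkthrough | Game) dominates.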
Training With Topic Model

• Ideal training data T = {(e_i, t_i, c_i)} would allow direct maximum likelihood:

  max Π_i p(e_i, t_i, c_i)

• Real training data is T = {(e_i, t_i, *)}: the class labels are unobserved
  – Queries are ambiguous (harry potter, harry potter review)
  – Training data are relatively few
• With c unobserved, we instead maximize the marginal likelihood:

  max Π_i p(e_i, t_i) = max Π_i Σ_c p(e_i) p(c|e_i) p(t_i|c)
Training With Topic Model (cont.)

  max Π_i Σ_c p(e_i) p(c|e_i) p(t_i|c)

• This is exactly a topic model:
  – Named entities e (harry potter, kung fu panda, iron man, …) play the role of documents
  – Contexts t (# wallpapers, # movies, # walkthrough, # book price, …) play the role of words
  – Classes c (Movie, Game, Book, …) play the role of topics
• # is a placeholder for the named entity; here # means "harry potter"
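One standard way to optimize such a marginal likelihood is EM. The sketch below is a generic mixture-model EM on toy data, not the WS-LDA inference procedure the talk actually uses; the data, class set, and symmetry-breaking nudge are all assumptions.

```python
# Generic EM sketch for  max Π_i Σ_c p(e_i) p(c|e_i) p(t_i|c).
# Toy data; not the WS-LDA inference used in the actual work.
from collections import defaultdict

data = [("harry potter", "# walkthrough"), ("harry potter", "# book price"),
        ("kung fu panda", "# walkthrough")]
classes = ["Game", "Book"]
contexts = sorted({t for _, t in data})

# uniform p(c|e); p(t|c) uniform plus a small nudge to break symmetry
p_c_e = {e: {c: 1.0 / len(classes) for c in classes} for e, _ in data}
p_t_c = {c: {t: 1.0 / len(contexts) for t in contexts} for c in classes}
p_t_c["Game"]["# walkthrough"] += 0.1
z = sum(p_t_c["Game"].values())
p_t_c["Game"] = {t: v / z for t, v in p_t_c["Game"].items()}

for _ in range(50):
    # E-step: responsibility of each class for each (entity, context) pair
    resp = []
    for e, t in data:
        w = {c: p_c_e[e][c] * p_t_c[c][t] for c in classes}
        s = sum(w.values())
        resp.append({c: w[c] / s for c in classes})
    # M-step: re-estimate p(c|e) and p(t|c) from the responsibilities
    ce = defaultdict(lambda: defaultdict(float))
    tc = defaultdict(lambda: defaultdict(float))
    for (e, t), r in zip(data, resp):
        for c in classes:
            ce[e][c] += r[c]
            tc[c][t] += r[c]
    for e in ce:
        s = sum(ce[e].values())
        p_c_e[e] = {c: ce[e][c] / s for c in classes}
    for c in classes:
        s = sum(tc[c].values())
        p_t_c[c] = {t: tc[c].get(t, 0.0) / s for t in contexts}

# "kung fu panda" only occurs with "# walkthrough", so it drifts toward
# the class that EM ties to that context (here, the nudged class "Game")
print(max(p_c_e["kung fu panda"], key=p_c_e["kung fu panda"].get))
```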
Weakly Supervised Topic Model

• Introducing supervision
  – Supervision always helps
  – Aligns the implicit topics with the explicit classes
• Weak supervision
  – Label named entities rather than queries (analogous to document class labels)
  – Multiple class labels per entity (binary indicators)
• Example: which classes does "Kung Fu Panda" belong to: Movie, Game, Book? The answer is a distribution over classes.
WS-LDA

• LDA + soft constraints (w.r.t. the supervision)
• Objective: L(w, y) = log p(w) + λ C(z, y)
  – log p(w): LDA likelihood
  – C(z, y) = Σ_i y_i z_i: soft constraints
    • z_i: document probability on the i-th class
    • y_i: document binary label on the i-th class (e.g., y = (1, 1, 0))
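The soft-constraint term itself is simple to compute. A sketch with illustrative numbers (the distribution z, the labels y, the weight λ, and the placeholder likelihood value are all assumptions; in WS-LDA z comes from the inferred topic distribution):

```python
# Sketch of the WS-LDA soft-constraint term C(z, y) = Σ_i y_i z_i, added
# to the LDA log-likelihood with weight λ. All numbers are illustrative.
def soft_constraint(z, y):
    # z: inferred class (topic) distribution; y: binary class labels
    return sum(zi * yi for zi, yi in zip(z, y))

z = [0.7, 0.25, 0.05]   # P(Movie), P(Game), P(Book) for an entity
y = [1, 1, 0]           # labeled Movie and Game, but not Book

lam = 2.0               # constraint weight (assumed)
lda_loglik = -42.0      # placeholder for log p(w)
objective = lda_loglik + lam * soft_constraint(z, y)
print(round(soft_constraint(z, y), 2))   # 0.95
```

The constraint rewards topic distributions that put mass on the labeled classes, which is what aligns the implicit topics with the explicit classes.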
Extension: Leveraging Clicks

• Besides the textual context t (# wallpapers, # movies, # walkthrough, # book price, …), the clicked host names t′ (www.imdb.com, www.wikipedia.com, www.gamespot.com, www.sparknotes.com, cheats.ign.com, …) provide another kind of context for predicting the classes (Movie, Game, Book)
• Further context features: URL words, title words, snippet words, content words, and other features
Summary
The goal of query understanding is to enrich the query representation and thereby address the fundamental problem of term mismatching.
THANKS!