Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li
Presentation by Gonçalo Simões
Course: Recuperação de Informação (Information Retrieval)
SIGIR 2009
Outline
Basic Concepts
Named Entity Recognition in Query
Conclusions
Outline
Basic Concepts
○ Information Extraction
○ Named Entity Recognition
Named Entity Recognition in Query Conclusions
Information Extraction

Information Extraction (IE) proposes techniques to extract relevant information from unstructured or semi-structured texts
The extracted information is transformed so that it can be represented in a fixed format
Named Entity Recognition

Named Entity Recognition (NER) is an IE task that seeks to locate and classify text segments into predefined classes (e.g., Person, Location, Time expression)
Named Entity Recognition
CENTER FOR INNOVATION IN LEARNING (CIL)
EDUCATION SEMINAR SERIES
Joe Mertz & Brian Mckenzie
Center for Innovation in Learning, CMU
ANNOUNCEMENT: We are proud to announce that this Friday, February 17, we will have two sessions in our Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present "Using a Cognitive Architecture to Design Instructions". His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present "Information Extraction: how to automatically learn new models". This session ends around 15:00.
We hope to see you in these sessions
Please direct questions to Pamela Yocca at 268-7675.
Named Entity Recognition
Classes/entities found in the announcement above:
○ Person
○ Location
○ Temporal Expression
NER in IR

NER has been used for some IR tasks
Example: NER + coreference resolution
When Mozart first arrived in Vienna, he’d get up at 6am, settle into composing at his desk by 7, working until 9 or 10 after which he’d make the round of his pupils, taking a break for lunch at 1pm. If there’s no concert, he might get back to work by 5 or 6pm, working until 9pm. He might go out and socialize for a few hours and then come back to work another hour or two before going to bed around 1am. Amadeus preferred getting seven hours of sleep but often made do with five or six ...
NER in IR

Instead of using a bag-of-words representation, we can exploit the fact that the highlighted entities in the passage above ("Mozart", "he", "Amadeus") all refer to the same real-world entity.
Outline

Basic Concepts
Named Entity Recognition in Query
○ Introduction
○ NERQ Problem
○ Notation
○ Probabilistic Approach
○ Probability Estimation
○ WS-LDA Algorithm
○ Training Process
○ Experimental Results
Conclusions
Introduction
71% of the queries in search engines contain named entities
These named entities may be useful to process the query
Introduction
Motivating Examples
Consider the query "harry potter walkthrough"
○ The context of the query strongly indicates that the named entity "harry potter" is a "Game"
Consider the query "harry potter cast"
○ The context of the query strongly indicates that the named entity "harry potter" is a "Movie"
Introduction
Identifying named entities can be very useful. Consider the following examples related to the query "harry potter walkthrough":
○ Ranking: documents about videogames should be pushed up in the rankings (Altavista search)
○ Suggestion: relevant suggestions can be generated, like "harry potter cheats" or "lord of the rings walkthrough"
NERQ Problem
Named Entity Recognition in Query (NERQ) is the task of detecting the named entities within a query and categorizing them into predefined classes
Previous work in this area focused on query log mining rather than on query processing
NERQ Problem
NER vs NERQ
The techniques used in NER are designed for natural language texts
They do not achieve good results for queries because:
○ queries only have 2-3 words on average
○ queries are not well formed (e.g., all letters are typically lower case)
Notation
A single-named-entity query q can be represented as a triple (e, t, c)
○ e denotes a named entity
○ t denotes the context of e in q
○ A context is expressed as α # β, where α and β denote the left and right context respectively, and # denotes a placeholder for the named entity
○ c denotes the class of e

Example: "harry potter walkthrough" is associated with the triple ("harry potter", "# walkthrough", "Game")
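The (e, t, c) notation above can be sketched in code: given a query and the span of the chosen entity, build the "α # β" context string. This is a minimal illustration; the function name and interface are ours, not the paper's.

```python
# Sketch of the (e, t, c) notation: the '#' in the context marks where
# the named entity sits inside the query.

def make_triple(query, start, end, cls):
    """Represent a single-named-entity query as a triple (e, t, c)."""
    words = query.split()
    e = " ".join(words[start:end])     # the named entity e
    alpha = " ".join(words[:start])    # left context α
    beta = " ".join(words[end:])       # right context β
    t = f"{alpha} # {beta}".strip()    # context t = α # β
    return (e, t, cls)

print(make_triple("harry potter walkthrough", 0, 2, "Game"))
# → ('harry potter', '# walkthrough', 'Game')
```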
Probabilistic Approach
The goal of NERQ is to detect the named entity e in query q, and assign the most likely class c, to e
Goal: Find (e,t,c)* such that:
(e,t,c)* = argmax_{(e,t,c)} P(q, e, t, c)
Probabilistic Approach
The goal of NERQ is to detect the named entity e in query q, and assign the most likely class c, to e
Goal: Find (e,t,c)* such that:
(e,t,c)* = argmax_{(e,t,c)} P(q | e,t,c) P(e,t,c)
Probabilistic Approach
The goal of NERQ is to detect the named entity e in query q, and assign the most likely class c, to e
Goal: Find (e,t,c)* such that:

(e,t,c)* = argmax_{(e,t,c) ∈ G(q)} P(e,t,c)

where G(q) is the set of triples consistent with query q; P(q | e,t,c) is 1 for triples in G(q) and 0 otherwise
Probabilistic Approach
For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
P(e,t,c) = P(t,c | e) P(e)
Probabilistic Approach
For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
P(e,t,c) = P(t | c,e) P(c | e) P(e)
Probabilistic Approach
For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
P(e,t,c) = P(t | c) P(c | e) P(e)
How to estimate these probabilities?
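The decision rule above can be sketched end to end: enumerate the candidate triples G(q) and pick the one maximizing P(t|c) P(c|e) P(e). The probability tables below are toy illustrative values, not the paper's learned model.

```python
# Sketch of NERQ under the factorization P(e,t,c) = P(t|c) P(c|e) P(e).

def candidate_triples(query, entities):
    """Enumerate G(q): all (e, t) pairs where a known entity e appears in q."""
    words = query.split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            e = " ".join(words[start:end])
            if e in entities:
                t = " ".join(words[:start] + ["#"] + words[end:])
                yield e, t

def nerq(query, p_e, p_c_given_e, p_t_given_c):
    """Return the most likely (e, t, c) for a single-named-entity query."""
    best, best_p = None, 0.0
    for e, t in candidate_triples(query, p_e):
        for c, pce in p_c_given_e[e].items():
            p = p_t_given_c.get(c, {}).get(t, 0.0) * pce * p_e[e]
            if p > best_p:
                best, best_p = (e, t, c), p
    return best

# Toy model (illustrative numbers only)
p_e = {"harry potter": 0.6}
p_c_given_e = {"harry potter": {"Game": 0.4, "Movie": 0.6}}
p_t_given_c = {"Game": {"# walkthrough": 0.3}, "Movie": {"# cast": 0.2}}

print(nerq("harry potter walkthrough", p_e, p_c_given_e, p_t_given_c))
# → ('harry potter', '# walkthrough', 'Game')
```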
Probability Estimation
P(t | c), P(c | e) and P(e) can be estimated through training
The input for the training process is:
○ a set of seed named entities with their respective classes
○ a query log
Probability Estimation
Consider the existence of a training data set with N triples from labeled queries
T = {(ei,ti,ci) | i=1,…,N}
With this training data set, the learning problem can be formalized as:
max ∏_{i=1}^{N} P(e_i, t_i, c_i)
Probability Estimation

Building the training corpus for full queries would be difficult and time-consuming when each named entity can belong to several classes
A solution is to collect training data as:
T = {(ei,ti) | i=1,…,N}
and the list of possible classes for each named entity in training
With this training data set, the learning problem can be formalized as:
max ∏_{i=1}^{N} P(e_i, t_i) = max ∏_{i=1}^{N} P(e_i) ∑_c P(c | e_i) P(t_i | c)
Probability Estimation

P(t | c) and P(c | e) can be predicted using a Topic Model
There is a relationship between Topic Model and NERQ notions
Without loss of generality, the authors decided to use a variation of LDA called WS-LDA

NERQ notion      Topic model notion   Symbol
Context          Word                 w_n
Named Entity     Document             w
Class            Topic                z_n
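Under this analogy, each named entity becomes a "document" whose "words" are the contexts it appears with, and classes play the role of topics. A minimal sketch of building such pseudo-documents from (e_i, t_i) training pairs (toy data, not the paper's corpus):

```python
# Group contexts by entity: one bag-of-contexts "document" per named entity,
# ready to be fed to an LDA-style model.

from collections import defaultdict

def build_pseudo_documents(pairs):
    """Map each entity to the list of contexts it was observed with."""
    docs = defaultdict(list)
    for e, t in pairs:
        docs[e].append(t)
    return dict(docs)

pairs = [
    ("harry potter", "# walkthrough"),
    ("harry potter", "# cast"),
    ("harry potter", "# cheats"),
]
print(build_pseudo_documents(pairs))
# → {'harry potter': ['# walkthrough', '# cast', '# cheats']}
```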
WS-LDA Algorithm
Purely unsupervised learning methods for topic models would not work in NERQ, because the induced topics would not necessarily correspond to the predefined classes
WS-LDA introduces weak supervision for training by using a set of named entity seeds
It is assumed that a named entity has high probabilities on labeled classes and very low probabilities on unlabeled classes
WS-LDA Algorithm

Objective function for each named entity:

O(e | y, Θ) = log P(w | Θ) + λ C(y, Θ)

○ y: binary vector that assigns an entity to its respective classes
○ Θ = {α, β}: parameters of the Dirichlet and multinomial distributions used in the process
○ λ: coefficient given by the user that indicates the weight of the supervision constraints
○ C(y, Θ): constraint function
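A schematic sketch of how the objective combines the two terms. The exact form of the constraint C in the paper is not reproduced here; we use a simple stand-in that rewards probability mass on the labeled classes, consistent with the assumption that a named entity should have high probability on its labeled classes. All numbers are illustrative.

```python
# Schematic WS-LDA objective: O = log P(w|Θ) + λ C(y, Θ).
# The constraint below (mass on labeled classes) is an assumption for
# illustration, not the paper's exact C.

def ws_lda_objective(log_likelihood, class_probs, labels, lam):
    """Combine the data log-likelihood with a weak-supervision constraint.

    class_probs: P(c|e) for each class; labels: binary vector y.
    """
    constraint = sum(p for p, y in zip(class_probs, labels) if y == 1)
    return log_likelihood + lam * constraint

# Entity labeled with class 0 only, and most mass on class 0:
obj = ws_lda_objective(-12.5, [0.9, 0.05, 0.05], [1, 0, 0], lam=2.0)
print(round(obj, 2))
# → -10.7
```

A larger λ pushes the model harder toward respecting the seed labels, at the cost of fitting the contexts less closely.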
Training Process

The training process consists of the following steps:
1. Find queries of the query log containing the named entity seeds
2. Generate the contexts associated with the named entity seeds in the queries
3. Generate the query training data (e_i, t_i) to train the WS-LDA topic model
4. Use the topic model to learn P(t | c)
5. Scan the query log with the previously generated contexts to extract new named entities
6. Use the topic model to learn P(c | e) for each new entity
7. Estimate P(e) with the frequency of e in the query log
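Steps 5 and 7 can be sketched as follows: scan a query log with known "α # β" contexts to extract new candidate entities, then estimate P(e) from entity frequency. The naive prefix/suffix matching and the toy data are our illustrative assumptions.

```python
# Sketch of context-based entity extraction (step 5) and frequency-based
# P(e) estimation (step 7). Toy data; matching is deliberately naive.

from collections import Counter

def extract_entities(query_log, contexts):
    """Match each query against 'α # β' contexts; '#' captures an entity."""
    found = []
    for q in query_log:
        for t in contexts:
            alpha, _, beta = t.partition("#")
            alpha, beta = alpha.strip(), beta.strip()
            if q.startswith(alpha) and q.endswith(beta):
                e = q[len(alpha):len(q) - len(beta)].strip()
                if e:
                    found.append(e)
    return found

def estimate_p_e(entities):
    """P(e) ∝ frequency of e among the extracted entities."""
    counts = Counter(entities)
    total = sum(counts.values())
    return {e: n / total for e, n in counts.items()}

log = ["harry potter walkthrough", "zelda walkthrough", "harry potter cast"]
entities = extract_entities(log, ["# walkthrough", "# cast"])
print(estimate_p_e(entities))
```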
Outline
Basic Concepts
Named Entity Recognition in Query
Experimental Results
○ Data Set
○ NERQ by WS-LDA
○ WS-LDA vs Baselines
○ Supervision in WS-LDA
Conclusions
Data Set
6 billion queries
Four semantic classes: "Movie", "Game", "Book" and "Music"
180 seed named entities from Amazon, Gamespot and Lyrics, annotated by four human annotators
○ 120 named entities for training
○ 60 named entities for testing
Data Set
After training a WS-LDA model with the 120 seed named entities:
○ 432,304 contexts
○ About 1.5 million named entities
NERQ by WS-LDA
NERQ conducted on queries from a separate query log with about 12 million queries
140,000 recognition results
Evaluation with 400 randomly sampled queries
NERQ by WS-LDA
Three types of errors:
1. Inaccurate estimation of P(e)
2. Uncommon contexts that were not learned
3. Queries containing named entities outside the predefined classes
WS-LDA vs baselines
Comparison between WS-LDA and two other approaches:
○ A deterministic approach that learns the contexts of a class by aggregating all the contexts of named entities of the class
○ Latent Dirichlet Allocation (LDA)
WS-LDA vs baselines
Modeling Contexts of classes
WS-LDA vs baselines
Modeling Contexts of classes
WS-LDA vs baselines
Class prediction
WS-LDA vs baselines
Convergence speed
Supervision in WS-LDA
How can λ affect the performance of WS-LDA?
Outline
Basic Concepts
Named Entity Recognition in Query
Experimental Results
Conclusions
Conclusions

NERQ is potentially useful in many search applications
This paper is a first approach to NERQ and proposes a probabilistic method to perform this task
WS-LDA is presented as an alternative to LDA
Experimental results indicate that the proposed approach can accurately perform NERQ