Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li
Presentation by Gonçalo Simões
Course: Recuperação de Informação (Information Retrieval)
SIGIR 2009
Outline
Basic Concepts
Named Entity Recognition in Query
Conclusions
Outline
Basic Concepts
○ Information Extraction
○ Named Entity Recognition
Named Entity Recognition in Query Conclusions
Information Extraction

Information Extraction (IE) proposes techniques to extract relevant information from unstructured or semi-structured texts
The extracted information is transformed so that it can be represented in a fixed format
Named Entity Recognition

Named Entity Recognition (NER) is an IE task that seeks to locate and classify text segments into predefined classes (e.g., Person, Location, Time expression)
Named Entity Recognition
CENTER FOR INNOVATION IN LEARNING (CIL)
EDUCATION SEMINAR SERIES
Joe Mertz & Brian Mckenzie
Center for Innovation in Learning, CMU
ANNOUNCEMENT: We are proud to announce that this Friday, February 17, we will have two sessions in our Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present "Using a Cognitive Architecture to Design Instructions". His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present "Information Extraction: how to automatically learn new models". This session ends around 15:00.
We hope to see you in these sessions
Please direct questions to Pamela Yocca at 268-7675.
Named Entity Recognition
Classes/entities found in the announcement above:
○ Person
○ Location
○ Temporal Expression
NER in IR

NER has been used for some IR tasks
Example: NER + coreference resolution
When Mozart first arrived in Vienna, he’d get up at 6am, settle into composing at his desk by 7, working until 9 or 10 after which he’d make the round of his pupils, taking a break for lunch at 1pm. If there’s no concert, he might get back to work by 5 or 6pm, working until 9pm. He might go out and socialize for a few hours and then come back to work another hour or two before going to bed around 1am. Amadeus preferred getting seven hours of sleep but often made do with five or six ...
NER in IR

Instead of using a bag-of-words representation, we can exploit the fact that the highlighted entities in the passage above ("Mozart", "he", "Amadeus") all refer to the same real-world entity.
Outline

Basic Concepts
Named Entity Recognition in Query
○ Introduction
○ NERQ Problem
○ Notation
○ Probabilistic Approach
○ Probability Estimation
○ WS-LDA Algorithm
○ Training Process
○ Experimental Results
Conclusions
Introduction
71% of the queries in search engines contain named entities
These named entities may be useful to process the query
Introduction
Motivating Examples
Consider the query "harry potter walkthrough"
○ The context of the query strongly indicates that the named entity "harry potter" is a "Game"
Consider the query "harry potter cast"
○ The context of the query strongly indicates that the named entity "harry potter" is a "Movie"
Introduction
Identifying named entities can be very useful. Consider the following examples related to the query "harry potter walkthrough":
○ Ranking: documents about videogames should be pushed up in the rankings (Altavista search)
○ Suggestion: relevant suggestions can be generated, like "harry potter cheats" or "lord of the rings walkthrough"
NERQ Problem
Named Entity Recognition in Query (NERQ) is the task of detecting the named entities within a query and categorizing them into predefined classes
Previous work in this area focused on query log mining rather than on query processing
NERQ Problem
NER vs NERQ
The techniques used in NER are designed for natural language texts
They do not achieve good results for queries because:
○ queries only have 2-3 words on average
○ queries are not well formed (e.g., all letters are typically lower case)
Notation
A single-named-entity query q can be represented as a triple (e, t, c)
○ e denotes a named entity
○ t denotes the context of e in q
○ A context is expressed as α # β, where α and β denote the left and right context respectively, and # denotes a placeholder for the named entity
○ c denotes the class of e

Example: "harry potter walkthrough" is associated with the triple ("harry potter", "# walkthrough", "Game")
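The (e, t, c) notation above can be sketched in code: given a query and the span of the chosen entity, build the "α # β" context string. This is a minimal illustration; the function name and interface are ours, not the paper's.

```python
# Sketch of the (e, t, c) notation: the '#' in the context marks where
# the named entity sits inside the query.

def make_triple(query, start, end, cls):
    """Represent a single-named-entity query as a triple (e, t, c)."""
    words = query.split()
    e = " ".join(words[start:end])     # the named entity e
    alpha = " ".join(words[:start])    # left context α
    beta = " ".join(words[end:])       # right context β
    t = f"{alpha} # {beta}".strip()    # context t = α # β
    return (e, t, cls)

print(make_triple("harry potter walkthrough", 0, 2, "Game"))
# → ('harry potter', '# walkthrough', 'Game')
```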
Probabilistic Approach
The goal of NERQ is to detect the named entity e in query q, and assign the most likely class c, to e
Goal: Find (e,t,c)* such that:
(e,t,c)* = argmax_{(e,t,c)} P(q, e, t, c)
Probabilistic Approach
The goal of NERQ is to detect the named entity e in query q, and assign the most likely class c, to e
Goal: Find (e,t,c)* such that:
(e,t,c)* = argmax_{(e,t,c)} P(q | e,t,c) P(e,t,c)
Probabilistic Approach
The goal of NERQ is to detect the named entity e in query q, and assign the most likely class c, to e
Goal: Find (e,t,c)* such that:

(e,t,c)* = argmax_{(e,t,c) ∈ G(q)} P(e,t,c)

where G(q) is the set of triples consistent with query q; P(q | e,t,c) is 1 for triples in G(q) and 0 otherwise
Probabilistic Approach
For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
P(e,t,c) = P(t,c | e) P(e)
Probabilistic Approach
For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
P(e,t,c) = P(t | c,e) P(c | e) P(e)
Probabilistic Approach
For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)
P(e,t,c) = P(t | c) P(c | e) P(e)
How to estimate these probabilities?
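The decision rule above can be sketched end to end: enumerate the candidate triples G(q) and pick the one maximizing P(t|c) P(c|e) P(e). The probability tables below are toy illustrative values, not the paper's learned model.

```python
# Sketch of NERQ under the factorization P(e,t,c) = P(t|c) P(c|e) P(e).

def candidate_triples(query, entities):
    """Enumerate G(q): all (e, t) pairs where a known entity e appears in q."""
    words = query.split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            e = " ".join(words[start:end])
            if e in entities:
                t = " ".join(words[:start] + ["#"] + words[end:])
                yield e, t

def nerq(query, p_e, p_c_given_e, p_t_given_c):
    """Return the most likely (e, t, c) for a single-named-entity query."""
    best, best_p = None, 0.0
    for e, t in candidate_triples(query, p_e):
        for c, pce in p_c_given_e[e].items():
            p = p_t_given_c.get(c, {}).get(t, 0.0) * pce * p_e[e]
            if p > best_p:
                best, best_p = (e, t, c), p
    return best

# Toy model (illustrative numbers only)
p_e = {"harry potter": 0.6}
p_c_given_e = {"harry potter": {"Game": 0.4, "Movie": 0.6}}
p_t_given_c = {"Game": {"# walkthrough": 0.3}, "Movie": {"# cast": 0.2}}

print(nerq("harry potter walkthrough", p_e, p_c_given_e, p_t_given_c))
# → ('harry potter', '# walkthrough', 'Game')
```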
Probability Estimation
P(t | c), P(c | e) and P(e) can be estimated through training
The input for the training process is:
○ a set of seed named entities with their respective classes
○ a query log
Probability Estimation
Consider the existence of a training data set with N triples from labeled queries
T = {(ei,ti,ci) | i=1,…,N}
With this training data set, the learning problem can be formalized as:
max ∏_{i=1}^{N} P(e_i, t_i, c_i)
Probability Estimation

Building the training corpus for full queries would be difficult and time-consuming when each named entity can belong to several classes
A solution is to collect training data as:
T = {(ei,ti) | i=1,…,N}
and the list of possible classes for each named entity in training
With this training data set, the learning problem can be formalized as:
max ∏_{i=1}^{N} P(e_i, t_i) = max ∏_{i=1}^{N} P(e_i) ∑_c P(c | e_i) P(t_i | c)
Probability Estimation

P(t | c) and P(c | e) can be predicted using a Topic Model
There is a relationship between Topic Model and NERQ notions
Without loss of generality, the authors decided to use a variation of LDA called WS-LDA

NERQ notion      Topic model notion   Symbol
Context          Word                 w_n
Named Entity     Document             w
Class            Topic                z_n
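Under this analogy, each named entity becomes a "document" whose "words" are the contexts it appears with, and classes play the role of topics. A minimal sketch of building such pseudo-documents from (e_i, t_i) training pairs (toy data, not the paper's corpus):

```python
# Group contexts by entity: one bag-of-contexts "document" per named entity,
# ready to be fed to an LDA-style model.

from collections import defaultdict

def build_pseudo_documents(pairs):
    """Map each entity to the list of contexts it was observed with."""
    docs = defaultdict(list)
    for e, t in pairs:
        docs[e].append(t)
    return dict(docs)

pairs = [
    ("harry potter", "# walkthrough"),
    ("harry potter", "# cast"),
    ("harry potter", "# cheats"),
]
print(build_pseudo_documents(pairs))
# → {'harry potter': ['# walkthrough', '# cast', '# cheats']}
```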
WS-LDA Algorithm
Purely unsupervised learning methods for topic models would not work in NERQ, because the induced topics would not necessarily correspond to the predefined classes
WS-LDA introduces weak supervision for training by using a set of named entity seeds
It is assumed that a named entity has high probabilities on labeled classes and very low probabilities on unlabeled classes
WS-LDA Algorithm

Objective function for each named entity:

O(e | y, Θ) = log P(w | Θ) + λ C(y, Θ)

○ y: binary vector that assigns an entity to its respective classes
○ Θ = {α, β}: parameters of the Dirichlet and multinomial distributions used in the process
○ λ: coefficient given by the user that indicates the weight of the supervision constraints
○ C(y, Θ): constraint function
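A schematic sketch of how the objective combines the two terms. The exact form of the constraint C in the paper is not reproduced here; we use a simple stand-in that rewards probability mass on the labeled classes, consistent with the assumption that a named entity should have high probability on its labeled classes. All numbers are illustrative.

```python
# Schematic WS-LDA objective: O = log P(w|Θ) + λ C(y, Θ).
# The constraint below (mass on labeled classes) is an assumption for
# illustration, not the paper's exact C.

def ws_lda_objective(log_likelihood, class_probs, labels, lam):
    """Combine the data log-likelihood with a weak-supervision constraint.

    class_probs: P(c|e) for each class; labels: binary vector y.
    """
    constraint = sum(p for p, y in zip(class_probs, labels) if y == 1)
    return log_likelihood + lam * constraint

# Entity labeled with class 0 only, and most mass on class 0:
obj = ws_lda_objective(-12.5, [0.9, 0.05, 0.05], [1, 0, 0], lam=2.0)
print(round(obj, 2))
# → -10.7
```

A larger λ pushes the model harder toward respecting the seed labels, at the cost of fitting the contexts less closely.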
Training Process

The training process consists of the following steps:
1. Find queries of the query log containing the named entity seeds
2. Generate the contexts associated with the named entity seeds in the queries
3. Generate the query training data (e_i, t_i) to train the WS-LDA topic model
4. Use the topic model to learn P(t | c)
5. Scan the query log with the previously generated contexts to extract new named entities
6. Use the topic model to learn P(c | e) for each new entity
7. Estimate P(e) with the frequency of e in the query log
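Steps 5 and 7 can be sketched as follows: scan a query log with known "α # β" contexts to extract new candidate entities, then estimate P(e) from entity frequency. The naive prefix/suffix matching and the toy data are our illustrative assumptions.

```python
# Sketch of context-based entity extraction (step 5) and frequency-based
# P(e) estimation (step 7). Toy data; matching is deliberately naive.

from collections import Counter

def extract_entities(query_log, contexts):
    """Match each query against 'α # β' contexts; '#' captures an entity."""
    found = []
    for q in query_log:
        for t in contexts:
            alpha, _, beta = t.partition("#")
            alpha, beta = alpha.strip(), beta.strip()
            if q.startswith(alpha) and q.endswith(beta):
                e = q[len(alpha):len(q) - len(beta)].strip()
                if e:
                    found.append(e)
    return found

def estimate_p_e(entities):
    """P(e) ∝ frequency of e among the extracted entities."""
    counts = Counter(entities)
    total = sum(counts.values())
    return {e: n / total for e, n in counts.items()}

log = ["harry potter walkthrough", "zelda walkthrough", "harry potter cast"]
entities = extract_entities(log, ["# walkthrough", "# cast"])
print(estimate_p_e(entities))
```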
Outline
Basic Concepts
Named Entity Recognition in Query
Experimental Results
○ Data Set
○ NERQ by WS-LDA
○ WS-LDA vs Baselines
○ Supervision in WS-LDA
Conclusions
Data Set
6 billion queries
Four semantic classes: "Movie", "Game", "Book" and "Music"
180 seed named entities from Amazon, Gamespot and Lyrics, annotated by four human annotators
○ 120 named entities for training
○ 60 named entities for testing
Data Set
After training a WS-LDA model with the 120 seed named entities:
○ 432,304 contexts
○ About 1.5 million named entities
NERQ by WS-LDA
NERQ conducted on queries from a separate query log with about 12 million queries
140,000 recognition results
Evaluation with 400 randomly sampled queries
NERQ by WS-LDA
Three types of errors:
1. Inaccurate estimation of P(e)
2. Uncommon contexts that were not learned
3. Queries containing named entities outside the predefined classes
WS-LDA vs baselines
Comparison between WS-LDA and two other approaches:
○ A deterministic approach that learns the contexts of a class by aggregating all the contexts of named entities of the class
○ Latent Dirichlet Allocation (LDA)
WS-LDA vs baselines
Modeling Contexts of classes
WS-LDA vs baselines
Modeling Contexts of classes
WS-LDA vs baselines
Class prediction
WS-LDA vs baselines
Convergence speed
Supervision in WS-LDA
How can λ affect the performance of WS-LDA?
Outline
Basic Concepts
Named Entity Recognition in Query
Experimental Results
Conclusions
Conclusions

NERQ is potentially useful in many search applications
This paper is a first approach to NERQ and proposes a probabilistic method to perform this task
WS-LDA is presented as an alternative to LDA
Experimental results indicate that the proposed approach can accurately perform NERQ