Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li Presentation by Gonçalo Simões Course: Recuperação de Informação SIGIR 2009

Named Entity Recognition in Query


Page 1: Named EntIty Recognition in Query

Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li

Presentation by Gonçalo Simões

Course: Recuperação de Informação

SIGIR 2009

Page 2: Named EntIty Recognition in Query

Outline

• Basic Concepts
• Named Entity Recognition in Query
• Conclusions

Page 3: Named EntIty Recognition in Query

Outline

• Basic Concepts
  ○ Information Extraction
  ○ Named Entity Recognition
• Named Entity Recognition in Query
• Conclusions

Page 4: Named EntIty Recognition in Query

Information Extraction

• Information Extraction (IE) proposes techniques to extract relevant information from unstructured or semi-structured texts
• The extracted information is transformed so that it can be represented in a fixed format

Page 5: Named EntIty Recognition in Query

Named Entity Recognition

• Named Entity Recognition (NER) is an IE task that seeks to locate text segments and classify them into predefined classes (e.g., Person, Location, Time expression)

Page 6: Named EntIty Recognition in Query

Named Entity Recognition

CENTER FOR INNOVATION IN LEARNING (CIL)

EDUCATION SEMINAR SERIES

Joe Mertz & Brian McKenzie

Center for Innovation in Learning, CMU

ANNOUNCEMENT: We are proud to announce that this Friday, February 17, we will have two sessions in our Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present “Using a Cognitive Architecture to Design Instructions”. His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present “Information Extraction: how to automatically learn new models”. This session ends around 15h.

We hope to see you in these sessions

Please direct questions to Pamela Yocca at 268-7675.

Page 7: Named EntIty Recognition in Query

Named Entity Recognition


Classes/entities: Person, Location, Temporal Expression

Page 8: Named EntIty Recognition in Query

NER in IR

• NER has been used for some IR tasks
• Example: NER + coreference resolution

When Mozart first arrived in Vienna, he’d get up at 6am, settle into composing at his desk by 7, working until 9 or 10 after which he’d make the round of his pupils, taking a break for lunch at 1pm. If there’s no concert, he might get back to work by 5 or 6pm, working until 9pm. He might go out and socialize for a few hours and then come back to work another hour or two before going to bed around 1am. Amadeus preferred getting seven hours of sleep but often made do with five or six ...

Page 9: Named EntIty Recognition in Query

NER in IR

• Instead of using a bag of words, exploit the fact that the highlighted mentions in the passage above (such as “Mozart” and “Amadeus”) correspond to the same real-world entity

Page 10: Named EntIty Recognition in Query

Outline

• Basic Concepts
• Named Entity Recognition in Query
  ○ Introduction
  ○ NERQ Problem
  ○ Notation
  ○ Probabilistic Approach
  ○ Probability Estimation
  ○ WS-LDA Algorithm
  ○ Training Process
• Experimental Results
• Conclusions

Page 11: Named EntIty Recognition in Query

Introduction

71% of the queries in search engines contain named entities

These named entities may be useful to process the query

Page 12: Named EntIty Recognition in Query

Introduction

Motivating examples

• Consider the query “harry potter walkthrough”
  ○ The context of the query strongly indicates that the named entity “harry potter” is a “Game”
• Consider the query “harry potter cast”
  ○ The context of the query strongly indicates that the named entity “harry potter” is a “Movie”

Page 13: Named EntIty Recognition in Query

Introduction

Identifying named entities can be very useful. Consider the following examples related to the query “harry potter walkthrough”:

• Ranking: documents about videogames should be pushed up in the rankings (Altavista search)
• Suggestion: relevant suggestions can be generated, like “harry potter cheats” or “lord of the rings walkthrough”

Page 14: Named EntIty Recognition in Query

NERQ Problem

Named Entity Recognition in Query (NERQ) is a task that tries to detect the named entities within a query and categorize them into predefined classes

Previous work in this area focused on query log mining rather than on query processing

Page 15: Named EntIty Recognition in Query

NERQ Problem

NER vs NERQ

• The techniques used in NER are designed for natural language texts
• They do not perform well on queries because:
  ○ queries have only 2-3 words on average
  ○ queries are not well formed (e.g., all letters are typically lower case)

Page 16: Named EntIty Recognition in Query

Notation

A single-named-entity query q can be represented as a triple (e,t,c)

• e denotes a named entity
• t denotes the context
  ○ A context is expressed as α # β, where α and β denote the left and right context respectively and # denotes a placeholder for the named entity
• c denotes the class of e

Example

• “harry potter walkthrough” is associated with the triple (“harry potter”, “# walkthrough”, “Game”)
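As a minimal sketch of this notation, the Python snippet below represents a query as an (e, t, c) triple and derives the context string by replacing the entity with the # placeholder. The names `Triple` and `context_of` are illustrative and do not come from the paper, and whitespace tokenization is an assumption made for simplicity.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    entity: str   # e, the named entity
    context: str  # t, written as "alpha # beta"
    label: str    # c, the class of e


def context_of(query: str, entity: str) -> Optional[str]:
    """Return the context of `entity` in `query`, or None if the entity
    does not occur as a contiguous run of whole words."""
    q_words = query.split()
    e_words = entity.split()
    n = len(e_words)
    if n == 0:
        return None
    for i in range(len(q_words) - n + 1):
        if q_words[i:i + n] == e_words:
            # Left context, placeholder, right context.
            return " ".join(q_words[:i] + ["#"] + q_words[i + n:])
    return None


query = "harry potter walkthrough"
triple = Triple("harry potter", context_of(query, "harry potter"), "Game")
```

With this representation, a query such as "watch harry potter online" would yield the context "watch # online", with a non-empty left and right part.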

Page 17: Named EntIty Recognition in Query

Probabilistic Approach

The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e

Goal: Find (e,t,c)* such that:

(e,t,c)* = argmax(e,t,c) P(q,e,t,c)

Page 18: Named EntIty Recognition in Query

Probabilistic Approach

The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e

Goal: Find (e,t,c)* such that:

(e,t,c)* = argmax(e,t,c) P(q | e,t,c) P(e,t,c)

Page 19: Named EntIty Recognition in Query

Probabilistic Approach

The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e

Goal: Find (e,t,c)* such that:

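(e,t,c)* = argmax(e,t,c) ∈ G(q) P(e,t,c)

Here G(q) is the set of triples that can generate q: P(q | e,t,c) is 1 when the triple generates q and 0 otherwise, so the maximization only needs to range over G(q)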

Page 20: Named EntIty Recognition in Query

Probabilistic Approach

For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)

P(e,t,c) = P(t,c | e) P(e)

Page 21: Named EntIty Recognition in Query

Probabilistic Approach

For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)

P(e,t,c) = P(t | c,e) P(c | e) P(e)

Page 22: Named EntIty Recognition in Query

Probabilistic Approach

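For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c)

P(e,t,c) = P(t | c) P(c | e) P(e)

This last step assumes that the context t depends only on the class c and not on the specific entity e, so that P(t | c,e) = P(t | c)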
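The factorization can be illustrated with a small Python sketch that enumerates candidate triples for a query and picks the one maximizing P(t | c) P(c | e) P(e). Everything here is an assumption made for illustration: the probability tables are invented toy numbers, not values from the paper, and the helper names (`candidates`, `score`, `recognize`) are hypothetical. Candidates are formed by treating every contiguous word span that is a known entity as e and the rest of the query as the context.

```python
from typing import Iterator, Optional, Tuple

# Toy probability tables; every number is invented for illustration only.
P_E = {"harry potter": 0.02}
P_C_GIVEN_E = {"harry potter": {"Movie": 0.5, "Book": 0.3, "Game": 0.2}}
P_T_GIVEN_C = {
    "Movie": {"# cast": 0.05, "# walkthrough": 0.0001},
    "Book": {"# cast": 0.001, "# walkthrough": 0.0001},
    "Game": {"# cast": 0.0005, "# walkthrough": 0.04},
}

TripleT = Tuple[str, str, str]


def candidates(query: str) -> Iterator[TripleT]:
    """Yield every (e, t, c) whose entity is a known contiguous span of the query."""
    words = query.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            entity = " ".join(words[i:j])
            if entity not in P_E:
                continue
            context = " ".join(words[:i] + ["#"] + words[j:])
            for label in P_C_GIVEN_E.get(entity, {}):
                yield entity, context, label


def score(entity: str, context: str, label: str) -> float:
    """P(e, t, c) = P(t | c) * P(c | e) * P(e); unseen contexts get probability 0."""
    p_t = P_T_GIVEN_C.get(label, {}).get(context, 0.0)
    return p_t * P_C_GIVEN_E[entity][label] * P_E[entity]


def recognize(query: str) -> Optional[TripleT]:
    """Return the highest-scoring triple, or None if no known entity occurs."""
    return max(candidates(query), key=lambda tr: score(*tr), default=None)
```

With these toy tables, the context "# walkthrough" pulls "harry potter" towards "Game", while "# cast" pulls it towards "Movie", even though "Movie" has the highest prior P(c | e).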

How to estimate these probabilities?

Page 23: Named EntIty Recognition in Query

Probability Estimation

P(t | c), P(c | e) and P(e) can be estimated through training

The input for the training process is:

• A set of seed named entities with their respective classes
• A query log

Page 24: Named EntIty Recognition in Query

Probability Estimation

Consider the existence of a training data set with N triples from labeled queries

T = {(e_i, t_i, c_i) | i = 1, …, N}

With this training data set, the learning problem can be formalized as:

$$\max \prod_{i=1}^{N} P(e_i, t_i, c_i)$$

Page 25: Named EntIty Recognition in Query

Probability Estimation

Building the training corpus for full queries would be difficult and time-consuming when each named entity can belong to several classes

A solution is to collect training data as:

T = {(e_i, t_i) | i = 1, …, N}

and the list of possible classes for each named entity in training

With this training data set, the learning problem can be formalized as:

$$\max \prod_{i=1}^{N} P(e_i, t_i) = \max \prod_{i=1}^{N} P(e_i) \sum_{c} P(c \mid e_i)\, P(t_i \mid c)$$
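To make the objective concrete: for training pairs (e_i, t_i) without class labels, the quantity being maximized is the product over i of P(e_i) times the sum over classes c of P(c | e_i) P(t_i | c). The sketch below evaluates the logarithm of that product for given probability tables. It only evaluates the objective and does not perform the optimization; the function name `log_likelihood` and the dictionary layout are assumptions, not part of the paper.

```python
import math
from typing import Dict, Iterable, Tuple


def log_likelihood(
    pairs: Iterable[Tuple[str, str]],
    p_e: Dict[str, float],
    p_c_given_e: Dict[str, Dict[str, float]],
    p_t_given_c: Dict[str, Dict[str, float]],
) -> float:
    """Sum over pairs of log( P(e) * sum_c P(c | e) * P(t | c) )."""
    total = 0.0
    for entity, context in pairs:
        # Marginalize the unobserved class c.
        marginal = sum(
            p_c * p_t_given_c.get(label, {}).get(context, 0.0)
            for label, p_c in p_c_given_e[entity].items()
        )
        joint = p_e[entity] * marginal
        if joint == 0.0:
            # A pair with zero probability makes the whole product zero.
            return float("-inf")
        total += math.log(joint)
    return total
```

Working in log space turns the product into a sum, which avoids numerical underflow when N is large.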


Page 27: Named EntIty Recognition in Query

Probability Estimation

• P(t | c) and P(c | e) can be predicted using a Topic Model
• There is a relationship between Topic Model and NERQ notions
• Without loss of generality, the authors decided to use a variation of LDA called WS-LDA

Query          Document    Symbol
Context        Word        w_n
Named Entity   Document    w
Class          Topic       z_n

Page 28: Named EntIty Recognition in Query

WS-LDA Algorithm

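Unsupervised learning methods for topic models would not work well in NERQ, because the classes are predefined and the learned topics would not necessarily align with them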

WS-LDA introduces weak supervision for training by using a set of named entity seeds

It is assumed that a named entity has high probabilities on labeled classes and very low probabilities on unlabeled classes

Page 29: Named EntIty Recognition in Query

WS-LDA Algorithm

Objective function for each named entity:

O(e | y, Θ) = log P(w | Θ) + λ C(y, Θ)

• y: binary vector that assigns an entity to its respective classes
• Θ = {α, β}: parameters of the Dirichlet distribution and the Multinomial distribution used in the process
• λ: coefficient given by the user that indicates the weight of the supervision constraints
• C(y, Θ): constraint function

Page 30: Named EntIty Recognition in Query

Training Process

The training process consists of the following steps:

1. Find the queries in the query log containing the named entity seeds
2. Generate the contexts associated with the named entity seeds in the queries
3. Generate the query training data (e_i, t_i) to train the WS-LDA topic model
4. Use the topic model to learn P(t | c)
5. Scan the query log with the previously generated contexts to extract new named entities
6. Use the topic model to learn P(c | e) for each new entity
7. Estimate P(e) with the frequency of e in the query log
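The data-preparation steps above (1 to 3, plus the frequency estimate of step 7) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: queries are tokenized on whitespace, P(e) is taken as the relative frequency of e among all extracted entity occurrences, and the function names `extract_pairs` and `estimate_p_e` are hypothetical. The topic-model steps (4 to 6) are not implemented here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def extract_pairs(query_log: Iterable[str], seeds: Iterable[str]) -> List[Tuple[str, str]]:
    """Steps 1-3: find queries containing a seed entity and emit (entity, context) pairs."""
    seed_words = [(seed, seed.split()) for seed in seeds]
    pairs = []
    for query in query_log:
        words = query.split()
        for seed, s_words in seed_words:
            n = len(s_words)
            if n == 0:
                continue
            for i in range(len(words) - n + 1):
                if words[i:i + n] == s_words:
                    # Replace the entity with the "#" placeholder to get the context.
                    context = " ".join(words[:i] + ["#"] + words[i + n:])
                    pairs.append((seed, context))
    return pairs


def estimate_p_e(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Step 7 (assumed form): relative frequency of each entity among all occurrences."""
    counts = Counter(entity for entity, _ in pairs)
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}


log = ["harry potter walkthrough", "harry potter cast", "halo 3 cheats", "weather today"]
pairs = extract_pairs(log, ["harry potter", "halo 3"])
p_e = estimate_p_e(pairs)
```

The resulting (entity, context) pairs are the kind of input that the slides describe as training data for the WS-LDA topic model.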

Page 31: Named EntIty Recognition in Query

Outline

• Basic Concepts
• Named Entity Recognition in Query
• Experimental Results
  ○ Data Set
  ○ NERQ by WS-LDA
  ○ WS-LDA vs Baselines
  ○ Supervision in WS-LDA
• Conclusions

Page 32: Named EntIty Recognition in Query

Data Set

• 6 billion queries
• Four semantic classes: “Movie”, “Game”, “Book” and “Music”
• 180 seed named entities from Amazon, Gamespot and Lyrics, annotated by four human annotators
  ○ 120 named entities for training
  ○ 60 named entities for testing

Page 33: Named EntIty Recognition in Query

Data Set

After training a WS-LDA model with the 120 seed named entities:

• 432,304 contexts
• About 1.5 million named entities

Page 34: Named EntIty Recognition in Query

NERQ by WS-LDA

• NERQ conducted on queries from a separate query log with about 12 million queries
• 140,000 recognition results
• Evaluation with 400 randomly sampled queries

Page 35: Named EntIty Recognition in Query

NERQ by WS-LDA

Three types of errors:

1. Inaccurate estimation of P(e)
2. Uncommon contexts that were not learned
3. Queries containing named entities outside the predefined classes

Page 36: Named EntIty Recognition in Query

WS-LDA vs baselines

Comparison between WS-LDA and two other approaches:

• A deterministic approach that learns the contexts of a class by aggregating all the contexts of the named entities of the class
• Latent Dirichlet Allocation (LDA)

Page 37: Named EntIty Recognition in Query

WS-LDA vs baselines

Modeling Contexts of classes

Page 38: Named EntIty Recognition in Query

WS-LDA vs baselines

Modeling Contexts of classes

Page 39: Named EntIty Recognition in Query

WS-LDA vs baselines

Class prediction

Page 40: Named EntIty Recognition in Query

WS-LDA vs baselines

Convergence speed

Page 41: Named EntIty Recognition in Query

Supervision in WS-LDA

How does λ affect the performance of WS-LDA?

Page 42: Named EntIty Recognition in Query

Outline

• Basic Concepts
• Named Entity Recognition in Query
• Experimental Results
• Conclusions

Page 43: Named EntIty Recognition in Query

Conclusions

• NERQ is potentially useful in many search applications
• This paper is a first approach to NERQ and proposes a probabilistic approach to perform this task
• WS-LDA is presented as an alternative to LDA
• Experimental results indicate that the proposed approach can accurately perform NERQ