
Learning to Find Answers to Questions


Page 1: Learning to Find Answers to Questions

Learning to Find Answers to Questions

Eugene Agichtein (Columbia University)

Steve Lawrence (NEC Research)

Luis Gravano (Columbia University)

Page 2: Learning to Find Answers to Questions

Motivation

Millions of natural language questions are submitted to web search engines daily.

An increasing number of search services specifically target natural language questions:

AskJeeves – databases of precompiled information, metasearching, other proprietary methods

AskMe.com and similar services – facilitate interaction with human experts

Page 3: Learning to Find Answers to Questions

Problem Statement

Problem/Goal: Find documents containing answers to questions within a collection of text documents

Collection: Pages on the web as indexed by a web search engine

General Method: Transform questions into a set of new queries that maximize the probability of returning answers to the questions, using existing IR systems or Search Engines.

Page 4: Learning to Find Answers to Questions

Example

Question: “What is a hard disk?”

Current search engines might ignore the stopwords, processing the query {hard, disk}, and may return homepages of hard drive manufacturers.

A good answer might include a definition or an explanation of what a hard disk is.

Such answers are likely to contain phrases such as “… is a …”, “… is used to …”, etc…

Submitting queries such as { {hard, disk} NEAR “is a” }, {{hard, disk} NEAR “is used to” }, etc., may bias the search engine to return answers to the original question.
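A minimal sketch of this kind of query rewriting, assuming an illustrative stopword list, answer-phrase list, and AltaVista-style NEAR syntax (none of these are taken from the original system):

```python
# Sketch only: pair the content words of a question with candidate answer
# phrases to form rewritten queries like the examples above.
ANSWER_PHRASES = ['"is a"', '"is used to"', '"refers to"']
STOPWORDS = {"what", "is", "a", "the", "of", "who", "how", "where", "i", "do"}

def content_terms(question: str) -> list:
    """Drop punctuation and stopwords, keeping the content words."""
    words = question.lower().strip("?!. ").split()
    return [w for w in words if w not in STOPWORDS]

def rewrite_queries(question: str) -> list:
    """Combine the content terms with each candidate answer phrase."""
    terms = " ".join(content_terms(question))
    return [f"({terms}) NEAR {phrase}" for phrase in ANSWER_PHRASES]

print(rewrite_queries("What is a hard disk?"))
# ['(hard disk) NEAR "is a"', '(hard disk) NEAR "is used to"', '(hard disk) NEAR "refers to"']
```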

Page 5: Learning to Find Answers to Questions

Method

1. Automatically learn generally applicable question-answer transformations based on the question-answer pairs in the training collection.

2. Automatically probe each IR system (e.g., web search engine) to discover which transformations work better than others for each IR system.

3. At run-time, transform the question using the best transformations from Step 2 and submit to each IR system.


Page 6: Learning to Find Answers to Questions

Background/Related Work

Decades of NLP research on Question-Answering:
Manual methods – linguistics, parsing, heuristics…
Learning-based methods

General approach to Q-A:
Find candidate documents in database (our focus)
Extract answer from documents (traditional focus)

Text REtrieval Conference (TREC) Question-Answering Track:
Retrieve a short (50 or 250 byte) answer to a set of test questions

Page 7: Learning to Find Answers to Questions

Related Work (cont.)

Most systems focus on extracting answers from documents.

Most use variants of standard vector space or probabilistic retrieval models to retrieve documents, followed by heuristics and/or linguistics-based methods to extract the best passages.

Evaluation has focused on questions with precise answers

Abney et al., Cardie et al., …

Page 8: Learning to Find Answers to Questions

Related Work (cont.)

Berger et al. – independently considered statistical models for finding co-occurring terms in question/answer pairs to facilitate answer retrieval (SIGIR 2000).

Lawrence, Giles (IEEE IC 1998):
Queries transformed into specific ways of expressing an answer, e.g., “What is <X>?” is transformed into phrases such as “<X> is” and “<X> refers to”.
Transformations are manually coded and are the same for all search engines.

Glover et al. (SAINT 2001) – Category-specific query modification

Page 9: Learning to Find Answers to Questions

Our Contributions: The Tritus System

Introduced a method for automatically learning multiple query transformations optimized for a specific information retrieval system, with the goal of maximizing the probability of retrieving documents containing the answers.

Developed Tritus, a prototype meta-search engine that is automatically optimized for real-world web search engines.

Performed a thorough evaluation of the Tritus system, comparing it to state-of-the-art search engines.

Page 10: Learning to Find Answers to Questions

Training: The Tritus Training Algorithm

1. Generate question phrases
2. Generate candidate transforms
3. Evaluate candidate transforms on target IR system(s)

Data: 30,000 question-answer pairs from 270 Frequently Asked Question (FAQ) files obtained from the FAQFinder project

Example question-answer pairs:

Question: What is a Lisp Machine (LISPM)?
Answer: A Lisp machine (or LISPM) is a computer which has been optimized to run lisp efficiently and….

Question: What is a near-field monitor?
Answer: A near field monitor is one that is designed to be…

Page 11: Learning to Find Answers to Questions

Training Step 1: Generating Question Phrases

Generate phrases that identify different categories of questions.

For example, the phrase "what is a" in the question "what is a virtual private network?" tells us the goal of the question.

Find commonly occurring n-grams at the beginning of questions
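A minimal sketch of this prefix n-gram counting, with an assumed maximum phrase length and frequency cutoff (illustrative parameters, not the values used in Tritus):

```python
# Sketch only: count n-grams that occur at the beginning of training questions
# and keep the frequent ones as candidate question phrases.
from collections import Counter

def question_phrases(questions, max_len=4, min_count=20):
    """Return frequent prefix n-grams such as 'what is a' or 'how do i'."""
    counts = Counter()
    for q in questions:
        words = q.lower().strip("?!. ").split()
        for n in range(1, min(max_len, len(words)) + 1):
            counts[" ".join(words[:n])] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]
```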

Page 12: Learning to Find Answers to Questions

Training Step 1 (cont.)

Limitations – e.g., “How do I find out what a sonic boom is?”

Advantages of this approach:

Very inexpensive to compute (especially important at run-time)

Domain and language independent, and can be extended in a relatively straightforward fashion for other European languages.

Page 13: Learning to Find Answers to Questions

Training Step 2: Generating Candidate Transforms

Generate candidate terms and phrases for each of the question phrases from the previous stage.

For each question in the training data matching the current question phrase, we rank n-grams in the corresponding answers according to co-occurrence frequency.

To reduce domain bias, candidate transforms with nouns were discarded using a part-of-speech tagger (Brill's)

e.g., the term "telephone" is intuitively not very useful for a question "what is a rainbow?"

Sample candidate transforms when nouns are not excluded for the question phrase "what is a": "the term", "component", "collection of", "a computer", "telephone", "stands for", "unit"
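A minimal sketch of this candidate-generation and noun-filtering step, assuming NLTK's tagger as a stand-in for the Brill tagger mentioned above (pipeline details are illustrative, not the original implementation):

```python
# Sketch only: rank answer n-grams by frequency over questions matching a
# question phrase, discarding n-grams that contain nouns to reduce domain bias.
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger").
from collections import Counter
import nltk

def candidate_transforms(qa_pairs, question_phrase, max_len=3, top_k=50):
    counts = Counter()
    for question, answer in qa_pairs:
        if not question.lower().startswith(question_phrase):
            continue
        words = nltk.word_tokenize(answer.lower())
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    kept = []
    for ngram, count in counts.most_common():
        # Drop candidates containing nouns (tags starting with "NN").
        if any(tag.startswith("NN") for _, tag in nltk.pos_tag(list(ngram))):
            continue
        kept.append((" ".join(ngram), count))
        if len(kept) >= top_k:
            break
    return kept
```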

Page 14: Learning to Find Answers to Questions

Training Step 2: Generating Candidate Transforms (cont.)

We take the top topKphrases n-grams with the highest frequency counts and apply term weighting

Weight calculated as in Okapi BM25 (uses Robertson/Sparck Jones weights)

where r = number of relevant documents containing t, R = number of relevant documents, n = number of documents containing t, N = number of documents in collection

Estimate of selectivity/discrimination of a candidate transform with respect to a specific question type

Weighting extended to phrases

w_t = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}
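A direct transcription of the selectivity weight above as code, using the variable names from the slide:

```python
# Sketch of the Robertson/Sparck Jones selectivity weight w_t defined above.
import math

def rsj_weight(r, R, n, N):
    """w_t = log[ ((r+0.5)/(R-r+0.5)) / ((n-r+0.5)/(N-n-R+r+0.5)) ]"""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))
```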

Page 15: Learning to Find Answers to Questions

Training Step 2: Sample Candidate Transforms

Final term selection weight: tw_t = qtf_t * w_t

where qtf_t = frequency of t in the relevant question type, w_t = term selectivity/discrimination weight, and tw_t = resulting candidate transform weight

Question type qt (question phrase): "what is a"

Candidate transform t   qtf_t   w_t    tw_t
"refers to"             30      2.71   81.3
"refers"                30      2.67   80.1
"meets"                 12      3.21   38.5
"driven"                14      2.72   38.1
"named after"           10      3.63   36.3
"often used"            12      3.00   36.0
"to describe"           13      2.70   35.1

Page 16: Learning to Find Answers to Questions

Training Step 3: Evaluate Candidate Transforms on a Target IR System

Search engines have different ranking methods and treat different queries in different ways (phrases, stop words, etc.)

Candidate transforms grouped into buckets according to length:
Phrases of different length may be treated differently
Top n in each bucket evaluated on target IR system

Page 17: Learning to Find Answers to Questions

Training Step 3 (cont.)

For each question phrase and search engine:
For up to numExamples QA pairs matching the question phrase, sorted by answer length, test each candidate transform
e.g., for the question phrase "what is a", candidate transform "refers to", and question "what is a VPN", the rewritten query { VPN and "refers to" } is sent to each search engine

Similarity of retrieved documents to known answer computed

Final weight for transforms is computed as average similarity between known answers and documents retrieved, across all matching questions evaluated

Query syntax transformed for each search engine, transforms encoded as phrases, "NEAR" operator used for AltaVista [Google reports including term proximity in ranking]
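A minimal sketch of this evaluation loop; `search` and `similarity` are caller-supplied callables standing in for the engine's query interface and the BM25-based answer similarity described on the next slide (they are assumptions, not real APIs):

```python
# Sketch only: a transform's weight for one engine is the average, over
# matching QA pairs, of the best similarity between the known answer and the
# documents retrieved for the rewritten query.
def evaluate_transform(transform, qa_pairs, search, similarity, num_examples=10):
    scores = []
    for question, answer in qa_pairs[:num_examples]:
        rewritten = f'{question} AND "{transform}"'  # engine-specific syntax goes here
        docs = search(rewritten)
        if docs:
            scores.append(max(similarity(answer, doc) for doc in docs))
    return sum(scores) / len(scores) if scores else 0.0
```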

Page 18: Learning to Find Answers to Questions

Computing Similarity of Known Answers and Retrieved Documents

Consider subdocuments of length subdocLen within the retrieved documents, overlapping by subdocLen / 2

Assumption: answers are localized.

Find the maximum similarity of any subdocument with the known answer:

docScore(D) = \max_i BM25_{phrase}(Answer, D_i)

BM25_{phrase} = \sum_{t \in Q} w_t \, \frac{(k_1 + 1)\, tf_t}{K + tf_t} \cdot \frac{(k_3 + 1)\, qtf_t}{k_3 + qtf_t}

where t = term, Q = query, k1 = 1.2, k3 = 1000, K = k1((1 − b) + b·dl/avdl), b = 0.5, dl is the document length in tokens, avdl is the average document length in tokens, w_t is the term relevance weight, tf_t is the frequency of term t in the document, qtf_t is the term frequency within the question phrase (query topic in original BM25), and terms include phrases
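A simplified sketch of this scoring, assuming terms are whitespace tokens and that the trained weights w_t and frequencies qtf_t are supplied as dictionaries (these simplifications are not part of the original system):

```python
# Sketch only: score half-overlapping subdocuments against the known answer
# with a simplified BM25 and keep the maximum, as described above.
def bm25_phrase(query_terms, doc_tokens, w, qtf, avdl, k1=1.2, k3=1000, b=0.5):
    dl = len(doc_tokens)
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t in query_terms:
        tf = doc_tokens.count(t)
        if tf == 0 or t not in w:
            continue
        score += (w[t] * (k1 + 1) * tf / (K + tf)
                       * (k3 + 1) * qtf.get(t, 1) / (k3 + qtf.get(t, 1)))
    return score

def doc_score(answer_terms, document, w, qtf, avdl, subdoc_len=200):
    """Maximum similarity of any half-overlapping subdocument to the answer."""
    tokens = document.lower().split()
    step = max(1, subdoc_len // 2)
    subdocs = [tokens[i:i + subdoc_len] for i in range(0, max(len(tokens), 1), step)]
    return max(bm25_phrase(answer_terms, sd, w, qtf, avdl) for sd in subdocs)
```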

Page 19: Learning to Find Answers to Questions

Sample Transforms

AltaVista:

Transform t     TW_t
"is usually"    377.3
"refers to"     373.2
"usually"       371.6
"refers"        370.1
"is used"       360.1

Google:

Transform t     TW_t
"is usually"    280.7
"usually"       275.7
"called"        256.6
"sometimes"     253.5
"is one"        253.2

Page 20: Learning to Find Answers to Questions

Evaluating Queries at Runtime

Search for matching question phrases, with preference for longer (more specific) phrases

Retrieve corresponding transforms and send transformed queries to the search engine

Compute similarity of returned documents with respect to the transformed query

If a document is retrieved by multiple transforms, use the maximum similarity
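A minimal sketch of this run-time flow; the trained transform table and the `search` and `score` callables are assumed inputs, not part of the original system's interface:

```python
# Sketch only: pick the longest matching question phrase, submit its
# transformed queries, and rank documents by their best similarity score.
def answer_question(question, transforms_by_phrase, search, score, per_transform=10):
    q = question.lower().strip("?!. ")
    # Prefer the longest (most specific) matching question phrase.
    phrase = max((p for p in transforms_by_phrase if q.startswith(p)),
                 key=len, default=None)
    ranked = {}
    for transform in transforms_by_phrase.get(phrase, []):
        rewritten = f'{q} AND "{transform}"'
        for doc in search(rewritten)[:per_transform]:
            # A document retrieved by several transforms keeps its maximum score.
            ranked[doc] = max(score(doc, rewritten), ranked.get(doc, 0.0))
    return sorted(ranked, key=ranked.get, reverse=True)
```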

Page 21: Learning to Find Answers to Questions

Sample Query

Page 22: Learning to Find Answers to Questions

Experimental Setup/Evaluation

Real questions from the query log of the Excite search engine from 12/20/99

Evaluated the following four question types: Where, What, How, and Who
These are the four most common types of questions and account for over 90% of natural language questions to Excite

Random sample of 50 questions extracted for each question type
Potentially offensive queries removed
Checked that queries were not in the training set; none of the evaluation queries were used in any part of the training process

Results from each search engine retrieved in advance
Results shown to evaluators in random order; evaluators do not know which engine produced the results

89 questions evaluated

Page 23: Learning to Find Answers to Questions

Sample Questions Evaluated

Who was the first Japanese player in baseball?
Who was the original singer of fly me to the moon?
Where is the fastest passenger train located?
How do I replace a Chevy engine in my pickup?
How do I keep my refrigerator smelling good?
How do I get a day off school?
How do I improve my vocal range?
What are ways people can be motivated?
What is a sonic boom?
What are the advantages of being unicellular?

Page 24: Learning to Find Answers to Questions

Systems Evaluated

AskJeeves (AJ) – Search engine specializing in answering natural language questions; returns different types of responses, and we parse each different type

Google (GO) – The Google search engine as is

Tritus optimized for Google (TR-GO)

AltaVista (AV) – The AltaVista search engine as is

Tritus optimized for AltaVista (TR-AV)

Page 25: Learning to Find Answers to Questions

Best Performing System

Percentage of questions where a system returns the most relevant documents at document cutoff K. All engines considered best for ties.

Results for the lowest-performing systems are not statistically significant (very small number of queries where they perform best).

Page 26: Learning to Find Answers to Questions

Average Precision

Average precision at document cutoff K

Page 27: Learning to Find Answers to Questions

Precision by Question Type

Results indicate that the advantages of Tritus, and the best underlying search engine to use, vary by question type, but the amount of data limits strong conclusions.

Precision at K for What (a), How (b), Where (c) and Who (d) type questions.

Page 28: Learning to Find Answers to Questions

Document Overlap

[Figure: Overlap of documents retrieved by transformed queries with the original system: top 150 (a), top 10 (b) and relevant of the top 10 (c). Y-axis: Documents Contained (%); X-axis: N; series: TR-GO, TR-AV.]

Page 29: Learning to Find Answers to Questions

Future Research

Combining multiple transformations into a single query

Using multiple search engines simultaneously

Identifying and routing the questions to the best search engines for different question types

Identifying phrase transforms containing content words from the query

Dynamic query submission: using results of initial transformations to guide subsequent transformations

Page 30: Learning to Find Answers to Questions

Summary

Introduced a method for learning query transformations that improves the ability to retrieve documents containing answers to questions from an IR system

In our approach, we:
Automatically classify questions into different question types
Automatically generate candidate transforms from a training set of question/answer pairs
Automatically evaluate transforms on the target IR system(s)

Implemented and evaluated for web search engines

Blind evaluation on a set of real queries shows the method significantly outperforms the underlying search engines for common question types.

Page 31: Learning to Find Answers to Questions

Additional Information

http://tritus.cs.columbia.edu/

Contact the authors:
http://www.cs.columbia.edu/~eugene/
http://www.neci.nj.nec.com/homepages/lawrence/
http://www.cs.columbia.edu/~gravano/

Page 32: Learning to Find Answers to Questions

Assumption

For some common types of natural language questions (e.g., “What is”, “Who is”, etc.) there exist common ways of expressing answers to the question.