Finding needles in the haystack: Search and candidate generation

J. Chu-Carroll, J. Fan, B. K. Boguraev, D. Carmel, D. Sheinwald, C. Welty

A key phase in the DeepQA architecture is Hypothesis Generation, in which candidate system responses are generated for downstream scoring and ranking. In the IBM Watson system, these hypotheses are potential answers to Jeopardy! questions and are generated by two components: search and candidate generation. The search component retrieves content relevant to a given question from Watson’s knowledge resources. The candidate generation component identifies potential answers to the question from the retrieved content. In this paper, we present strategies developed to use characteristics of Watson’s different knowledge sources and to formulate effective search queries against those sources. We further discuss a suite of candidate generation strategies that use various kinds of metadata, such as document titles or anchor texts in hyperlinked documents. We demonstrate that a combination of these strategies brings the correct answer into the candidate answer pool for 87.17% of all the questions in a blind test set, facilitating high end-to-end question-answering performance.

Introduction

A key component in the IBM Watson* system is Hypothesis Generation, which is the process of producing possible answers to a given question. These candidate answers are scored by the Evidence Gathering and Hypothesis Scoring components [1–4] and are ranked by the Final Merging and Ranking component [5] to produce the final ranked list of answers. Since the outcome of Hypothesis Generation represents all possible candidates that the system will consider, it is crucial that a wide net be cast at this stage. It is also important, however, that the system includes the correct answer among its candidates without overwhelming the downstream answer scorers with noise. Too many wrong candidates reduce system efficiency and can potentially hamper the system’s ability to identify the correct answer from the overly large pool of candidates.

In the IBM Watson system, the Hypothesis Generation phase consists of two components: search and candidate generation. In its search component, Watson adopts a multipronged approach to retrieve relevant content from its diverse knowledge resources. The search results may be documents or passages from unstructured sources or arguments that satisfy partially instantiated predicates from structured sources. Watson’s search strategies extend the passage search approach adopted by most existing question-answering (QA) systems in two ways. First, Watson employs specific search strategies to exploit the relationship between titles and content in title-oriented documents (e.g., encyclopedia articles) to improve search recall. Second, Watson uses structured resources through queries based on syntactic and semantic relations extracted from the question.

The candidate generation component identifies potential answers to a question from the retrieved unstructured content. Most existing QA systems adopt a type-based approach to candidate generation with respect to a predefined type ontology. However, because of the broad range of lexical answer types (LATs) [6] observed in the Jeopardy!** domain, Watson relies on a type-independent approach to its primary candidate generation strategies: those of producing candidate answers from unstructured content. It uses the knowledge inherent in human-generated text and associated metadata, as well as syntactic and lexical cues in the search results, to identify salient concepts from text and hypothesize them as candidate answers.

©Copyright 2012 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

Digital Object Identifier: 10.1147/JRD.2012.2186682

Note that the strategies described in this paper focus on questions whose answers can plausibly be found in Watson’s standard knowledge sources. These questions constitute the vast majority of Jeopardy! questions and typically focus on entities and facts that are true about them in the world. Questions that require specialized techniques for producing candidate answers are described in [7]. Examples of these special question types are those in the “RHYME TIME” category, whose answers are typically made up of phrases of words that rhyme with each other. Excluding these special question types, our combined search and candidate generation strategies placed the correct answer in the candidate answer list for 87.17% of the questions in a 3,344-question unseen test set, facilitating high end-to-end QA performance.

Question analysis overview

In this section, we briefly discuss Watson’s question analysis output leveraged by the search and candidate generation components. The output encapsulates syntactic and semantic analysis of the question, as described in [4, 6, 8], including dependency parse, semantic relations, LAT, and focus.

For parsing, we employ ESG (English Slot Grammar), a comprehensive deep Slot Grammar parser [8]. Each node in its dependency parse tree contains 1) a headword and its associated morpho-lexical, syntactic, and semantic features and 2) a list of “children” that are generally modifiers of the node, along with the slot that each modifier fills. The focus detection component identifies the question focus, which is the part of the question that refers to the answer and is often the head of the noun phrase with demonstrative determiners “this” or “these”. The relation recognizer annotates certain types of unary and binary relations in the question; of particular interest here are semantic relations recognized over the focus whose predicates can be mapped to codified relationships in Watson’s structured knowledge resources, such as actorIn and authorOf. Finally, the LAT detection module identifies the LAT of the correct answer to the question. The LAT is typically a term in the focus but may also include other nouns that are co-referential with the focus in the question and/or category.

Consider the following Jeopardy! question:

(1) MOVIE-“ING”: Robert Redford and Paul Newman starred in this depression-era grifter flick. (Answer: “The Sting”)

The focus identification component recognizes the focus to be “flick”, the head of the noun phrase “this depression-era grifter flick”. Since the focus “flick” does not have any co-referential terms in the question, the only LAT is the focus. Finally, the relation detection component recognizes two relation instances: actorIn(Robert Redford, flick:focus) and actorIn(Paul Newman, flick:focus). These recognized relations are used in the Answer Lookup component described below.
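To make this output concrete, the following minimal sketch shows one way the focus, LAT, and detected relations for Example (1) could be represented downstream. The class and field names are illustrative assumptions, not Watson’s actual (UIMA-based) data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Relation:
    predicate: str    # e.g., "actorIn"
    argument: str     # named entity in the question, e.g., "Robert Redford"
    filler: str       # the slot tied to the answer, here the focus "flick"

@dataclass
class QuestionAnalysis:
    focus: str                     # "flick"
    lats: List[str]                # lexical answer types
    relations: List[Relation] = field(default_factory=list)

# Question analysis output for Example (1), as described in the text.
example_1 = QuestionAnalysis(
    focus="flick",
    lats=["flick"],
    relations=[
        Relation("actorIn", "Robert Redford", "flick:focus"),
        Relation("actorIn", "Paul Newman", "flick:focus"),
    ],
)
```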

Search and candidate generation overview

The DeepQA architecture is designed to be a large-scale hypothesis generation, evidence-gathering, and scoring architecture [9]. Figure 1 illustrates this architecture, focusing on how Watson’s search and candidate generation components fit into the overall QA pipeline. The Hypothesis Generation phase takes as input results from question analysis, summarized in the previous section. The first four primary search components in the diagram show Watson’s Document and Passage search strategies, which target unstructured knowledge resources such as encyclopedia documents and newswire articles [10]. On the other hand, the last two search components, namely, Answer Lookup and PRISMATIC search, use different types of structured resources. One or more candidate generation techniques are applied to each search result to produce plausible candidate answers. In our effort, we explored how to exploit the relationship between the title and content of title-oriented documents (such as encyclopedia articles; see below for more detail) and how to use metadata present in linked data (such as web documents) to help with effective candidate generation.

Figure 1

DeepQA architecture with details for search and candidate generation. (Modified and used, with permission, from D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, “Building Watson: an overview of the DeepQA project,” AI Magazine, vol. 31, no. 3, pp. 59–79, 2010; ©2010 Association for the Advancement of Artificial Intelligence.)

Searching unstructured resources

Watson’s text corpora contain both title-oriented documents, such as encyclopedia documents, and non-title-oriented sources, such as newswire articles [10]. For title-oriented sources, the document title is typically an entity or a concept, and the content of the document provides salient information about that entity or concept. This relationship between document title and content inspired us to devise special search strategies that take advantage of it for more effective search. We analyzed Jeopardy! question/answer pairs and the title-oriented documents that provide answers to the questions. We observed three possible relationships between the question/answer pair and those relevant documents.

In the first case, the correct answer is the title of the document that answers the question. For example, consider the Jeopardy! question “This country singer was imprisoned for robbery and in 1972 was pardoned by Ronald Reagan.” The Wikipedia** article for Merle Haggard, the correct answer, mentions him as a country singer, his imprisonment for robbery, and his pardon by Reagan and is therefore an excellent match for the question. Jeopardy! questions, which often contain multiple facts about their answers, are frequently well matched by encyclopedia documents that cover most of the facts in the question. To exploit the strong association between the title and content of title-oriented documents, Watson

adopts a document-oriented search strategy to find documents that, as a whole, best match the question.

In the second case, the title of a document that answers the question is in the question itself. For instance, consider the question “Aleksander Kwasniewski became the president of this country in 1995.” The first sentence of the Wikipedia article on Aleksander Kwasniewski states, “Aleksander Kwasniewski is a Polish socialist politician who served as the President of Poland from 1995 to 2005.” Motivated by this observation, Watson adopts a passage search strategy against a subcorpus of documents, consisting of those documents whose titles appear in the question. We refer to this search strategy as a TIC (Title-in-Clue) Passage search.

In the third case, the answer-justifying document is a third-party document whose title is neither the answer nor in the question. We expect traditional passage search strategies adopted in existing QA systems [11] to be effective on those questions.

Although the three sets of search results have a fair degree of overlap, we have empirically shown that each strategy brings a unique contribution to the aggregated search results and thus helps improve the overall recall of our search process [12]. We discuss how we formulate search queries and how the three search strategies are performed in Watson in the remainder of this section.

Search query generation

Search queries for a question are generated from the results of question analysis. For all questions, a full query is generated on the basis of content terms or phrases extracted from the question, as well as any LAT detected in the category. Arguments of relations recognized over the focus are considered more salient query terms and are weighted higher. For instance, in Example (1) above, the following full query is shown, where the arguments of the actorIn relations are given empirically determined higher weights:

(2.0 “Robert Redford”) (2.0 “Paul Newman”) star depression era grifter (1.5 flick)

For those questions where the LAT has modifiers, as in the current example, a LAT-only query is generated in

addition to the full query, based on the LAT and its modifiers. In this example, the full noun phrase that contains the LAT is “this depression-era grifter flick”; therefore, the LAT-only query contains the terms

depression era grifter flick

The motivation for the LAT-only query is that in some cases, the LAT and its modifiers uniquely identify the answer or narrow the candidate answer space to a small number of possibilities. This example falls into the latter case. Some LAT-only queries that identify a unique entity include capital Ontario and first 20th century US president. Whereas some LAT-only queries, such as French composer and 20th century playwright, are too imprecise to be useful, we have found that, overall, these LAT-only queries are helpful in increasing system performance.

In the search strategies described below, the full query is applied to both Document search and Passage search. The LAT-only queries are applied to Passage search only because these queries tend to be short and are less effective for matching against full documents. The different queries and search strategies produce results that are aggregated for further processing.

In determining the sizes of the search hit lists, we attempted to strike a balance between increased candidate recall, processing time for all candidates, and the effect of the incorrect candidates on Watson’s ability to rank the correct answer in top position. We empirically determined the hit list sizes reported below on the basis of experiments that measure hit list sizes against candidate recall and end-to-end system performance.
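As an illustration of the query formulation described above, the sketch below assembles the weighted full query of Example (1). The helper name and weighting scheme are assumptions for illustration only; the 2.0 and 1.5 weights are the empirically determined values quoted in the text.

```python
def build_full_query(content_terms, relation_args, lats,
                     relation_weight=2.0, lat_weight=1.5):
    """Assemble a weighted term query from question-analysis output.

    Arguments of relations recognized over the focus get a higher weight,
    the LAT gets a smaller boost, and all other content terms keep the
    default weight of 1.0 (written without an explicit weight).
    """
    parts = []
    for term in content_terms:
        if term in relation_args:
            parts.append(f'({relation_weight} "{term}")')
        elif term in lats:
            parts.append(f'({lat_weight} {term})')
        else:
            parts.append(term)
    return " ".join(parts)

print(build_full_query(
    content_terms=["Robert Redford", "Paul Newman", "star",
                   "depression", "era", "grifter", "flick"],
    relation_args={"Robert Redford", "Paul Newman"},
    lats={"flick"},
))
# (2.0 "Robert Redford") (2.0 "Paul Newman") star depression era grifter (1.5 flick)
```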

Document search

Watson’s Document search component uses the open-source Indri search engine [13] and targets title-oriented documents that, as a whole, best match the question. Two separate Indri indices are used, one consisting of long documents such as encyclopedia articles and the other of short documents such as dictionary entries. The two separate indices are necessary because the significant size differences between the documents caused highly relevant short documents to be drowned out by longer documents when combined in one index. The full query, constructed as previously described, is used to retrieve the top 50 most relevant long documents and the top 5 most relevant short documents. A Document search rank and a search score are associated with each result, which are used as features for scoring candidate answers, in conjunction with additional features generated by other downstream answer scorers.

Passage search

To retrieve relevant passages from its unstructured knowledge resources, Watson extends common existing approaches [11] along two dimensions. First, we adopt the TIC Passage search strategy to target search against a small subset of highly relevant documents, i.e., those whose titles appear in the question. Second, we leverage multiple search engines in our implementation of traditional passage search to increase diversity and hence recall of the results. To this end, we adopt Indri’s existing passage search capability and extend Lucene [14] to support passage retrieval. Indri and Lucene passage retrieval differ in two major aspects. The first concerns the retrieval model used, where Indri uses a combination of language modeling and inference network for ranking [15], and Lucene leverages tf (term frequency) and idf (inverse document frequency) for scoring relevance [16]. The second key difference is in the implementation of passage retrieval, which we discuss in the section on Lucene Passage search below.

Regardless of the strategy or search engine used, Watson’s passage search components return one- to two-sentence passages scored to be most relevant to a given question. Watson retrieves ten passages each from TIC Passage search and Passage search. For the latter, five passages come from Indri and five from Lucene. Our empirical results show that an aggregation of five passages from each search engine achieves higher recall than retrieving ten passages using either search engine alone. The passage rank is used as a feature for scoring candidate answers extracted from that passage. The passage score feature is not used because the search scores returned by the two search engines are not comparable.

Indri Passage search

Indri supports passage retrieval through the use of the prefix #passage[X:Y] for a query. This prefix specifies that passages be evaluated within an X-word window, shifting Y words at a time, using a scoring algorithm analogous to that for document retrieval. In Watson’s implementation, X is set to 20 and Y to 6, to balance the quality of results and speed. Watson’s passage search component enhances Indri’s native passage search results for a QA application in two ways. First, we extend each 20-word passage at both ends to sentence boundaries so that the results can be analyzed by our natural-language processing (NLP) components [4, 8]. Second, we augment Indri’s native scoring algorithm to account for coverage of query terms, i.e., rewards are given to passages that cover a large portion of query terms over those that match a small fraction at high frequency. Our experiments showed that the rescoring process led to passages that are more likely to contain the answer to the question.
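The coverage-based rescoring can be sketched as a simple interpolation between the native Indri passage score and the fraction of distinct query terms the passage covers. The formula and the alpha weight below are assumed stand-ins for illustration, not the rescoring actually used in Watson; the #passage window parameters are the ones quoted above.

```python
# An Indri passage query over a 20-word window shifted 6 words at a time
# might look like: "#passage[20:6]( #combine( depression era grifter flick ) )"

def coverage_rescore(passage_tokens, query_terms, indri_score, alpha=0.5):
    """Reward passages that cover many distinct query terms over passages
    that match a small fraction of the terms at high frequency."""
    tokens = {t.lower() for t in passage_tokens}
    covered = sum(1 for q in query_terms if q.lower() in tokens)
    coverage = covered / max(len(query_terms), 1)
    return alpha * indri_score + (1 - alpha) * coverage
```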

Title-in-Clue Passage search using Indri

Watson’s TIC Passage search component uses characteristics of title-oriented documents to focus search on a subset of potentially more relevant documents. Its implementation

makes use of Indri’s support for dynamic subcorpus specification and a dictionary that maps canonical Wikipedia document titles¹ (e.g., Naomi) to all Wikipedia documents with that string as their title [e.g., Naomi (band) and Naomi (Bible), among others]. The dictionary is used to identify document titles in the question, and a subcorpus of documents is dynamically constructed by aggregating all target documents in the matched dictionary entries. Although the same query is used as in Indri Passage search, we found that by constraining the corpus to a much smaller and potentially more relevant subset, this search strategy can succeed when the general passage search strategy fails.

¹Wikipedia document titles contain disambiguation information for nonprimary senses of ambiguous titles. For example, “Naomi (band)” is the title for the article describing the German electronic duo, and “Naomi (Bible)” is the title for the article about the biblical character Naomi.
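A minimal sketch of the subcorpus construction, assuming the title dictionary is an in-memory mapping from a canonical title (e.g., “Naomi”) to the identifiers of all documents carrying that title; the naive substring match below stands in for proper title spotting over the question text.

```python
def tic_subcorpus(question, title_dictionary):
    """Aggregate the target documents of every dictionary entry whose
    canonical title occurs in the question; the resulting document set is
    then passed to Indri as a dynamically specified subcorpus."""
    subcorpus = set()
    question_lower = question.lower()
    for title, doc_ids in title_dictionary.items():
        if title.lower() in question_lower:
            subcorpus.update(doc_ids)
    return subcorpus

titles = {"Aleksander Kwasniewski": ["Aleksander Kwasniewski"]}
print(tic_subcorpus(
    "Aleksander Kwasniewski became the president of this country in 1995.",
    titles,
))
```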

Lucene Passage search

Indri’s approach to passage search is to treat each passage as a “mini-document” and to apply the same document-scoring algorithm to these passages. However, this approach did not fare well in Lucene. Instead, we observed that passages extracted from documents that are highly relevant to the question are more likely to contain the answer. We therefore introduced a two-phase passage retrieval process to Lucene. First, Lucene Document search is used to retrieve relevant documents. Second, passages are extracted from these documents and are ranked according to several criteria discussed below.

Our Lucene Document search component adopts a modification of Lucene’s built-in similarity-scoring scheme, introduced in [17]. It primarily consists of improvements in term frequency normalization and in document length normalization. The search query is constructed from question keywords and phrases and uses Lucene’s support for lexical affinities [18].

Our extension to Lucene for passage search consists of evaluating each single-sentence passage from top-scoring documents separately using a set of query-independent features and a relevance measure between the sentence and the query. We identified three query-independent features that affect the a priori probability of a sentence’s relevance to a Jeopardy! question, as follows.

• Sentence offset: Sentences that appear closer to the beginning of the document are more likely to be relevant; therefore, sentence offset is used as a scoring feature.

• Sentence length: Longer sentences are more likely to be relevant than shorter sentences; thus, sentence length is adopted as a feature.

• Number of named entities: Sentences containing more named entities are more likely to be relevant, since most Jeopardy! answers are named entities. We approximate the recognition of named entities in documents through occurrences of anchor texts and document titles in each sentence.

For estimating the relevance of a passage for a given search query, given that the passage is extracted from a document relevant to the query, we found that a simple measure of keyword and phrase matching outperforms more sophisticated alternatives. This score is then multiplied by the search score of the document from which the passage is extracted.

The query-dependent similarity score is combined with the query-independent scores described above to determine the overall search score for each passage. In order to increase recall, each top-scoring single-sentence passage is expanded to include the sentence that precedes it in the document. This expansion is helpful because co-reference targets in the current sentence can often be found in the preceding sentence.
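The combination just described might be sketched as follows. The feature weights and normalizations are illustrative placeholders rather than the values tuned for Watson, and named-entity counts are assumed to come from the anchor-text/title approximation mentioned above.

```python
def score_sentence(sentence, query_terms, doc_score, offset, num_named_entities,
                   w_offset=0.2, w_length=0.1, w_entities=0.2):
    """Score a single-sentence passage from a top-ranked Lucene document."""
    tokens = sentence.lower().split()
    # Query-dependent part: simple keyword coverage, weighted by the score
    # of the document the sentence was extracted from.
    matched = sum(1 for q in query_terms if q.lower() in tokens)
    relevance = (matched / max(len(query_terms), 1)) * doc_score
    # Query-independent priors: earlier, longer, entity-rich sentences are
    # a priori more likely to be relevant.
    prior = (w_offset * (1.0 / (1 + offset))
             + w_length * min(len(tokens) / 30.0, 1.0)
             + w_entities * min(num_named_entities / 5.0, 1.0))
    return relevance + prior
```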

Searching structured resources

In addition to searching over unstructured resources, Watson also attempts to identify relevant content from structured syntactic and semantic resources. Watson adopts two primary approaches toward searching against two different types of structured resources, as shown in Figure 1. The first approach, Answer Lookup, targets existing knowledge sources encoding semantic relations, such as DBpedia [19]. The second approach, PRISMATIC search, uses a custom-built PRISMATIC knowledge base [20], which encodes syntactic and shallow semantic relations derived from a large collection of text documents.

Answer Lookup

Early QA systems took the approach of translating natural language into formal machine language, such as first-order logic or Bayesian logic. They then either looked up the answer from a structured knowledge base or performed reasoning on known facts to derive the answer. Most of Watson’s QA capability does not depend on this approach, because the problem of translating natural language into a machine-understandable form has proven too difficult to do reliably. There are cases, however, where parts of a question can be translated into a machine-understandable form: for instance, when the question asks for a relation between the answer and a named entity in the question, such as the actorIn relations shown in Example (1). Watson turns that part of the question into a query against structured sources such as DBpedia [19] and the Internet Movie Database (IMDB) [21] in an attempt to instantiate the variable in each query (“flick” in the example). We call this capability Answer Lookup.

Effective Answer Lookup requires components that match entity names in questions to those in structured sources. More crucially, it directly depends on the quality and

quantity of semantic relations that can be identified in questions, and relation detection in text, while a problem studied for some time, remains an open issue. Broadly speaking, work in that area aims to develop recognition capabilities either manually or by statistical means.

For the DeepQA technology base, components have been developed using both recognition styles; these are described in [4]. The Watson system used only rule-based relation detection, largely because of performance constraints. For manual development of relation-bearing patterns, it is critical to analyze the domain: to identify the most frequently occurring relations in questions and to obtain appropriate structured sources that cover those relations.

Our approach, which is predicated on developing recognition grammars for each relation, entails an effort significant enough to drive us to focus only on the most frequent relations. However, the difficulty in automatic relation detection per se hampers our ability to automatically determine the most frequently occurring relations in a question set; in order to count them, one has to be able to detect them. As an approximation, we analyzed 20,000 randomly selected Jeopardy! questions with their correct answers against a few data sources we judged to match the Jeopardy! domain well. For each question, we looked for known relations in any of the chosen sources between a named entity in the question and the answer. The frequency of each relation in the sources linking a named entity in a question to an answer was aggregated over all 20,000 questions. Based on this approximation, Figure 2 shows the relative histogram of the most frequent Freebase [22] relations in these questions.

Figure 2

Approximate distribution of the 20 most frequent Freebase relations in 20,000 randomly selected Jeopardy! questions. (Modified and used, with permission, from D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, “Building Watson: an overview of the DeepQA project,” AI Magazine, vol. 31, no. 3, pp. 59–79, 2010; ©2010 Association for the Advancement of Artificial Intelligence.)

This ranked list of relations sets priorities for the implementation of relation recognizers in questions. In general, we are looking for English expressions of these

known relations between a named entity and the question focus, such as actorIn(Robert Redford, flick:focus) and actorIn(Paul Newman, flick:focus). The Answer Lookup component first tries to find the known named entity in the database from the relation argument string (e.g., “Robert Redford”) using a step called entity disambiguation and matching (EDM) [2]. If a matching entity is found, a query is generated to find movies starring the given entity. As with this example, if multiple relations are found in the question, each separate relation is individually processed and its results pooled. An identical Answer Lookup score is assigned to each search result, except for those results reinforced by multiple search queries, whose scores are boosted.

The top 20 relations in the Jeopardy! domain (see Figure 2) are predominantly entertainment and geographical relations, which together appear in approximately 11% of our questions. The relation detection recall is approximately 60%, and its precision is approximately 80%. The EDM step has a recall of approximately 80%, and the data sources cover approximately 80% of the relations we need. Multiplying these probabilities together (0.11 × 0.6 × 0.8 × 0.8 ≈ 0.04), we expect Answer Lookup to return the correct answer in approximately 4% of the questions. This estimate is validated by evaluation on a blind test set discussed later in this paper. The Answer Lookup score is discriminating only in a few cases, such as when multiple relation instances are detected and the intersection of the query results contains only one element, or when the relation instance detected has only one answer, such as when the question seeks the author of a book.
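As a concrete, hypothetical illustration of Answer Lookup, the sketch below instantiates the focus of the two actorIn relations from Example (1) against the public DBpedia SPARQL endpoint using the SPARQLWrapper client. Mapping actorIn to the dbo:starring property, and the two resource URIs, are assumptions standing in for the output of the EDM step; Watson’s own lookup code and source schemas are not described at this level in the paper.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def lookup_actor_in(actor_uri):
    """Return the set of films starring the given (already disambiguated) actor."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT ?film WHERE {{
            ?film <http://dbpedia.org/ontology/starring> <{actor_uri}> .
        }}
    """)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return {row["film"]["value"] for row in bindings}

# Results supported by both detected relations get a boosted Answer Lookup score.
redford_films = lookup_actor_in("http://dbpedia.org/resource/Robert_Redford")
newman_films = lookup_actor_in("http://dbpedia.org/resource/Paul_Newman")
boosted_candidates = redford_films & newman_films
```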

PRISMATIC search

PRISMATIC is a large-scale lexicalized relation resource automatically extracted from massive amounts of text [20]. PRISMATIC provides shallow semantic information derived from aggregating over syntactic or semantic relation usage in a large corpus, with generalization over entity types. For example, it gathers the aggregate statistics of relations extracted from syntactic parses, such as the frequency of the string Tom Cruise appearing as the subject of the verb star modified by a prepositional phrase “in ⟨movie⟩”. The aggregate statistics can be used to infer selectional restrictions and other semantics.

One of the semantic relations PRISMATIC records is the occurrence of isa relations.² This information is particularly useful for search and candidate generation because it can identify the most popular instances of a particular LAT. For example, consider the question “Unlike most sea animals, in the Sea Horse this pair of sense organs can move independently of one another”. Watson’s search strategies described above failed to retrieve the correct answer as a candidate since the question mentions a relatively obscure fact. PRISMATIC search, on the other hand, focuses on the LAT and its modifiers, in this example sense organ, and identifies up to 20 of the most popular instances of the LAT with modifiers from the PRISMATIC knowledge base. The correct answer, eyes, is the third most popular instance of sense organ and is returned by the PRISMATIC search component. PRISMATIC search rank and score features are associated with each result for final scoring purposes, in conjunction with additional features generated by other downstream answer scorers.
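A toy sketch of the isa-based lookup follows. The isa_counts table stands in for the aggregated isa statistics in the PRISMATIC knowledge base, and its frequencies are invented solely to mirror the “sense organ” example (with eyes as the third most frequent instance).

```python
from collections import Counter

def most_popular_instances(lat_with_modifiers, isa_counts, k=20):
    """Return up to k of the most frequent instances recorded for a type."""
    counts = isa_counts.get(lat_with_modifiers, Counter())
    return [instance for instance, _ in counts.most_common(k)]

# Invented frequencies, for illustration only.
isa_counts = {"sense organ": Counter({"ears": 42, "skin": 37, "eyes": 31, "nose": 18})}
print(most_popular_instances("sense organ", isa_counts))
# ['ears', 'skin', 'eyes', 'nose'] -- "eyes" enters the candidate pool
```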

Generating candidates from search results

Once relevant content is identified from Watson’s knowledge sources, candidate answers are extracted from the content. For content retrieved from structured resources, the retrieved results (which are the uninstantiated arguments in the query) are the candidate answers. For unstructured results from Document search and Passage search, additional processing is required to identify plausible candidate answers from those search results. As a baseline, we adopted a named entity recognizer from our TREC (Text REtrieval Conference) QA system [23] and produced as candidate answers all entities it recognizes as an instance of any of the more than 200 types in its type system. This candidate set is a superset of what would be produced as candidates using a type-based candidate generation approach [24, 25] with the same type system. Even with the superset, we found that the type-based approach does not produce high enough candidate recall on Jeopardy! questions [12].

This section describes three general-purpose candidate generation techniques applied to unstructured search results, which improve upon our baseline: Title of Document candidate generation, applied to Document search results only; Wikipedia Title candidate generation, relevant for Passage search results only; and Anchor Text candidate generation, which is appropriate for both types of search results. These three candidate generation strategies are not answer type dependent, apply to the vast majority of questions, and generate most of Watson’s candidate answers. The other candidate generation strategies, which we do not discuss in this paper, apply to a small number of questions or produce only a small number of candidates. Some of these strategies may be dependent on question type (e.g., questions seeking verb phrase answers or numeric answers), and others may rely on typographic cues.

Note that although the discussion below demonstrates how Watson uses Wikipedia metadata for candidate generation and the evaluation demonstrates how these strategies affect Watson’s performance on Jeopardy!, we have previously demonstrated that the same techniques also perform effectively on questions from the TREC QA track [12]. Furthermore, the techniques we developed can be easily

applied to leverage metadata from other title-oriented documents (for Title of Document and Wikipedia Title candidate generation) and collections of documents with entity-oriented hyperlinks (for Anchor Text candidate generation).

²Subclass relationships between objects; for instance, a dog “is a” mammal.

Title of Document candidate generation

Recall that Document search identifies title-oriented documents that, as a whole, best match the facts presented in the question. For these search results, the entity that constitutes the title of a matched document fits the description of the question and is thus proposed as a candidate answer.

To facilitate the process of answer typing [2] in downstream processing, candidate provenance information is recorded for each candidate, if possible, in order to disambiguate the candidate string. In particular, some scorers for answer typing use structured resources, such as DBpedia, whose entries can be disambiguated via Wikipedia uniform resource identifiers (URIs). For example, the candidate Naomi with provenance Naomi (Bible) suggests that the candidate is a person, whereas the same candidate with provenance Naomi (band) will be typed as a musical group. For candidates generated from Wikipedia documents, the distinction between multiple meanings is available in the candidate generation phase as it corresponds to the document actually retrieved in the search process. This information is recorded in metadata associated with the candidate.

Wikipedia Title candidate generation

We conducted an experiment to evaluate the coverage of Wikipedia articles on Jeopardy! questions and found that the vast majority of Jeopardy! answers are titles of Wikipedia documents [10]. Of the roughly 5% of Jeopardy! answers that are not Wikipedia titles, some included multiple entities, each of which is a Wikipedia title, such as Red, White, and Blue, whereas others were sentences or verb phrases, such as make a scarecrow or fold an American flag. The high coverage of Wikipedia titles over Jeopardy! answers suggests that they can serve as an excellent resource for candidate generation.

The Wikipedia Title candidate generation strategy extracts from a retrieved passage all noun phrases that are Wikipedia document titles and are not subsumed by other titles. These indicate topics in the passage that are sufficiently salient to warrant their own Wikipedia page and, we hypothesize, are worth considering as candidate answers. To identify Wikipedia titles in passages, we use the same Wikipedia title dictionary used to identify document titles in questions for TIC Passage search. The target document titles [e.g., Naomi (Bible)] are used as provenance information for each candidate to help with disambiguation in downstream scoring.
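A sketch of the extraction step, assuming noun phrases have already been identified in the passage and reusing the canonical-title dictionary from TIC Passage search; the subsumption test is simplified to a substring check.

```python
def wikipedia_title_candidates(noun_phrases, title_dictionary):
    """Keep noun phrases that are Wikipedia titles and are not contained in a
    longer matching title from the same passage."""
    matches = [np for np in noun_phrases if np in title_dictionary]
    return [np for np in matches
            if not any(np != other and np in other for other in matches)]

titles = {"Robert Redford": ["Robert Redford"], "Sting": ["Sting (musician)"],
          "The Sting": ["The Sting"]}
print(wikipedia_title_candidates(["Robert Redford", "Sting", "The Sting"], titles))
# ['Robert Redford', 'The Sting']  -- "Sting" is subsumed by "The Sting"
```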

Anchor Text candidate generation

Although Wikipedia Title candidate generation achieved high candidate recall for most passages, we hypothesized that linked metadata extracted from Wikipedia documents can be used to improve the precision of candidates extracted from Wikipedia passages. For example, consider the passage “Neapolitan pizzas are made with ingredients like San Marzano tomatoes, which grow on the volcanic plains south of Mount Vesuvius and Mozzarella di Bufala Campana, made with milk from water buffalo raised in the marshlands of Campania and Lazio.” Whatever the question for which the passage was retrieved, we expect terms or phrases such as “Neapolitan pizza”, “Mount Vesuvius”, and “water buffalo” to be plausible candidates and other terms such as “grow” and “ingredients” to be less likely as candidate answers. We observed that plausible candidates typically satisfy two criteria. First, they represent salient concepts in the passage. Second, the candidates have Wikipedia articles about them.

We observed that Wikipedia contains several types of metadata that, in aggregation, make up a set of plausible candidates that represent salient concepts for each document [12]. They include the following:

1. Anchor texts in the document.
2. Document titles of hyperlink targets (which are often synonyms of terms in 1).
3. Title of the current document.
4. Titles of redirect pages to the current document (which are often synonyms of 3).

The metadata is aggregated on a per-document basis, and the aggregated terms become plausible candidates for search results from that document. In other words, for each search result from a document, terms or phrases in the plausible candidate set for that document that appear in the search result are generated as candidate answers. In the above sample passage, the italicized terms satisfy one of the four criteria above and represent the candidate answers that would be generated from that passage. In Watson, Anchor Text candidate generation is applied to Wikipedia document titles from Document search as well as all passages retrieved from Wikipedia (in lieu of Wikipedia Title candidate generation, which has nearly identical candidate recall but significantly lower precision [12]).

For candidate provenance, we take advantage of the linked nature of these candidate answers to accurately identify the senses for candidates that can potentially be ambiguous. For candidate answers extracted from Anchor Text-based resources (1 and 2 in the list above), the target document is used to identify the sense of the candidate. For those extracted from document title or redirects [(3)

and (4) in the list above], the current document is used to disambiguate among multiple senses.
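A sketch of the per-document generation step, assuming the four metadata types listed above have been pre-aggregated into a simple record for the source document; the field names and the substring test are illustrative.

```python
def anchor_text_candidates(passage, doc_metadata):
    """Emit, as candidates, every phrase in the document's plausible-candidate
    set (anchor texts, link-target titles, the document title, and redirect
    titles) that actually appears in the retrieved passage."""
    plausible = (set(doc_metadata["anchor_texts"])
                 | set(doc_metadata["link_target_titles"])
                 | {doc_metadata["title"]}
                 | set(doc_metadata["redirect_titles"]))
    return [phrase for phrase in plausible if phrase in passage]
```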

Experimental evaluation

Experimental setup

To evaluate the impact of the search and candidate generation strategies described in this paper, we randomly selected 66 previously unseen Jeopardy! games. We excluded from these games special questions that require a tailored process for candidate generation, such as puzzle questions [7] and common bond questions [26], resulting in a test set of 3,344 questions. We evaluate the coverage of each search strategy paired with its corresponding candidate answer generation components and measure performance using the candidate binary recall metric for each strategy separately as well as all strategies combined. Candidate binary recall is computed as the percentage of questions for which the correct answer is generated as a candidate answer. We adopt this evaluation metric to reflect our goal of maximizing candidate recall at the Hypothesis Generation phase of the DeepQA pipeline.
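For reference, candidate binary recall over a test set can be computed as in the sketch below; gold-answer matching is reduced to exact string equality here, whereas a real evaluation needs more forgiving answer matching.

```python
def candidate_binary_recall(questions):
    """questions: iterable of (gold_answer, candidate_list) pairs."""
    questions = list(questions)
    hits = sum(1 for gold, candidates in questions if gold in candidates)
    return 100.0 * hits / len(questions)

# e.g., 2,915 of 3,344 questions with the answer in the pool -> 87.17%
print(round(candidate_binary_recall(
    [("The Sting", ["The Sting", "Paul Newman"]), ("eyes", ["ears", "skin"])]), 2))
```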

Results and discussion

The results of our experiments are shown in Table 1. For each search and candidate generation method, we computed the number of questions for which at least one result was returned, henceforth referred to as the active subset. For each approach, we also computed the average number of candidates generated per question in the active subset and the candidate binary recall computed over all questions.

TABLE 1  Search and candidate generation evaluation results.

Our results show that Document search and Passage search have very high coverage, returning candidates for all questions for Passage search and all but 1% of the questions for Document search.³ Although they both yield high candidate binary recall, that is, 74.43% and 79.40%, respectively, they also generate a fairly large number of candidates per question. Contrast that with Answer Lookup and PRISMATIC search, which are active on a much smaller set of questions and correspondingly have much lower overall binary recall. However, for these search strategies, only a small number of candidate answers are added to the pool. Note that Answer Lookup yielded a candidate binary recall of 3.53% on blind data, which is close to the 4% estimate in an earlier discussion.

³The 1% of questions in the inactive subset for the Document search pipeline are questions for which a numeric answer is sought. For these questions, a specialized number candidate generation component is employed.

The row labeled “Percentage unique” in Table 1 examines the unique contribution of each search and candidate generation approach, i.e., the loss in candidate binary recall if that strategy is ablated from the system. Our results show that although the degree of overlap is quite high among the different methods, each approach does make unique contributions to achieve an 87.17% combined candidate binary recall. We analyzed the 429 questions with candidate recall failures and found that roughly three-fourths of them are due to search failures. These questions typically fail because the question contains extraneous information that is relevant but not necessary for identifying the answer. For example, in “Star chef Mario Batali lays on the lardo, which comes from the back of this animal’s neck”, the essential part of the question is the segment in bold. However, the prominent entity “Mario Batali” steered search in the wrong direction and dominated our search results. For the other one-fourth, search returned a relevant passage, but our candidate generation strategies failed to extract the answer as a plausible candidate. In most of these examples, the correct answers are common nouns or verbs, which generally have lower coverage for both our Anchor Text and Wikipedia Title candidate generation strategies.

Finally, we examined the contribution of each search and candidate generation strategy to end-to-end QA performance. The last row in the table shows the QA accuracy when the candidates generated by each approach are scored by Watson’s full suite of answer scorers and are ranked. Our results show that Document search and Passage search achieve very comparable performance, about 9%

lower than full system performance. We ran an additional experiment in which only the full search query is issued to Indri Passage search to retrieve the ten most relevant passages. This is the search configuration closest to the search strategies adopted in many existing QA systems. When these candidates are scored and ranked, the system achieves an accuracy of 54.9%. These experimental results demonstrate the effectiveness of our search and candidate generation strategies, which, in aggregation, achieve an accuracy of 71.32%.

Related work

From the earliest days of artificial intelligence and NLP, the prevalent vision was that machines would answer questions by first “translating” human language into a machine representation and then matching that representation against background knowledge using a formal reasoning process [27, 28]. To date, however, no one has successfully produced such formal logical representations from unseen natural-language questions in a reliable way.

The first QA system to use structured data sources effectively for candidate generation was the START system [29], whose roots date back at least to 2000 [30]. Similar to our Answer Lookup approach, a retrieval-based QA system was augmented with a capability to recognize relations in questions, and structured sources were queried with the detected relation and question focus. As reported here, the gating factor in exploiting this for QA is the ability to detect the relations in the question.

In the Halo and Aura systems [31], closed-domain questions (on, e.g., chemistry) are answered using structured sources containing many axioms about processes and facts in these domains. Aura addresses classes of questions that are out of scope for Watson, having mainly to do with problem solving or procedural questions, such as “How many ml of oxygen is required to produce 50 ml of water?” However, the natural-language capability of these systems is constrained by question templates that the system can map to underlying structured queries.

Most existing open-domain retrieval-based QA systems adopt a pipeline of passage search against a reference corpus and generation of candidates of the expected answer type from the search results. From the search perspective, some systems have explored the use of web data for the purposes of both generating new candidate answers [32–34] and validating existing candidate answers [35, 36]. Furthermore, online encyclopedias such as Grolier** and Wikipedia have been used as corpora for several QA systems [37, 38] and in the CLEF (Cross Language Evaluation Forum) evaluation effort [39]. However, to our knowledge, these systems treated the new corpus as an extension of the newswire corpus used in earlier organized QA evaluation efforts and did not exploit its inherent characteristics to improve QA performance. In contrast, we analyzed the association between title-oriented encyclopedic documents and question/answer pairs to motivate two additional search strategies for QA: Document search and TIC Passage search. These search techniques are effective for title-oriented documents and have been shown to improve upon the results of traditional passage search alone.

From the candidate generation perspective, the vast majority of existing QA systems adopt a semantic-type-based approach to produce candidates that match the expected answer type on the basis of a static predefined type ontology [24, 25]. In contrast, Watson does not rely on such an ontology but utilizes document metadata, such as document titles, anchor texts, and titles of hyperlink target documents, to associate salient concepts with each document and thus to create a pool of plausible candidate answers. Although this approach generates a substantially larger set of candidate answers, we have found it to significantly outperform type-based methods in a broad-domain task such as Jeopardy! [12].

Conclusion

A crucial step in achieving high QA performance is to cast a wide enough net in the Hypothesis Generation phase to include the correct answer in the candidate pool. In this paper, we described Watson’s multipronged strategies for identifying relevant content and producing candidate answers that balance high candidate recall and processing time for candidate scoring.

In its Hypothesis Generation phase, Watson uses the results of question analysis to formulate effective queries to identify relevant content from both structured and unstructured resources. For search against textual resources, we extended the common passage search paradigm adopted by most existing QA systems in two ways. First, to take advantage of the relationship between the title and content of title-oriented documents, we adopted a three-pronged search strategy: Document search, TIC Passage search, and Passage search. Second, to increase diversity of search results, we employed two search engines in our Passage search component. We have empirically found that both extensions lead to higher candidate recall. For search against structured resources, Watson employs two strategies: Answer Lookup and PRISMATIC search. Answer Lookup relies on recognition of high-frequency semantic relations involving the focus and retrieves possible instantiations of the focus from existing structured knowledge sources such as DBpedia. PRISMATIC search focuses on isa relations mined from large corpora to produce salient instances of the LAT plus modifiers.

For candidate generation from unstructured search results, Watson does not rely on a predefined type ontology because of the broad range of answer types observed in Jeopardy! questions. To identify plausible candidates, Watson uses document metadata such as document titles, anchor texts, and

Wikipedia redirects to identify salient concepts associated with each document. These metadata are used for three candidate generation strategies: Title of Document candidate generation, applied to Document search results; Wikipedia Title candidate generation, applied to Passage search results; and Anchor Text candidate generation, applied to document titles and passages from Wikipedia documents. Whenever possible, candidate answers carry metadata that encode provenance information to help downstream scorers disambiguate the word sense of the candidate.

Evaluation on a blind set of more than 3,000 questions shows that the different search/candidate generation strategy pairs have different performance characteristics. Those targeting unstructured resources generate a larger number of candidates per question at high recall, whereas those focusing on structured resources produce far fewer candidates per question and have a much smaller active set of questions. Overall, all strategies make unique positive contributions, yielding a combined 87.17% in candidate binary recall.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Jeopardy Productions, Inc., Wikimedia Foundation, or Grolier Incorporated in the United States, other countries, or both.

References
1. J. W. Murdock, J. Fan, A. Lally, H. Shima, and B. K. Boguraev, "Textual evidence gathering and analysis," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 8, pp. 8:1–8:14, May/Jul. 2012.
2. J. W. Murdock, A. Kalyanpur, C. Welty, J. Fan, D. A. Ferrucci, D. C. Gondek, L. Zhang, and H. Kanayama, "Typing candidate answers using type coercion," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 7, pp. 7:1–7:13, May/Jul. 2012.
3. A. Kalyanpur, B. K. Boguraev, S. Patwardhan, J. W. Murdock, A. Lally, C. Welty, J. M. Prager, B. Coppola, A. Fokoue-Nkoutche, L. Zhang, Y. Pan, and Z. M. Qiu, "Structured data and inference in DeepQA," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 10, pp. 10:1–10:14, May/Jul. 2012.
4. C. Wang, A. Kalyanpur, J. Fan, B. K. Boguraev, and D. C. Gondek, "Relation extraction and scoring in DeepQA," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 9, pp. 9:1–9:12, May/Jul. 2012.
5. D. C. Gondek, A. Lally, A. Kalyanpur, J. W. Murdock, P. Duboue, L. Zhang, Y. Pan, Z. M. Qiu, and C. Welty, "A framework for merging and ranking of answers in DeepQA," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 14, pp. 14:1–14:12, May/Jul. 2012.
6. A. Lally, J. M. Prager, M. C. McCord, B. K. Boguraev, S. Patwardhan, J. Fan, P. Fodor, and J. Chu-Carroll, "Question analysis: How Watson reads a clue," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 2, pp. 2:1–2:14, May/Jul. 2012.
7. J. M. Prager, E. W. Brown, and J. Chu-Carroll, "Special questions and techniques," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 11, pp. 11:1–11:13, May/Jul. 2012.
8. M. C. McCord, J. W. Murdock, and B. K. Boguraev, "Deep parsing in Watson," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 3, pp. 3:1–3:15, May/Jul. 2012.
9. D. A. Ferrucci, "Introduction to 'This is Watson'," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 1, pp. 1:1–1:15, May/Jul. 2012.
10. J. Chu-Carroll, J. Fan, N. Schlaefer, and W. Zadrozny, "Textual resource acquisition and engineering," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 4, pp. 4:1–4:11, May/Jul. 2012.
11. S. Tellex, B. Katz, J. Lin, A. Fernandes, and G. Marton, "Quantitative evaluation of passage retrieval algorithms for question answering," in Proc. 26th ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2003, pp. 41–47.
12. J. Chu-Carroll and J. Fan, "Leveraging Wikipedia characteristics for search and candidate generation in question answering," in Proc. 25th Conf. AAAI, 2011, pp. 872–877.
13. Indri Search Engine. [Online]. Available: http://www.lemurproject.org/indri.php
14. Apache Lucene. [Online]. Available: http://lucene.apache.org
15. D. Metzler and W. B. Croft, "Combining the language model and inference network approaches to retrieval," Inform. Process. Manage. Special Issue Bayesian Netw. Inform. Retrieval, vol. 40, no. 5, pp. 735–750, Sep. 2004.
16. E. Hatcher and O. Gospodnetic, Lucene in Action. Greenwich, CT: Manning Publ. Co., 2004.
17. D. Cohen, E. Amitay, and D. Carmel, "Lucene and Juru at TREC 2007: 1-Million queries track," presented at the Text REtrieval Conf., Gaithersburg, MD, 2007, Special Publ. 500-274.
18. D. Carmel, E. Farchi, Y. Petruschka, and A. Soffer, "Automatic query refinement using lexical affinities with maximal information gain," in Proc. 25th ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2002, pp. 283–290.
19. DBPedia. [Online]. Available: http://www.dbpedia.org
20. J. Fan, A. Kalyanpur, D. C. Gondek, and D. A. Ferrucci, "Automatic knowledge extraction from documents," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 5, pp. 5:1–5:10, May/Jul. 2012.
21. Internet Movie Database. [Online]. Available: http://www.imdb.com
22. FreeBase. [Online]. Available: http://www.freebase.com/
23. J. Prager, J. Chu-Carroll, E. Brown, and K. Czuba, "Question answering using predictive annotation," in Advances in Open-Domain Question Answering, T. Strzalkowski and S. Harabagiu, Eds. Norwell, MA: Kluwer, 2006, pt. 4, pp. 307–347.
24. J. Prager, E. Brown, A. Coden, and D. Radev, "Question answering by predictive annotation," in Proc. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2000, pp. 184–191.
25. D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, and V. Rus, "The structure and performance of an open-domain question answering system," in Proc. 38th Annu. Meeting Assoc. Comput. Linguistics, 2000, pp. 563–570.
26. J. Chu-Carroll, E. W. Brown, A. Lally, and J. W. Murdock, "Identifying implicit relationships," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 12, pp. 12:1–12:10, May/Jul. 2012.
27. E. Rich, Artificial Intelligence. New York: McGraw-Hill, 1991.
28. R. V. Guha and D. B. Lenat, "CYC: A mid-term report," AI Mag., vol. 11, no. 3, pp. 32–59, 1990.
29. B. Katz, G. Marton, G. Borchardt, A. Brownell, S. Felshin, D. Loreto, J. Louis-Rosenberg, B. Lu, F. Mora, S. Stiller, O. Uzuner, and A. Wilcox, "External knowledge sources for question answering," presented at the 14th Annual Text REtrieval Conf. (TREC), Gaithersburg, MD, 2005. [Online]. Available: http://www.mendeley.com/research/external-knowledge-sources-question-answering/
30. B. Katz and J. Lin, "REXTOR: A system for generating relations from natural language," in Proc. ACL 2000 Workshop NLP&IR, 2000, pp. 67–77.
31. D. Gunning, V. Chaudhri, P. Clark, K. Barker, S. Chaw, M. Greaves, B. Grosof, A. Leung, D. McDonald, S. Mishra, J. Pacheco, B. Porter, A. Spaulding, D. Tecuci, and J. Tien. (2010). Project Halo update: Progress toward digital Aristotle. AI Mag. [Online]. 31(3). Available: http://www.cs.utexas.edu/users/pclark/papers/AURA-AIMag.pdf
32. C. Clark, G. Cormack, T. Lynam, C. Li, and G. McLearn, "Web reinforced question answering (MultiText Experiments for TREC 2001)," in Proc. Text REtrieval Conf., 2001, pp. 673–679. [Online]. Available: http://trec.nist.gov/pubs/trec10/papers/mtA.pdf
33. S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng, "Web question answering: Is more always better?" in Proc. 25th ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2002, pp. 291–298.
34. B. Katz, J. Lin, D. Loreto, W. Hildebrandt, M. Bilotti, S. Felshin, A. Fernandes, G. Marton, and F. Mora, "Integrating web-based and corpus-based techniques for question answering," in Proc. 12th Text REtrieval Conf., 2003, pp. 426–435.
35. B. Magnini, M. Negri, R. Prevete, and H. Tanev, "Is it the right answer? Exploiting web redundancy for answer validation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 425–432.
36. J. Ko, L. Si, and E. Nyberg, "A probabilistic framework for answer selection in question answering," in Proc. NAACL-HLT, 2007, pp. 524–531.
37. J. Kupiec, "MURAX: A robust linguistic approach for question answering using an online encyclopedia," in Proc. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 1993, pp. 181–190.
38. D. Ahn, V. Jijkoun, G. Mishne, K. Muller, M. de Rijke, and S. Schlobach, "Using Wikipedia at the TREC QA track," in Proc. Text REtrieval Conf., 2004. [Online]. Available: http://www.mendeley.com/research/using-wikipedia-in-the-trec-qa-track/
39. D. Giampiccolo, P. Forner, A. Penas, C. Ayache, D. Cristea, V. Jijkoun, P. Osenova, P. Rocha, B. Sacaleanu, and R. Sutcliffe, "Overview of the CLEF 2007 multilingual QA track," in Advances in Multilingual and Multimodal Information Retrieval, C. Peters, V. Jijkoun, T. Mandl, H. Müller, D. W. Oard, A. Peñas, V. Petras, and D. Santos, Eds. Berlin, Germany: Springer-Verlag, 2007, pp. 200–236.

Received July 26, 2011; accepted for publication January 9, 2012

Jennifer Chu-Carroll IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Chu-Carroll is a Research Staff Member in the Semantic Analysis and Integration Department at the T. J. Watson Research Center. She received the Ph.D. degree in computer science from the University of Delaware in 1996. Prior to joining IBM in 2001, she spent 5 years as a Member of Technical Staff at Lucent Technologies Bell Laboratories. Dr. Chu-Carroll's research interests are in the area of natural-language processing, more specifically in question-answering and dialogue systems. Dr. Chu-Carroll serves on numerous technical committees, including as program committee co-chair of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT) 2006 and as general chair of NAACL HLT 2012.

James Fan IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Fan is a Research Staff Member in the Semantic Analysis and Integration Department at the T. J. Watson Research Center, Yorktown Heights, NY. He joined IBM after receiving the Ph.D. degree at the University of Texas at Austin in 2006. He is a member of the DeepQA Team that developed the Watson question-answering system, which defeated the two best human players on the quiz show Jeopardy!. Dr. Fan is author or coauthor of dozens of technical papers on subjects of knowledge representation, reasoning, natural-language processing, and machine learning. He is a member of the Association for Computational Linguistics.

Branimir K. Boguraev IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Boguraev is a Research Staff Member in the Semantic Analysis and Integration Department at the Thomas J. Watson Research Center. He received the Engineering degree in electronics from the Higher Institute for Mechanical and Electrical Engineering, Sofia, Bulgaria, in 1974, and the Diploma degree in computer science in 1976 and the Ph.D. degree in computational linguistics in 1980, both from the University of Cambridge, Cambridge, U.K. He worked on a number of U.K./E.U. research projects on infrastructural support for natural-language processing applications before joining IBM Research in 1988 to work on resource-rich text analysis. From 1993 to 1997, he managed the natural-language program at Apple's Advanced Technologies Group, returning to IBM in 1998 to work on language engineering for large-scale business content analysis. Most recently, he has worked, together with the Jeopardy! Challenge Algorithms Team, on developing technologies for advanced question answering. Dr. Boguraev is author or coauthor of more than 120 technical papers and 15 patents. Until recently, he was the Executive Editor of the Cambridge University Press book series Studies in Natural Language Processing. He has also been a member of the editorial boards of Computational Linguistics and the Journal of Semantics, and he continues to serve as one of the founding editors of the Journal of Natural Language Engineering. He is a member of the Association for Computational Linguistics.

David Carmel IBM Research Division, Haifa Research Lab, Haifa, Israel, 31905 ([email protected]). Dr. Carmel is a Research Staff Member in the Information Retrieval Group at the IBM Haifa Research Lab. His research is focused on search in the enterprise, query performance prediction, social search, and text mining. For several years, he taught the introduction to information retrieval course in the Computer Science Department at Haifa University, Haifa, Israel. At IBM, Dr. Carmel is a key contributor to IBM enterprise search offerings. He is a cofounder of the Juru search engine, which provides integrated search capabilities for several IBM products and was used as a search platform for several studies in the TREC conferences. He has published more than 80 papers in information retrieval and web journals and conferences, and he serves on the editorial board of the Information Retrieval journal and as a senior program committee member or an area chair of many conferences (Special Interest Group on Information Retrieval [SIGIR], WWW, Web Search and Data Mining [WSDM], and Conference on Information and Knowledge Management [CIKM]). He organized a number of workshops and taught several tutorials at SIGIR and WWW. He is coauthor of the book Estimating the Query Difficulty for Information Retrieval, published by Morgan & Claypool in 2010, and he is the coauthor of the paper "Learning to Estimate Query Difficulty," which won the Best Paper Award at SIGIR 2005. He earned his Ph.D. degree in computer science from the Technion, Israel Institute of Technology, in 1997.

Dafna Sheinwald IBM Research Division, Haifa Research Lab, Haifa, Israel, 31905 ([email protected]). Dr. Sheinwald is a Research Staff Member in the Information Retrieval Group in the IBM Haifa Research Lab. She received the D.Sc. degree from the Technion-Israel Institute of Technology, Haifa, and subsequently joined IBM at the Haifa Research Lab, working for two years at the IBM Almaden Research Center, San Jose, CA. She has worked on data compression for computer systems and, in recent years, on information retrieval in the enterprise, directly contributing to several related IBM products. She has taught the introductory course on information theory at the computer science department of Haifa University, and she has served on the program committee of the annual IEEE Data Compression Conference and the committee of the IEEE Conference on Software Science, Technology, and Engineering (SWSTE).

Chris Welty IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Welty is a Research Staff Member in the Semantic Analysis and Integration Department at the T. J. Watson Research Center. He received the Ph.D. degree in computer science from Rensselaer Polytechnic Institute, Troy, NY, in 1995. He joined IBM in 2002, after spending 6 years as a professor at Vassar College, Poughkeepsie, NY, and has worked and published extensively in the areas of ontology, natural-language processing, and the Semantic Web. In 2011, he served as program chair for the International Semantic Web Conference, and he is on the editorial boards of the Journal of Web Semantics, the Journal of Applied Ontology, and AI Magazine.
