High Precision Interactive Question Answering (HIREQA-ML)
Language Computer Corporation
Sanda Harabagiu, PI
John Lehmann, John Williams, Finley Lacatusu, Andrew Hickl, Robert Hawes, Paul Aarseth, Luke Nezda, Jeremy Bensley, Patrick Wang, Seth Cleveland and Ricardo Ruiz
Innovative Claims
• Multi-Strategy Q/A √
• Innovations in Question Decomposition and Answer Fusion
• Bootstrapping Q/A √
• Predictive Interactive QA √
• Recognition of Incorrect and Missing Answers
• Processing Negation in Q/A
Multi-Strategy Q/A
• Our fundamental premise is that progress in Q/A cannot be achieved only by enhancing the processing components; it also requires generating the best strategy for processing each individual question.
• Current pipeline approach:
  • Question processing
  • Passage retrieval
  • Answer selection
• Complex questions need processing in which we also consider:
  • The context/topic of the questioning
  • The previous interactions
Multi-Strategy Question Answering
• Strategies based on question type, question focus, question topic, paraphrases
• Strategies that impose passage retrieval by normalized keyword selection and web relevance
• Strategies that extract and fuse answers based on kernel methods/counter-training
Answer Resolution Strategies

[Architecture diagram: An English question undergoes Question Analysis and Question Decomposition, guided by Question Processing Strategies and the interactive-QA user background. Multiple Passage Retrieval strategies (1…n) feed multiple Answer Selection strategies (1…n); Answer Fusion combines their outputs into an English answer. The strategies draw on 1 million question/answer pairs and on Counter-Training for Answer Extraction.]
Problems Addressed
• Flexible architecture
  • Enables the implementation of multiple strategies
• Palantir
• Predictive Q/A – Dialog Architecture: Ferret
• Large set of question/answer pairs: 1 million mark
• Evaluations:
  1. TREC
  2. QA with Relations
  3. Dialog Evaluations in the ARDA Challenge Workshop
Palantir
• Rewritten in Java – before the project started
• ~70,000 lines of code
• 139 named entities
• 150 answer types
• Allows for:
  • Multiple forms of question analysis
  • Multiple passage retrieval strategies
  • Multiple answer selection methods
  • Incorporation of context in question processing and answer selection
  • Modeling of topic for Q/A in restricted domains
What is a topic?
• A topic represents an information need that can be cast into a template representation.
• Inherently similar to information extraction tasks:
  • Management Succession
  • Natural Disasters
  • Market Change
  • Movement of People
What is a scenario?
• A scenario is a description of a problem that requires a brief 1-2 page report that clearly identifies findings, justifications, and gaps in the information.
• It includes a number of subject areas, including:
  • Country Profile
  • Government: Type, Leadership, Relations
  • Military Organization and Operations: Army, Navy, Air Force, Leaders, Capabilities, Intentions
  • Allies/Partners: Countries, Coalition Members, Forces
  • Weapons: Conventional, Chemical, Biological, Materials, Facilities, Stockpiles, Access, Research Efforts, Scientists
Meta-Template
Scenario Example
Technology
Assistance
Production
Stockpiles
Exports
As terrorist activity in Egypt increases, the Commander of the United States Army believes a better understanding of Egypt’s military capabilities is needed.
Egypt’s biological weapons database needs to be updated to correspond with the Commander’s request.
Focus your investigation on Egypt’s access to old technology, assistance received from the Soviet Union for development of their pharmaceutical infrastructure, production of toxins and BW agents, stockpiles, exportation of these materials and development technology to Middle Eastern countries, and the effect that this information will have on the United States and Coalition Forces in the Middle East.
Please incorporate any other related information to your report.
Examples of Questions
Egypt’s Biological Weapons Stockpiles
What biological weapons agents may be included in Egypt’s BW stockpiles?
Where did Egypt obtain its first stockpiles of chemical weapons?
Is there evidence that Egypt has dismantled its stockpiles of chemical and biological weapons?
Did Egypt get rid of its stockpiles of chemical and biological weapons?
Will Egypt dismantle its CBW stockpiles?
Relation Extraction and Q/A
• A slot in a template/templette represents an implicit relation to a topic.
  • Example: Egypt – Stockpile (CBW)
• These relations encode a set of predications that are relevant to the relation:
  • Dismantle: <Arg0 = Egypt, Arg1 = Stockpile (CBW)>
  • Inherit: <Arg0 = Egypt, Arg1 = Stockpile, Arg2 = British>
• These predications need to be discovered and associated with question-answer pairs.
Creating Question/Answer Pairs
How do we create question/answer pairs?
• Convert each template slot to a topic theme.
• Use incremental topic representations to discover all available predications.
• Select the text surrounding each predication as the answer of the pair.
• Generate questions by automatically paraphrasing the selected answer.
Large Set of Question-Answer Pairs
• Used human-generated question/answer pairs for the CNS data
  • Result: 5,134 pairs
• Harvested question/answer pairs from funtrivia.com
  • Over 600,000 pairs
• Created manual paraphrases of all questions evaluated in TREC to date
Efficient Topic Representation as Topic Signatures
• Topics may be characterized through a lexically determined topic signature (Lin and Hovy 2000)
  • TS = {topic, <(t1,w1), (t2,w2), …, (tn,wn)>}
  • where each term ti is tightly correlated with the topic, with an associated weight wi
• Example:
  • TS = {“terrorism”, <(bomb, 0.92), (attack, 0.89), (killing, 0.83), …>}
  • The terms and weights are determined using the likelihood ratio
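As a rough sketch of how such likelihood-ratio signature terms could be computed, the following compares each term's rate in topic documents against a background collection. The function names and the exact binomial log-likelihood-ratio statistic are illustrative assumptions, not LCC's implementation:

```python
import math
from collections import Counter

def _log_l(k, n, p):
    # Binomial log-likelihood log L(p; k, n), clamping p away from 0 and 1.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def topic_signature(topic_docs, background_docs, top_n=10):
    """Rank terms by the -2 log(lambda) likelihood-ratio statistic
    comparing their rate in topic documents vs. the background."""
    topic_counts = Counter(w for doc in topic_docs for w in doc)
    bg_counts = Counter(w for doc in background_docs for w in doc)
    n1, n2 = sum(topic_counts.values()), sum(bg_counts.values())
    scored = []
    for term, k1 in topic_counts.items():
        k2 = bg_counts.get(term, 0)
        p = (k1 + k2) / (n1 + n2)        # null hypothesis: one common rate
        p1 = k1 / n1
        p2 = k2 / n2 if n2 else 0.0
        # -2 log(lambda): how much better two separate rates fit the data.
        llr = 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                   - _log_l(k1, n1, p) - _log_l(k2, n2, p))
        scored.append((term, llr))
    scored.sort(key=lambda tw: tw[1], reverse=True)
    return scored[:top_n]
```

Terms that are frequent in the topic documents but rare in the background receive the highest scores and become the signature terms ti, with the statistic serving as the weight wi.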
Our Findings
• Topics are not characterized only by terms; there are also relations between topics and concepts that need to be identified.
• Assumption (Harabagiu 2004): the large majority of topic-relevant relations take place between verbs/nominalizations and other nouns.
• Topic representation can be produced in two iterations:
  1. TS1 = {topic, <(t1,w1), (t2,w2), …, (tn,wn)>}
  2. TS2 = {topic, <(r1,w1), (r2,w2), …, (rm,wm)>}
  where each ri is a binary relation between two topic concepts.

How do we determine the relations ri?
The idea: Start with a seed relation rs
Discover new relations relevant to the topic
Selecting the seed relation
• Three-step procedure:
  1. Morphological expansion of all lexemes relevant to the topic
  2. Semantic normalization (based on ontological resources)
  3. Selection of the predominant [predicate-argument] relation that is syntactically expressed as a V-Subject, V-Object, or V-PP (attachment)
Example of Seed Relations
Topic Seed Relation
Natural Disaster [hit-tornado]
Deaths [kill-PERSON]
Bombings [explode-bomb]
Market Changes [raise-QUANTITY]
Court Cases [accuse-crime]
Illness Outbreaks [spread-infection]
Medical Research [discover-cure]
Elections [campaign-PERSON]
Movement of People [fly-LOCATION]
Topic Relations
• Two forms of topic relations are considered:
  • Syntax-based relations between the VP and its subject, object, or prepositional attachment
  • C-relations, representing relations between events and entities that cannot be identified by syntactic constraints
• C-relations are motivated by:
  • Frequent collocations of certain nouns with the topic verbs or nominalizations
  • An approximation of intra-sentential centering introduced in (Kameyama, 1997)
Motivating Example
The model of Discovering Topic Relations
• Step 1: Retrieve relevant passages
  • Use the seed relation and any newly discovered relations
• Step 2: Generate candidate relations
  • Two types: syntax-based relations and salience-based relations
• Step 3: Rank the relevance of the relations
• Step 4: Add the relevant relations to the topic representation
• Step 5: Determine the continuation/stop condition (counter-training, Yangarber 2003)
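The five steps can be sketched as a bootstrapping loop. The `retrieve`, `generate_candidates`, and `rank` callbacks and the 0.7 acceptance threshold are placeholders standing in for the components described on the surrounding slides, not the actual system:

```python
def discover_topic_relations(seed, corpus, retrieve, generate_candidates,
                             rank, max_iterations=10, min_score=0.7):
    """Bootstrap topic relations from one seed relation.
    Each iteration performs the five steps described above."""
    accepted = {seed}
    for _ in range(max_iterations):
        # Step 1: retrieve passages matching any relation found so far.
        passages = retrieve(corpus, accepted)
        # Step 2: propose syntax- and salience-based candidate relations.
        candidates = generate_candidates(passages) - accepted
        # Step 3: rank the candidates by relevance to the topic.
        scored = rank(candidates, passages)
        # Step 4: add only candidates above the relevance threshold.
        new = {r for r, score in scored if score >= min_score}
        # Step 5: stop when an iteration yields nothing new
        # (counter-training would additionally stop on cross-topic overlap).
        if not new:
            break
        accepted |= new
    return accepted
```

The loop terminates either after a fixed number of iterations or as soon as no new relation clears the threshold, which is the simplest form of the continuation/stop condition.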
Syntax-Based Relations
• From every document in Dq, extract all:
  • Verb-subject, verb-object, and verb-PP-attachment relations
  • Recognized by an FSA-based parser
• Expand syntax-based relations:
  • Replace each word with its root form
    • “wounded” -> “wound”; “trucks” -> “truck”
  • Replace each word with any of the concepts that subsume it in a hand-crafted ontology
    • “truck” -> VEHICLE; “truck” -> ARTIFACT
  • Replace each named entity with its name class
    • “Bank of America” -> ORGANIZATION
Expansion of Relations
exploded truck:
1. explode truck
2. explode VEHICLE
3. explode ARTIFACT
4. explode OBJECT
5. EXPLODE_WORD truck
6. EXPLODE_WORD VEHICLE
7. EXPLODE_WORD ARTIFACT
8. EXPLODE_WORD OBJECT

exploded Colombo:
1. explode Colombo
2. explode CITY_NAME
3. explode LOCATION_NAME
4. EXPLODE_WORD Colombo
5. EXPLODE_WORD CITY_NAME
6. EXPLODE_WORD LOCATION_NAME
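A minimal sketch of this expansion procedure, assuming simple lookup tables for root forms, the hand-crafted ontology, and named-entity name classes (stand-ins for the actual parser and ontology output):

```python
def expand_relation(verb, arg, root_of, ontology, name_classes=()):
    """Generate the expanded forms of one verb-argument relation:
    root forms, ontology ancestors, and named-entity name classes.
    `root_of` and `ontology` are hypothetical lookup tables."""
    verb_root = root_of.get(verb, verb)
    if name_classes:
        # Named entities keep their surface form and add name classes,
        # e.g. "Colombo" -> CITY_NAME, LOCATION_NAME.
        arg_forms = [arg, *name_classes]
    else:
        arg_root = root_of.get(arg, arg)
        # Walk the hand-crafted ontology upward from the root form.
        arg_forms = [arg_root, *ontology.get(arg_root, [])]
    expansions = []
    for a in arg_forms:
        expansions.append((verb_root, a))
        # The verb also generalizes to its word class, e.g. EXPLODE_WORD.
        expansions.append((verb_root.upper() + "_WORD", a))
    return expansions
```

With `root_of = {"exploded": "explode", "trucks": "truck"}` and `ontology = {"truck": ["VEHICLE", "ARTIFACT", "OBJECT"]}`, the call `expand_relation("exploded", "trucks", root_of, ontology)` yields the eight "exploded truck" forms listed above.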
Salience-Based C-Relations
• Additional topic relations may be discovered within a salience window around each verb: a window of k=2 sentences preceding and succeeding the verb.
• The NPs of each salience window are extracted and ordered.
• Basic hypothesis: C-relations between a verb and an entity from its domain of relevance are similar to anaphoric relations between entities in texts.
Candidate C-relations
• In each salience window, [Trigger Verb -> NPi] relations are:
  • Created
  • Expanded similarly to what is done for syntax relations
• Caveat: when considering the expansion for [Trigger Verb -> NPj], the expansion is allowed only if it was not already introduced by another expansion [Trigger Verb -> NPk] with k < j.
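The window extraction and the first-occurrence caveat can be sketched as follows. Here `noun_phrases` is a hypothetical per-sentence NP extractor, and the `seen` set stands in for the k < j restriction on repeated expansions:

```python
def candidate_c_relations(sentences, verb_index, noun_phrases, k=2):
    """Collect candidate C-relation arguments [trigger verb -> NP] from
    a salience window of k sentences before and after the trigger verb,
    keeping each NP only the first time it appears in the window."""
    lo = max(0, verb_index - k)
    hi = min(len(sentences), verb_index + k + 1)
    seen = set()
    candidates = []
    for i in range(lo, hi):
        for np in noun_phrases(sentences[i]):
            if np not in seen:   # skip expansions already introduced
                seen.add(np)
                candidates.append(np)
    return candidates
```

Each returned NP would then be paired with the trigger verb and expanded exactly like a syntax-based relation.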
Step 3: Rank Candidate Relations

Following the method introduced in AutoSlog (Riloff 1996), each relation is ranked based on its:
• Relevance Rate
• Frequency: the number of times a relation is identified in R (in a single document, one relation may be identified multiple times)

Relevance = Relevance Rate × Frequency

Relations with Relevance Rate < 0.7 are discarded. Only relations with Frequency ≥ 0.01 and Count / MaxCount ≥ 0.4 are considered, where Count is the maximum number of times the relation is recognized in any given document.
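Assuming the slide's thresholds (Relevance Rate ≥ 0.7, Frequency ≥ 0.01, Count/MaxCount ≥ 0.4) and a final ranking by Relevance Rate × Frequency, the filter can be sketched as follows; the dictionary keys are illustrative names, not the system's data model:

```python
def filter_and_rank(relations, rel_rate_min=0.7, freq_min=0.01,
                    count_ratio_min=0.4):
    """Apply the AutoSlog-style thresholds and rank surviving relations
    by Relevance Rate x Frequency. Each relation is a dict with
    'relevance_rate', 'frequency', and 'count' (the max per-document
    occurrence count)."""
    max_count = max((r["count"] for r in relations), default=0) or 1
    kept = [
        r for r in relations
        if r["relevance_rate"] >= rel_rate_min
        and r["frequency"] >= freq_min
        and r["count"] / max_count >= count_ratio_min
    ]
    # Rank the survivors by the combined relevance score.
    kept.sort(key=lambda r: r["relevance_rate"] * r["frequency"],
              reverse=True)
    return kept
```

A relation that is highly relevant but seen in only one document (low Count/MaxCount), or frequent but weakly topic-correlated (low Relevance Rate), is filtered out before ranking.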
Results for the identification of topic relations
| TOPICS | Syntax-Based Relations (P / R / F1) | Syntax-Based + C-Relations (P / R / F1) | Handcrafted Relations (P / R / F1) |
| NATURAL DISASTERS | 77.9% / 60.4% / 68.1% | 65.3% / 88.9% / 75.3% | 70.5% / 74.6% / 72.5% |
| DEATHS | 63.9% / 65.5% / 60% | 51.5% / 80.78% / 62.9% | 78.2% / 44.8% / 57.3% |
| BOMBINGS | 66.2% / 40.8% / 50.5% | 61.3% / 58.2% / 59.7% | 76.9% / 45.2% / 56.8% |
| MARKET CHANGES | 78.4% / 85.9% / 82% | 80.3% / 84.4% / 82.3% | 80.3% / 83.5% / 81.9% |
| COURT CASES | 55.7% / 37% / 44.5% | 40.3% / 73.67% / 52.1% | 76.2% / 35.3% / 48.3% |
| ILLNESS OUTBREAKS | 72.4% / 59.46% / 65.3% | 70.3% / 79% / 74.4% | 71.5% / 75.4% / 73.4% |
| MEDICAL RESEARCH | 78.4% / 75.2% / 76.8% | 77.1% / 82% / 79.5% | 78.2% / 76% / 77% |
| ELECTIONS | 68.8% / 73.1% / 70.9% | 60.2% / 97% / 74.3% | 73.5% / 69.4% / 71.4% |
| MOVEMENT OF PEOPLE | 65.3% / 64.5% / 64.9% | 64.8% / 89.5% / 75.2% | 77.2% / 63% / 69.4% |
| Micro-Average | 69.6% / 62.4% / 64.7% | 63.4% / 81.4% / 70.6% | 75.8% / 63% / 67.5% |
FERRET Overview
• LCC’s Ferret dialogue system was designed to study analysts’ extended interaction with:
  • Current “best practices” in automatic question answering
    • A production-quality Q/A system
  • Extensive domain-specific knowledge
    • Human-created question/answer pairs from the document collection
  • A user interface that mimics common web-browsing applications
    • Reduced time-to-learn and overall task complexity
The story so far
Experts need access to high-precision Q/A systems.
Novices need access to sources of domain-specific knowledge.
Systems need to account for varying levels of expertise “on the fly”.
Systems should provide sophisticated information and be easy to use.
Systems should make use of analysts’ existing computer and research skills.
Systems need not be overprovisioned with functionality to be effective.
Systems need to seamlessly integrate multiple sources of information.
We hypothesize that these four goals can be addressed through a dialogue interface combining automatic Q/A with a knowledge base.
Designing Ferret
Three overarching design principles for Ferret:

• High-Precision Automatic Q/A (Palantir):
  • Offer users full-time use of the state-of-the-art Palantir automatic Q/A system.
  • Return both automatically generated answers and full document contexts.
• Question-Answer Base (QUAB):
  • Provide users with an extensive domain-specific knowledge base.
  • Organize information into human-crafted question-answer pairs.
  • Return both identified answers and full document contexts.
• Simple Integrated Interface:
  • Provide both automatically generated and QUAB answers simultaneously.
  • Mimic the functionality of browser applications that users will be familiar with.
  • Reduce confusion and time to learn by providing a minimum of “extra” tools.
Two Kinds of Answers
• Ferret provides two kinds of answers to questions:
  • Answers derived from LCC’s Palantir automatic question-answering system
  • Answers extracted from a human-generated “question-answer base” (QUAB) containing over 5,000 domain-specific question-answer pairs created from the workshop document collection
Example: the question “Does Al-Qaeda have biological or chemical weapons?” is answered by both the High-Precision Automatic Q/A system and the Question-Answer Base.
Potential Challenges for QUAB
…and some challenges as well:

• Information Content: Will developers be able to identify information that will be consistently useful to analysts?
• Scope and Coverage: How many questions does it take to get adequate coverage? How much time does it take to create such a collection?
• Relevance: How do you determine which QUAB pairs should be returned for a user’s query?
• Adoption: Will analysts accept information provided by non-expert users?
• Integration: How do you add this new source of information to an existing interactive Q/A architecture?
Building the QUAB Collection
• A team of 6 developers (with no particular expertise in the domain topics) was tasked with creating the QUAB collection.
• For each scenario, developers were asked to identify passages in the document collection which might prove useful to someone conducting research on the domain.
• Once these snippets were extracted, developers created a question which could be answered by the text passage.
In this volatile region, the proliferation of NBC weapons and the means to deliver them poses a significant challenge to our ability to achieve these goals. Iran, Iraq, and Libya are aggressively seeking NBC weapons and missile capabilities,constituting the most pressing threats to regional stability.
Topic: Libya’s CBW Programs
Generated question: “Is Libya seeking CBW weapons?”
Distribution of the QUAB Collection
• 5,147 hand-crafted question-answer pairs in QUAB
  • 3,210 pairs for 8 “Testing” domains
  • 342 pairs for 6 “Training” domains
  • 1,595 terrorism-related pairs added to augment training data
• Approximately 180 person-hours needed to build the QUAB collection
| Scenario | Topic | QUAB Questions |
| A | India | 460 |
| B | Libya | 414 |
| C | Iran | 522 |
| D | North Korea | 316 |
| E | Pakistan | 322 |
| F | Russia | 366 |
| G | South Africa | 454 |
| H | Egypt | 356 |
| Testing Total | | 3210 |
| Average | | 401.25 |
| Std. Dev. | | 73.50 |
Selecting QUAB Pairs
• Ferret employs a complex concept-matching system to identify the QUAB questions that are most appropriate for a user’s particular query.
User query: What is North Korea’s current CW weapons capability?
Returned QUAB questions:
• What chemical weapons capabilities did North Korea have prior to 1980?
• How many tons of CW is North Korea expected to be able to produce annually?
• What could motivate North Korea to make a chemical weapons attack?
• When was North Korea first capable of producing chemical weapons in large quantities?
• How much did Iran pay North Korea to develop a ballistic missile capable of carrying a chemical weapons payload?
• Does North Korea have the ability to rapidly prepare CW and BW?
• How has the unavailability of CW precursors affected North Korea’s ability to produce certain kinds of CW?
• Which countries have or are pursuing CW capabilities?
• What are North Korea’s CW capabilities?
• Where does North Korea weaponize its CW?

(Actual dialogue: Day 4, CH8 – Question 1)
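A minimal stand-in for such concept matching, using bag-of-words cosine similarity; the slides describe Ferret's actual matcher only as "complex", so this is an illustrative approximation, not the real system:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_quab(user_question, quab_questions, top_k=10):
    """Return the top-k QUAB questions most similar to the user's
    query, as a simple proxy for Ferret's concept-matching system."""
    q_vec = Counter(user_question.lower().split())
    scored = [(q, cosine(q_vec, Counter(q.lower().split())))
              for q in quab_questions]
    scored.sort(key=lambda qs: qs[1], reverse=True)
    return [q for q, s in scored[:top_k]]
```

Returning the top 10 matches, as Ferret does, then amounts to `rank_quab(query, quab, top_k=10)`; a real matcher would additionally normalize morphology and match named-entity and ontology concepts rather than raw tokens.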
The Ferret Interface
• The control bar found at the top of every screen replicates basic browser functions such as navigation (Back, Forward, History), text searching (Find), copy-and-paste, and query submission (Ask).
• Basic on-line help is also provided.
Simple Interface
The Ferret Interface
• Answers are presented simultaneously on the browser’s screen:
  • Palantir answers are listed on the left-hand side of the screen.
  • QUAB pairs are found on the right-hand side of the screen.
Number of Answers
• The top 150 Palantir answers (ranked in order of relevance) are returned for each query.
  • Keywords are presented in bold; document links in blue.
• The top 10 QUAB pairs (also ranked in order of relevance) are returned for each question submitted to Ferret.
  • Only a short snippet of the answer is presented on the main screen.
Getting Full Docs from Palantir Answers
• Only short snippets of Palantir answers and QUAB pairs are presented on the main screen.
• Users can click on links associated with each Palantir snippet to view the full text of the document that the answer was extracted from. (The actual answer text is highlighted in yellow.)
Main Screen → Full Document
Getting Full Docs from QUAB Pairs
• With QUAB question-answer pairs, users can click on the link on the main screen to receive the full text of the answer identified by the annotator.
• Users can click a link on the screen displaying the full answer text to view the entire text of the document. (Again, the actual answer text is highlighted in yellow.)
Main Screen → Full Answer → Full Document
Using QUAB as a source for Palantir
• Users can also re-submit a QUAB question to Ferret’s Palantir automatic Q/A system by clicking the “Find more answers like this one” link on the QUAB answer page.
• This function allows users to check the answer provided in QUAB against other potential answers found in the document collection.
Main Screen → Full Answer
2004 ARDA AQUAINT Dialogue Workshop
• This summer’s ARDA-sponsored Dialogue Workshop provided us with the opportunity to test the effectiveness of Ferret in an extensive real-world experiment:
  • 3 weeks at Pacific Northwest National Laboratories
  • 8 intelligence analysts (7 Navy, 1 Army)
  • 16 “real” research scenarios (AFRL, Rome Labs)
  • 4 participating systems (Ferret, GINKO (Cyc), HITIQA (Albany), GNIST (NIST))
• Workshop opportunities:
  • First chance to gather extensive data on these kinds of dialogues
  • Feedback from actual end-users of interactive systems
  • Interaction in real time
  • Dialogues produced by “real” analysts in “real” scenarios
  • Opportunity to team with other system developers
  • Chance to demo systems at October’s AQUAINT meeting
Types of User Interactions
• 500 questions were either asked or selected across 16 sessions (average: 31.25 questions/session)
  • QUAB questions: 282 (56.4%)
  • User questions: 196 (39.2%)
  • “Find More” questions: 22 (4.4%)
Type of Questions: User Comparison
| User | QUAB Q | User Q | “Find More” |
| User 1 | 53% | 45% | 2% |
| User 2 | 35% | 65% | 0% |
| User 3 | 35% | 65% | 0% |
| User 4* | 45% | 55% | 0% |
| User 5 | 27% | 64% | 9% |
| User 6 | 54% | 42% | 4% |
| User 7* | 33% | 53% | 14% |
| User 8 | 33% | 65% | 2% |
There was a significant difference in the number of QUAB questions selected by users (p < 0.05).
A Dialogue Fragment
User: What is the current status of Russia's chemical weapons program?
  QUAB: How large is Russia's chemical weapons program?
  QUAB: What is the physical status of Russia's chemical weapons facilities?
  QUAB: Has Russia made any recent chemical weapon developments?
  QUAB: How efficient was Russia's chemical weapons destruction programme in 2003?
User: What kind of CW does Russia produce?
  QUAB: What kinds of nerve gas CW munitions does Russia have?
  QUAB: Which chemical weapons does Russia have in its chemical weapons stockpile?
  QUAB: What chemical weapons products did the Soviet Union (Russia) produce prior to World War II?
  QUAB: Where are Russia's CW storage sites?
  QUAB: How did Russia consolidate the storage of its CW?
User: Where are Russia's CW produced?
User: What scientists worked on Russia's CW program?
  QUAB: What was the Russian secret chemical weapons program known as Foliant?
User: What kind of CW research is being done in Russia?
  QUAB: Which countries are trying to obtain CBW from Russia?
  QUAB: Is Russia researching chemical weapons?
User: How can Russia deliver its CW?
  QUAB: Can Russia deliver CW via spray tanks or aerosols?
Dialogue Preferences
• Users typically consult at least one QUAB pair for almost every question submitted to Palantir.
• User dialogues: streaks of user questions
  • Average: 1.84 intervening non-user questions
  • Minimum: 0.66 questions; Maximum: 2.75 questions
• System dialogues: streaks of QUAB selections
  • Average: 2.68 non-user questions
  • Minimum: 1.74 questions; Maximum: 3.67 questions
How deep do analysts search for answers?
• Analysts tended to search through almost all of the QUAB answers returned, but looked at only about one-fifth of Palantir answers:
  • QUAB Full Answer: 88.8% (8.8th answer on average)
  • QUAB Document: 75.9% (7.6th answer)
  • Palantir Document: 18.34% (27.5th answer)
[Chart: Average depth of search, as a percentage, for Users 1–8, with separate series for Palantir answers, QUAB full answers, and QUAB documents.]
Lessons Learned
• Results from the workshop allow us to see the first examples of extensive unsupervised interactions with Q/A systems.
• Analysts’ dialogues are markedly different from human-human dialogues:
  • Topics shift (and are returned to) repeatedly
  • Analysts can lapse into “search engine” strategies
  • Dialogues are not constrained by a need to establish topics, a domain of interest, or the question under discussion
• Ultimately, the best systems will need to be flexible and reactive:
  • Anticipate potential information needs
  • Refrain from imposing particular information-seeking discourse structures on users
  • Be able to track multiple topics simultaneously
Questions? Comments?
Thank you!