Copyright
by
Shaw Yi Chaw
2009
The Dissertation Committee for Shaw Yi Chaw
certifies that this is the approved version of the following dissertation:
Addressing the Brittleness of Knowledge-Based
Question-Answering
Committee:
Bruce W. Porter, Supervisor
Kenneth J. Barker
Raymond Mooney
Gordon S. Novak Jr.
Art Markman
Addressing the Brittleness of Knowledge-Based
Question-Answering
by
Shaw Yi Chaw, BComp(Hons), MSCS
Dissertation
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
The University of Texas at Austin
December 2009
Dedicated to the memory of Bit Computer Services Pte Ltd, ACE 21, and ACJ
Computers Pte Ltd
Acknowledgments
A Ph.D. dissertation has one name on the front. However, behind this one person,
who takes all the credit, are dozens of other people: cajoling, encouraging, criticizing,
stimulating, and generally making sure that several years of alternating anguish and
excitement are eventually transformed into a dissertation. Many people contributed
to this dissertation, directly or indirectly, either by discussing the ideas contained
within, reading huge chunks, or helping me get the ASKME system to work correctly.
Many thanks are due to Bruce Porter who supervised my research work.
He provided timely encouragement and advice, carefully read and criticized numer-
ous documents and was always prepared to listen to my ideas. I have benefited
tremendously from his unique blend of energy, vision, technical insights, and prac-
tical sensibility. Most importantly, Bruce has that invaluable asset of the good
supervisor: an ever-open door.
A big set of thanks goes to Ken Barker. He was a joy to work with. Ken is also
a world class researcher and my work greatly benefited from discussions with him.
I thank him for providing so much practical advice and motivation, and for making sure
this dissertation saw the light of day.
I would like to thank Professor Gordon Novak, Professor Raymond Mooney,
and Professor Art Markman for taking on the responsibilities of serving on my
dissertation committee and, despite their busy schedules, carefully reading my thesis
for errors in technical details, English grammar, and style.
Several pieces of technology used in studying my thesis were developed prior
to my arrival at the university, or by collaborators outside the university. Particular
thanks here are due to Bruce Porter, Ken Barker, Gordon Novak, Art Souther, Peter
Clark, John Thompson, Phil Harrison, Rick Wojcik, Tom Jenkins, Vinay Chaudhri,
Ken Murray, Sunil Mishra, John Pacheco, James Fan, Peter Yeh, and Dan Tecuci.
I enjoyed the intellectual environment UTCS harbors and appreciate the
interactions with other students and researchers. Many fine folks at UTCS have
taught me, by questioning as well as by example, how to do research.
The Knowledge Systems Group has been a great working environment, and
I particularly appreciate the friendship and many thought-provoking discussions I
have enjoyed with my colleagues: James Fan, Peter Yeh, Dan Tecuci, Michael Glass,
and Doo-Soon Kim.
There are many more friends I have made at UTCS who gave me help and
support but are not mentioned here. I would like to take this opportunity to thank
all of them.
During my time at UTCS, I have had opportunities to know and interact
with many staff members, who have enriched my experience and made it unique. I
would like to thank the friendly administrative staff who helped make things go smoothly:
Gloria Ramirez, Katherine Utz, and Lydia Griff. I would also like to thank the technical
staff of UTCS for providing wonderful computing facilities and software.
I would especially like to thank Stacy Miller. She’s the best! Stacy has boundless
energy and is always positive! She gives me endless moral support, laughs at all my
jokes, and is always helpful. Stacy provided pastoral care and is my moral compass.
Stacy also sold me my first car, a 1980, lemon yellow, Toyota Celica! It was a
reliable, awesome, and fun car to drive around Austin, TX! Driving that car made
me look cooler than I really was! I would also like to acknowledge that Stacy played
a major role in naming my system ASKME (we have the picture to prove it!).
My parents and sister have been a huge source of emotional and financial
support. Despite the fact that I am thousands of miles away from them, my welfare
has always been foremost in their minds. I want to say a big thank you to them for
all the sacrifices they have made over the years in order to see me succeed in my
quest for a world-class research career.
I would like to thank my dad, Chaw Shing Cheong, for all his hard work at
supporting the family, and getting me my first home computer. Besides letting me
do real work, that computer allowed me to play SimCity for days on end. I thank
him for believing in me.
I would especially like to thank my mom, Lek Kim Noi, for introducing me
to my first computer in 1986. My mom taught me how to turn on a computer,
use DOS, play computer games, and, subsequently, how to perform DIY work on
computers. She also greatly supported my curiosity by getting me Internet access at
the earliest opportunity. Without my mom, I would not have spent so many hours
of my life sitting in front of a computer monitor. I greatly appreciate her patience
and dedication in getting me access to computers and software at such a young age.
I would like to thank my sister, Chaw Shaw Lin, for her unwavering support
and practical advice. She has supported me financially time and again. I greatly
appreciate all her help! I would especially like to thank her for generously paying
for my MacBook, doctoral regalia, and car repairs.
I would like to thank my aunt, Chaw Suat King, for letting me access her
bookstore (Computer Book Center Pte Ltd) as a private library. The countless
afternoons and weekends reading technical books and magazines greatly influenced
my decision to study Computer Science.
I would like to thank my uncle, Chaw Kiang, for the many Saturday af-
ternoons we spent together, pounding away at the computer keyboard inside Bit
Computer Services Pte Ltd.
I would like to thank my grandmother, Chng Chian Kim, for her love and
support. She has always been a good friend. Since I was young, my grandmother has
cheered me on from the sidelines, be it at a kindergarten school play or my decision to
attend graduate school in Austin, TX. I miss the wonderful times we spent chatting.
I would like to thank my professors at the National University of Singapore,
especially Khoo Siau-Cheng, for providing me with an excellent undergraduate
education and for initiating me into research.
Support for this research was provided by Vulcan Inc. as part of Project
Halo, SRI International as part of Learning by Reading, IBM Corporation, and by
the Boeing Company.
I regret that my grandfather, Chaw Boon Se, did not live to see where I am
today. I would like to have shared with him these exciting years spent working on
my doctoral degree.
Shaw Yi Chaw
The University of Texas at Austin
December 2009
Addressing the Brittleness of Knowledge-Based
Question-Answering
Publication No.
Shaw Yi Chaw, Ph.D.
The University of Texas at Austin, 2009
Supervisor: Bruce W. Porter
Knowledge base systems are brittle when the users of the knowledge base
are unfamiliar with its content and structure. Querying a knowledge base requires
users to state their questions in precise and complete formal representations that
relate the facts in the question with relevant terms and relations in the underlying
knowledge base. This requirement places a heavy burden on the users to become
deeply familiar with the contents of the knowledge base and prevents novice users
from effectively using the knowledge base for problem solving. As a result, the utility
of knowledge base systems is often restricted to the developers themselves.
The goal of this work is to help users, who may possess little domain exper-
tise, to use unfamiliar knowledge bases for problem solving. Our thesis is that the
difficulty in using unfamiliar knowledge bases can be addressed by an approach that
funnels natural questions, expressed in English, into formal representations appro-
priate for automated reasoning. The approach uses a simplified English controlled
language, a domain-neutral ontology, a set of mechanisms to handle a handful of
well-known question types, and a software component, called the Question Mediator, to
identify relevant information in the knowledge base for problem solving. With our
approach, knowledge base users can use a variety of unfamiliar knowledge bases
by posing their questions in simplified English to retrieve relevant information in
the knowledge base for problem solving.
We studied the thesis in the context of a system called ASKME. We evaluated
ASKME on the task of answering exam questions for college level biology, chemistry,
and physics. The evaluation consists of successive experiments to test if ASKME can
help novice users employ unfamiliar knowledge bases for problem solving. The initial
experiment measures ASKME’s level of performance under ideal conditions, where
the knowledge base is built and used by the same knowledge engineers. Subsequent
experiments measure ASKME’s level of performance under increasingly realistic
conditions. In the final experiment, we measure ASKME’s level of performance
under conditions where the knowledge base is independently built by subject matter
experts and the users of the knowledge base are a group of novices who are unfamiliar
with the knowledge base.
Results from the evaluation show that ASKME works well on different knowl-
edge bases and answers a broad range of questions that were posed by novice users
in a variety of domains.
Contents
Acknowledgments
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 The ASKME Approach
1.3 Project Halo
1.4 Summary of Evaluation and Results
1.5 Organization of Dissertation
Chapter 2 Technical Challenges
2.1 Introduction
2.2 Linguistic processing challenges
2.3 Resolving Representational Differences
2.4 Relevance Reasoning
2.5 Summary
Chapter 3 Related Work
Chapter 4 Approach
4.0.1 Restricted English
4.0.2 Domain-Neutral Ontology
4.0.3 Question Taxonomy
4.0.4 Question Mediator
4.1 The ASKME Prototype
4.1.1 Computer Processable Language
4.1.2 The Component Library
4.1.3 Supported questions
4.2 The Design of the Question Mediator
4.2.1 A search controller to find information to answer questions
4.2.2 Resolving Representational Differences
4.2.3 Relevance Reasoning
4.2.4 Block diagram and pseudocode
4.3 Related work
4.3.1 Restricted English
4.3.2 Available Ontologies
4.3.3 Question categories
4.3.4 Question mediator
4.4 Summary
Chapter 5 Evaluation
5.1 Introduction
5.2 Experiment #1: Establishing ASKME’s performance under ideal conditions
5.2.1 Results
5.3 Experiment #2: Brittleness analysis
5.3.1 Results
5.3.2 Failure Categories
5.3.3 Discussion
5.4 Experiment #3: Can questions continue to answer with different knowledge bases?
5.4.1 Results
5.5 Experiment #4: Can users pose questions using ASKME to successfully query knowledge bases with which they are unfamiliar?
5.5.1 Results
5.6 Experiment #5: Establishing ASKME’s performance under production conditions
5.6.1 Results
5.7 Summary
Chapter 6 Contributions and Future Work
6.1 Contributions
6.1.1 Prior art
6.1.2 A General Framework for querying unfamiliar knowledge bases
6.1.3 An Application of ASKME for a system answering AP-like questions
6.2 Future Work
6.2.1 Identifying unstated assumptions
6.2.2 Sanity Checking
6.2.3 Debug Tool
6.2.4 Episodic Memory and Transfer Learning
6.2.5 Machine Reading Application
Bibliography
Vita
List of Tables
1.1 Differences between two successful types of knowledge-base systems: rule-based expert systems and Wikipedia
3.1 Summary of notable knowledge-based systems and the challenges addressed by them
4.1 Some of the known properties of restricted English that are important to making natural language input easier to process by computers (originally produced by Kittredge[61]; the table is reproduced here for convenience)
4.2 Example CPL sentences, reproduced for convenience verbatim from Clark et al.[30]
4.3 Users follow a set of guidelines while writing CPL sentences. Some of the guidelines are stylistic recommendations to reduce ambiguity, while others are firm constraints on vocabulary and grammar. These guidelines are taken verbatim from Clark et al.[30].
4.4 For convenience, we reproduce verbatim the question categories, abstract specifications, and associated examples from the Graesser-Person question classification scheme[49]
4.5 Summary of how each question type can be represented as query triples
4.6 Available question categories in QUALM, Graesser-Person, and ASKME. For convenience, we reproduce verbatim the sample questions previously described in QUALM[68] and the Graesser-Person question taxonomy[49].
5.1 Four knowledge bases were used in the evaluation. The knowledge bases for each science domain were created by knowledge engineers working alongside subject matter experts. The significantly larger multidomain knowledge base was created by concatenating the contents of the individual domain knowledge bases.
5.2 Summary of the differences between the regular version of ASKME, ASKME-W/O-QM, and ASKME-BFS-QM. Relevance reasoning is not applicable to ASKME-W/O-QM because it does not search the knowledge base for information to extend a question for problem solving.
5.3 Correctness scores achieved by knowledge engineers posing questions using ASKME on the reference knowledge base
5.4 Effect of the question mediator and xform rules on answering questions with the reference knowledge base
5.5 The average, median, 75th percentile, 90th percentile, and maximum number of states explored by both versions of the system (with and without relevance reasoning) for both domain knowledge bases and the larger multi-domain knowledge base.
5.6 The average, median, 75th percentile, 90th percentile, and maximum runtime performance (seconds) of both versions of the system (with and without relevance reasoning) for both domain knowledge bases and the larger multi-domain knowledge base. In some cases, the heuristics used by ASKME contributed to worse runtime performance, but they were necessary to maximize correctness scores.
5.7 The correctness scores for versions of the system with and without relevance reasoning. Both versions achieved similar correctness scores on the domain knowledge bases, indicating that relevance reasoning did not sacrifice correctness. The version of the system without relevance reasoning recorded lower correctness scores when answering physics questions with the significantly larger multi-domain knowledge base. This is due to the large number of states explored during blind search and our evaluation setup aborting an attempt after a time bound is reached. This result highlights the need for the system to select only the most relevant portions of the knowledge base to reason with.
5.8 Correctness scores achieved by knowledge engineers on the unseen question set
5.9 Failure analysis of why biology questions in the unseen question set fail to answer
5.10 Failure analysis of why chemistry questions in the unseen question set fail to answer
5.11 Failure analysis of why physics questions in the unseen question set fail to answer
5.12 The knowledge bases used in the study of ASKME’s ability to answer questions using a variety of knowledge bases that differ in content and organization. Aside from the reference knowledge bases, which were created by knowledge engineers working closely with subject matter experts, the other knowledge bases were independently authored by subject matter experts.
5.13 Correctness scores when reference question formulations are attempted on different KBs independently built by subject matter experts.
5.14 Answer credit scores observed in experiment #4’s question-answering evaluation
5.15 Aggregated and average correctness scores achieved by novice users in experiment #4
5.16 Data on the number of formulations and time spent per question by novice users in experiment #4
5.17 Answer credit scores achieved by knowledge engineers and novice users on knowledge bases independently authored by subject matter experts.
5.18 Aggregated and average correctness scores achieved by novice users in experiment #5
List of Figures
2.1 Two questions answered using different viewpoints of the Plant concept shown in Figure 2.2.
2.2 The Plant concept and its two viewpoints
2.3 Questions having multiple answers using different information contexts in the knowledge base.
2.4 The question representation contains two modeling differences from the knowledge base. First, the question was described as a single Move event, while the knowledge base contains a richer representation, in which the Move is caused by an Exert-Force. Second, the query in the question is on the force slot, while the necessary equations to answer the question are stored on the net-force slot in the knowledge base.
2.5 This question highlights the need to automatically install unstated assumptions commonly present in questions. The sentences in version 2 that are rendered in italics introduce commonly assumed values for the initial y speed, initial x position, and final y position.
2.6 This example highlights the difficulty of finding relevant information in the knowledge base to return correct answers. Two answers to the question are shown. Both answers were returned by the problem solver using the same knowledge base. One of them is incorrect because it assumed the acceleration to be 9.8 m/s^2 instead of -9.8 m/s^2.
2.7 Using a typical biology knowledge base, the problem solver describes Lysosomes as playing the role of a container. Ideally, the returned answer should describe the specific container role played by Lysosomes (rendered in italics).
3.1 Different generations of knowledge-based systems and their contributions to question-answering
4.1 A graphical table of contents for the detailed example given in Figures 4.2 and 4.3. ASKME answers the user’s question in three steps: interpreting the question using the CPL processor, selecting domain knowledge to answer it using the question mediator, and generating an explanation.
4.2 In panel 1, a physics question is posed to the system in simplified English. The system interprets the question as shown in panel 2. The scenario and query of the question are interpreted as a Move event on an Object having mass 80 kg. The initial and final velocities of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same object as that of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question mark. ASKME’s processing continues in Figure 4.3.
4.3 The continuation of the example from Figure 4.2. In panel 3, the question mediator draws in information from the knowledge base. The final answer and explanation are shown in panel 4.
4.4 The user poses questions in restricted English. CPL tries to understand the question. The CPL interpreter provides reformulation advice if it detects CPL errors. Otherwise, it presents the system’s understanding of the question both as an English paraphrase and graphically for the user to check whether CPL understood the question correctly. This figure is courtesy of Peter Clark at Boeing Phantom Works.
4.5 On the left-hand side, the scenario and query of the question are represented as a Move event on an Object having mass 80 kg. The initial and final velocities of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same as that of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question mark. To answer the question, the question mediator has to draw in information from the knowledge base. As shown on the right-hand side, information from the Motion-under-force and Motion-with-constant-acceleration concepts is used to elaborate the question for the reasoner to compute the answer.
4.6 Search graph created by the question mediator to answer the question introduced in Figure 4.5. Each state in our state-space tree contains a minikb. The initial state in the tree represents the original minikb for the question. The minikbs in other states are elaborations of the original minikb. Each operator in the tree describes how a concept in the knowledge base can be applied to elaborate one minikb to produce another.
4.7 The minikbs for states 1, 2, and 5 in the search graph in Figure 4.6. State 1, the initial state in the search graph, contains the original minikb for the question. State 2 contains the minikb created by elaborating the minikb in state 1 using the Motion-with-constant-acceleration concept. This elaboration introduced an equation to calculate the acceleration of the Move event. Using this equation, the acceleration of the Move was computed to be 14.45 m/s^2. Further elaborating the minikb in state 2 using the Motion-under-force concept results in the minikb in state 5. This minikb contains the equations that compute the net-force causing the Move to be -1156 Newtons.
4.8 The semantic matcher identified the matching features (highlighted in bold) between state 1 on the left-hand side and the Motion-with-constant-acceleration concept on the right-hand side. In this case, the result of semantic matching becomes Operator A, relating state 1 to state 2 in the search graph shown in Figure 4.6.
4.9 The elaborated minikb after applying Operator A in the search graph (Figure 4.6). Panel 1 shows the minikb of state 1 and panel 2 shows the Motion-with-constant-acceleration concept. The overlapping features found by semantic matching between the minikb in state 1 and Motion-with-constant-acceleration are highlighted in bold in panels 1 and 2. These overlapping features are joined to form the minikb in panel 3. The new piece of knowledge introduced by the concept Motion-with-constant-acceleration is highlighted in bold in panel 3. This minikb forms state 2 of the search graph (Figure 4.6).
4.10 The application of a transformation rule to include an additional relation between Human-Cell and Chromosome in the richer representation. This particular rule encodes the transitivity of the has-part relation.
4.11 The original set of query triples to answer “What is the function of lysosomes?” is expanded to include related queries on the Agentive roles. This, in turn, retrieves additional details from the knowledge base to answer the question.
4.12 Answer differences when additional queries are used or not used to retrieve finer-grained information.
4.13 The semantic matcher did not identify any matching features between state 1 (Figure 4.6) on the left-hand side and the Circle concept on the right-hand side. In this case, no operator is created.
4.14 Information from Fall-from-rest is added to state 1 of Figure 4.6, resulting in an inconsistency in which the initial velocity of the Move has multiple values: 17 m/s and 0 m/s.
4.15 Different degrees of match between state 1 in Figure 4.6 and the Motion-under-force and Two-Dimensional-Move concepts in the knowledge base. The match with Motion-under-force is stronger because a larger portion of Motion-under-force matches state 1. Thus, Operator B is preferred over Operator C in Figure 4.6.
4.16 Block diagram of the question mediator.
4.17 The pseudocode for the question mediator is given in Figures 4.17-4.21. The high-level structure of the algorithm is best-first search. This figure gives a standard definition of best-first search by Nilsson[84]. Figures 4.18, 4.19-4.20, and 4.21 show the instantiations of steps 6, 7, and 9 (respectively) that describe the question mediator.
4.18 Pseudocode for step 6 of Figure 4.17
4.19 Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (part 1/2)
4.20 Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (part 2/2)
4.21 Pseudocode for step 9 of Figure 4.17
5.1 Answer credit distributions among novice users and the knowledge engineer in experiment #4
5.2 Score distribution among novice users and the knowledge engineer in experiment #5
Chapter 1
Introduction
Knowledge base systems enable non-experts to perform tasks that have tradition-
ally required experts. However, these systems are brittle and expensive to build.
The domain knowledge has to be formalized into logic to support problem solving. This
knowledge acquisition task is difficult and error-prone, particularly for domain ex-
perts who have little training in knowledge representation. Knowledge engineers
usually work closely with domain experts to create the knowledge base. The result-
ing knowledge base is complex and engineered to perform well on specific tasks, but
is ill-suited to more general use.
I would like to reduce the cost of knowledge acquisition and enhance the
utility of the knowledge base. In the ideal system, domain experts create knowledge
bases using familiar notations in their domains. These domain-specific representa-
tions are then converted into computational logic for problem solving. The resulting
knowledge base may then be employed by users with little training to perform a va-
riety of tasks.
Achieving this ideal requires overcoming the brittleness in automated reason-
ing. This brittleness is due to the arms-length separation between the people who
built the knowledge base and the users who will eventually use it. First, the domain
experts building the knowledge base cannot anticipate all the tasks it will be used
for. Thus, they cannot engineer their representations to support problem solving for
these unanticipated tasks. Second, the users of the knowledge base are unfamiliar
with the contents of the knowledge base. Therefore, the users face difficulties in
working with the knowledge base for problem solving.
Learning to use an unfamiliar knowledge base is similar to learning to commu-
nicate in a new language. A non-native speaker visiting a foreign-speaking country
has to learn the vocabulary, the grammar, and the pragmatics of the foreign lan-
guage to communicate effectively. Likewise, a knowledge base user has to learn the
terms and relations in the knowledge base, the valid constructions for those terms
and relations, and identify the most appropriate representation in the knowledge
base for problem solving.
A steep learning curve confronts both expert and novice knowledge base
users, and is especially pronounced for users with limited domain expertise. Knowl-
edge base users who lack domain expertise often convey their intent in long descrip-
tions composed of general terms and relations. For example, a patient who lacks
medical expertise typically describes his ailment to the doctor in long descriptions
using general concepts familiar to himself and the doctor. His description may also
contain irrelevant or even incorrect information. To treat the patient, the doctor
applies his medical expertise to recognize how different ailments fit the described
symptoms. The difficulty faced by the patient to access relevant medical expertise
possessed by the doctor is analogous to the difficulties faced by a knowledge base
user with limited domain expertise. Making a best effort, the user describes his intent
in long descriptions, using only general terms in the knowledge base. However,
his description is unlikely to identify specific information in the knowledge base for
problem solving.
This work describes the design and evaluation of a system, called ASKME (short for “Automated Selection of Knowledge in a Mediated Environment”),
that is intended to bridge the gap between the formal representations of knowledge
bases and the automated reasoners attempting to use those representations to an-
swer questions. The described approach helps users with different levels of domain
expertise to use unfamiliar knowledge bases for problem solving. ASKME’s algo-
rithms take advantage of features of the generic knowledge representation language,
but are independent of the specific content and organization in these knowledge
bases. I note that all of the domain knowledge and inference methods are in the
knowledge bases and that ASKME’s role is to interpret the user’s question and
identify the necessary information in the underlying knowledge base for problem
solving.
The first section describes two successful knowledge base systems and their
design tradeoffs. In an ideal system, the knowledge base is built independently
by subject matter experts, and it can be used by a different group of users who
possess limited expertise in the subject matter for problem solving. This creates
a substantial challenge for automated question answering systems: to be able to
answer questions that are formulated without regard for (or even familiarity with)
the knowledge base that is expected to answer them. This ideal knowledge base system
has yet to be built due to the difficulty faced by users with limited domain expertise
to use unfamiliar knowledge bases for problem solving.
The second section describes in detail the goal of this dissertation. This
work aims to help users, who may possess little domain expertise, to use unfamiliar
knowledge bases for problem solving. My thesis is that the difficulty in using unfa-
miliar knowledge bases can be addressed by an approach that consists of a version of
restricted English, a set of mechanisms to answer a handful of well-known question
types, and a software component to identify relevant information in the knowledge
base for problem solving. With my approach, knowledge base users can use a variety
of unfamiliar knowledge bases by posing their questions in simplified English
to retrieve relevant information in the knowledge base for problem solving.
The third section outlines my approach to studying the thesis. I studied the
thesis for the task of question answering in the context of a system called ASKME,
which is used to answer questions posed by users who are unfamiliar with the knowl-
edge base. I built a prototype version of ASKME to study and evaluate my approach.
The fourth section summarizes the results of my evaluation of ASKME. With
help from an independent evaluation team, I evaluated ASKME on the task of an-
swering AP-like exam questions (AP, or Advanced Placement, exams are nationally administered, college entry-level tests in the USA) in the domains of college level biology, chemistry,
and physics. I conducted successive experiments to test if ASKME can help novice
users solve problems with unfamiliar knowledge bases. The initial experiment mea-
sures ASKME’s level of performance under ideal conditions, where the knowledge
base is built and used by the same knowledge engineers. Successive experiments
measure ASKME’s level of performance under increasingly realistic conditions. In
the final experiment, I measure ASKME’s level of performance under conditions
where the knowledge base is built independently by subject matter experts and the
users of the knowledge base are novice users who are unfamiliar with its contents.
1.1 Background
Knowledge base users face the challenge of creating inputs for an automated rea-
soner to perform automated reasoning. The inputs have to be described using the
vocabulary and grammar of the knowledge base. They also have to identify rele-
vant information in the knowledge base for problem solving. Successfully creating
the inputs requires the user to have domain expertise and a deep understanding of
the knowledge base. This familiarity is difficult to acquire from the builders of the
knowledge base and also difficult to transfer to its intended users.
To overcome the difficulty facing users of unfamiliar knowledge bases, the
designers of knowledge base systems typically make a tradeoff between engineering
a small knowledge base for a narrowly defined, deep reasoning task and widening
the knowledge acquisition bottleneck to quickly create a large knowledge base for
shallow reasoning. Rule-based[17] and Wikipedia[117] systems are two successful
knowledge base systems that approach this tradeoff differently.
Rule-based systems are designed to automate tasks that have traditionally
required experts. However, these systems are expensive to build and they require
domain experts to work alongside knowledge engineers to create the knowledge base.
Acquiring the knowledge from experts to build the knowledge base is a difficult and
error-prone process that often results in knowledge bases that are engineered to
perform well only on specific tasks. The limited nature of rule-based systems has
three disadvantages. First, rule-based systems cannot be easily adapted to perform
unanticipated tasks because they lack the knowledge of “first principles” from which
to reason [69, 82]. Second, because the knowledge base is tailored for a particular
task, that knowledge is not easily reused or extended to perform other tasks within
the same domain [22, 42]. Third, the resulting knowledge bases are often usable
only by the developers themselves due to their complexity and brittleness[17].
By contrast, it is easy to build and use a Wikipedia knowledge base. Wikipedia
uses representations familiar to many users, such as text, images, and other unstruc-
tured elements. While Wikipedia is successfully used by a large user community to
search for relevant documents, it lacks the ability to solve problems at higher levels,
such as performing diagnosis or prediction tasks. For Wikipedia to have such capa-
bilities that rely on automated reasoning would require separate systems to convert
the unstructured content into computational logic [5, 6, 73, 74, 107]. Automating
this task is complex and difficult to do well. The conversion task requires sophis-
ticated natural language and image understanding systems [6, 75, 93, 122]. Once
converted, the generated knowledge bases often contain shallow information[73, 81,
93, 115], errors[74], and are too highly fragmented [10, 59] to be useful for auto-
mated reasoning. Moreover, they are large and complex, making it difficult for users to
become proficient in using them to perform automated reasoning.
Table 1.1 lists the differences between rule-based and Wikipedia systems.
Their differences convey the tension between engineering a knowledge base to au-
tomate a narrowly defined task that requires advanced expertise and widening the
knowledge acquisition bottleneck to quickly create knowledge bases for shallow rea-
soning.
Ideally, a knowledge base system has the following three properties:
• The knowledge base is built independently by domain experts.
• The system achieves an arms-length separation between the builders and the
users of the knowledge base.
• Users who lack domain expertise or familiarity with the knowledge base can
perform automated reasoning with the resulting knowledge base.
This knowledge base system is ideal from several perspectives, e.g., the re-
duced knowledge acquisition costs since domain experts can build the knowledge
base themselves without working alongside knowledge engineers. But more impor-
tantly, for purposes of this dissertation, the ideal system is less brittle and does
not require users to face a steep learning curve. Therefore the ideal knowledge base
system is easier to use and its application is not limited to the developers themselves.
The goal of this dissertation is to enable users who lack domain expertise to
use unfamiliar knowledge bases for problem solving.
                          Rule-based expert systems          Wikipedia

Purpose                   Automate tasks that have           Repository of a wide variety
                          traditionally required experts     of information

Knowledge representation  Structured, formal logic           Unstructured: natural language,
                                                             diagrams, equations, etc.

Inference                 Automated reasoning                Information retrieval techniques

Knowledge acquisition     Difficult                          Easy

Problem solving           Easy to achieve, e.g.,             Difficult to achieve; requires
                          deduction, heuristic               robust natural language and
                          classification, etc.               image understanding systems

Created by                Knowledge engineers working        Subject matter experts working
                          closely with subject matter        independently
                          experts

Used by                   Small user community,              Large user community with
                          typically the developers           different levels of domain
                          themselves                         expertise

Coverage                  Narrow; focused on                 Broad
                          particular tasks

Table 1.1: Differences between two successful types of knowledge-base systems: rule-based expert systems and Wikipedia
1.2 The ASKME Approach
My thesis is that the brittleness problem of knowledge-based question answering
can be addressed:
• Questions can be posed with little regard for the ontology of the knowledge
base
• Knowledge base users can use a variety of unfamiliar knowledge bases for
problem solving
• There can be an arms-length separation between the builders and the users of
the knowledge base.
ASKME’s algorithms take advantage of features of the generic knowledge
representation language, but are independent of the specific content and organization
in these knowledge bases. All of the domain knowledge and inference methods
necessary for problem solving are in the knowledge bases. ASKME’s role is to
interpret the user’s question and identify the necessary information in the underlying
knowledge base for problem solving.
I have built a prototype version of ASKME. The major features of ASKME
collectively function as a funnel (metaphorically) to map questions stated in unre-
stricted English into a formal representation that can be answered by a knowledge
base system. I have designed ASKME to be easy to learn and use, while being
robust enough to answer a variety of questions. I believe my engineering choices
should not burden the user with restrictions on how they can pose questions. With
ASKME, users formulate their questions using a restricted English vocabulary. The
questions are then interpreted by ASKME and represented in a logical form us-
ing general terms and relations from the knowledge base. ASKME then identifies
relevant information in the knowledge base to extend the question formulation for
problem solving.
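As a concrete illustration of this pipeline, consider the physics example developed in Chapter 4: an 80 kg object moving at 17 m/s is brought to rest over a distance of 10 m, and the question asks for the net force. Below is a minimal sketch, written in Python purely for readability, of the kind of logical form involved; the triple encoding and every identifier are illustrative assumptions, not ASKME's actual internal structures.

```python
# Illustrative only: one plausible triple encoding of the Chapter 4 physics
# question. The encoding and identifiers are assumptions for exposition.
scenario = [
    ("Move1",   "instance-of",      "Move"),         # a Move event...
    ("Move1",   "object",           "Object1"),      # ...on an object
    ("Object1", "mass",             "80 kg"),
    ("Move1",   "initial-velocity", "17 m/s"),
    ("Move1",   "final-velocity",   "0 m/s"),
    ("Move1",   "distance",         "10 m"),
    ("Force1",  "instance-of",      "Exert-Force"),  # an Exert-Force event
    ("Force1",  "object",           "Object1"),      # on the same object
    ("Force1",  "causes",           "Move1"),        # that causes the Move
]
query = ("Force1", "net-force", "?")                 # the slot being asked for
```

Every term here (Move, Exert-Force, object, net-force) is a general term from the domain-neutral ontology, which is what allows the same formulation to be posed, unchanged, against knowledge bases the user has never seen.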
The next sections discuss the design of the ASKME components.
Restricted English
Producing formal representations from natural language is very difficult be-
cause natural language permits an enormous amount of expressive variation. Dif-
ferent writers may use a variety of styles and grammatical constructions to express
the same meaning. ASKME uses a controlled language, which avoids difficult prob-
lems such as resolving ambiguity and co-reference in natural language processing.
A controlled language is a form of language with special restrictions on grammar
and style usage that are based on well-established writing principles. Using a con-
trolled language improves the consistency and readability of a text by countering
the tendency of writers to use unusual or overly specialized language constructions.
Controlled language has been successfully deployed in industry and has been shown
to be useful in many technical settings[96, 112, 116]. The commercial success of
controlled language suggests that people can indeed learn to work with restricted
English.
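To give a feel for what "special restrictions on grammar and style" can mean in practice, the toy sketch below checks a sentence against a fixed vocabulary and bans pronouns, which would otherwise force co-reference resolution. It is a hypothetical simplification and not CPL, whose restrictions are far richer; like CPL (Chapter 4), though, it responds to violations with reformulation advice rather than bare rejections.

```python
# A toy controlled-language check; purely hypothetical, not CPL.
ALLOWED_WORDS = {"a", "an", "the", "man", "drives", "car", "to", "store"}
BANNED_PRONOUNS = {"it", "they", "he", "she", "this", "that"}

def check_sentence(sentence):
    """Return reformulation advice for anything outside the restricted English."""
    advice = []
    for word in sentence.lower().rstrip(".?").split():
        if word in BANNED_PRONOUNS:
            advice.append(f"replace the pronoun '{word}' with an explicit noun")
        elif word not in ALLOWED_WORDS:
            advice.append(f"'{word}' is outside the controlled vocabulary")
    return advice

print(check_sentence("A man drives it to the store."))
# -> ["replace the pronoun 'it' with an explicit noun"]
```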
Domain-Neutral Ontology
The English language with its large vocabulary allows users to express things
in subtly different ways. This expressiveness complicates the understanding of text
by computers, which involves mapping the many ways of expressing something in
English, into meaningful representations in the knowledge base. Automating this
task is complicated because what is written in English often does not fit into the
target ontology of the knowledge base. ASKME uses a domain-neutral ontology to
overcome the technical difficulty of choosing appropriate entries in the knowledge
base to create relevant meaning representations. In a variety of applications, domain-
neutral ontologies have been shown to be easily learned and used by novices to express
a wide variety of knowledge through the specialization and composition of these
general terms.
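The toy sketch below suggests how such an ontology supports specialization and composition: a small taxonomy of general terms lets a domain-specific statement be expressed without any domain-specific vocabulary. The taxonomy and names are hypothetical stand-ins for the Component Library described in Chapter 4.

```python
# A toy stand-in for a domain-neutral ontology: child -> parent taxonomy.
GENERAL_ONTOLOGY = {"Entity": None, "Object": "Entity",
                    "Event": None, "Move": "Event", "Exert-Force": "Event"}

def is_a(concept, ancestor):
    """Walk from a concept toward the root, checking for the ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = GENERAL_ONTOLOGY.get(concept)
    return False

# "The car brakes" needs no domain-specific 'Braking' term: it can be
# expressed by specializing the general Move concept and composing it
# with the car that plays its object role.
braking = [("Move1", "instance-of", "Move"), ("Move1", "object", "Car1")]
assert is_a("Move", "Event") and not is_a("Move", "Entity")
```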
Limited question types
The English language also allows a wide variety of question types. It is a
challenging task to enumerate all questions and implement the required process-
ing to correctly answer each question. ASKME requires the user to formulate the
original question in one of several well-known question types to make it easier for
computer processing. The supported questions are the simple and intermediate
difficulty questions in the Graesser-Person question taxonomy[49]. The difficulty
scale in the Graesser-Person taxonomy is defined by the amount and complexity of
content produced to answer a question. The set of simple and intermediate difficulty
questions has been found to be well known and to cover a majority of questions posed by
both students and tutors in a variety of settings.
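Chapter 4 (Table 4.5) summarizes how each supported question type can be represented as query triples. The sketch below illustrates the general idea of such a mapping; the category names are drawn from the Graesser-Person scheme, but the dispatch logic is invented for illustration and is not ASKME's implementation.

```python
# Hypothetical mapping from recognized question types to query triples,
# in the spirit of Table 4.5; the dispatch logic is illustrative only.
def to_query_triples(qtype, subject, slot=None):
    if qtype == "concept completion":   # e.g., "What is the net force?"
        return [(subject, slot, "?")]
    if qtype == "quantification":       # e.g., "How much force is exerted?"
        return [(subject, slot, "?")]   # the queried slot holds a quantity
    if qtype == "definition":           # e.g., "What is a lysosome?"
        return [(subject, "instance-of", "?")]
    raise ValueError(f"unsupported question type: {qtype}")

print(to_query_triples("quantification", "Force1", "net-force"))
# -> [('Force1', 'net-force', '?')]
```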
Question mediator
Given a question stated with general terms and relations that are common
among different knowledge bases, the question mediator’s challenge is to identify
information in a knowledge base to extend the question so as to enable the problem
solver to answer the question. The question mediator synthesizes a mini-knowledge
base (“minikb”) that contains the information needed to infer an answer. Initially,
the minikb contains only the information (triples) in the question. The question
mediator incrementally extends the minikb with frames (domain concepts) drawn
from the knowledge base. The frames include both domain assertions and inference
methods. The question mediator succeeds if it constructs a minikb that is sufficient
to answer the question. My approach to the question mediator consists of searching a
state space. The states are minikbs and operators select content from the knowledge
base being queried and add it to the minikb. I describe the question mediator using
the five components of state-space search: states, goal test, goal state, operators,
and the control strategy.
State: Each state in the state-space tree contains a minikb represented in a form
that is similar to conceptual graphs[106]. The initial state in the tree represents the
original minikb for the question. The minikbs in other states are elaborations of
the original minikb that contain additional information from other concepts in the
knowledge base.
Goal test and goal state: The goal test determines if a state’s minikb answers
the question. A state in the search graph containing such a minikb is known as a
goal state. The basic goal test retrieves and tests if the values to the queries in a
state’s minikb is null. Where applicable, the goal test includes additional queries
to improve the answer by retrieving similar kinds of information or finer-grained
details from the state’s minikb.
Operator: Operators elaborate a state to produce another. Operators are created
from the set of concepts in the knowledge base – these concepts are represented as
concept graphs. A semantic matcher takes a state’s minikb and a concept (encoded
in a form similar to conceptual graphs) and uses taxonomic knowledge to find the
largest connected subgraph that is isomorphic between the two representations.
The output of semantic matching forms an operator relating two states. Where
applicable, the question mediator transforms the candidate graphs to improve the
match between a state’s minikb and the concept. This transformation resolves
representational differences that may exist between a state’s minikb and concepts
in the knowledge base. Applying an operator to a state creates a successor state.
The successor state includes new information introduced by a concept from which
the operator was created. The new information is merged with the parent state
by joining [106] the representations on their common features found by semantic
matching.
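A drastically simplified sketch of this step appears below. Finding the largest connected isomorphic subgraph is a hard combinatorial problem, and the greedy triple-pairing here only hints at the idea of aligning features through taxonomic knowledge; every name in it is hypothetical.

```python
# Greedy, toy approximation of semantic matching over triples. ASKME's
# matcher finds the largest connected isomorphic subgraph; this does not.
TAXONOMY = {"Move": "Event", "Exert-Force": "Event"}  # toy taxonomic knowledge

def aligns(a, b):
    """Two labels align if they are equal or taxonomically related."""
    return a == b or TAXONOMY.get(a) == b or TAXONOMY.get(b) == a

def semantic_match(minikb, concept):
    """Pair triples that share a relation and have aligned endpoints."""
    return [(t1, t2)
            for t1 in minikb for t2 in concept
            if t1[1] == t2[1] and aligns(t1[0], t2[0]) and aligns(t1[2], t2[2])]

# The minikb's Move shares the initial-velocity feature with the concept,
# so the non-empty match below would become an operator in the search graph.
minikb  = [("Move", "initial-velocity", "Velocity")]
concept = [("Move", "initial-velocity", "Velocity"),
           ("Move", "acceleration",     "Acceleration")]
print(semantic_match(minikb, concept))  # one matched pair
```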
Control strategy: The question mediator expands the search graph in a breadth-first
manner to ensure that the minikb returned by the question mediator contains only
concepts necessary to answer the question. To achieve good performance and scal-
ability on large knowledge bases, the search controller applies heuristics to reject
operators that are not useful. It also orders operators on each ply based on their
relevance to answering the question.
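Putting the five components together, the control loop can be sketched as below. This is a minimal, assumption-laden rendering: the goal test, operator application, and relevance scoring are passed in as black boxes, and the priority queue stands in for the ply-by-ply expansion with relevance-ordered operators described above. Chapter 4 presents the actual controller as an instantiation of best-first search, with additional heuristics that reject useless operators.

```python
import heapq

def mediate(question_minikb, query, kb_concepts, answered, elaborate, score):
    """Sketch of the question mediator's search over minikbs (assumptions:
    answered(minikb, query)    -> an answer, or None  (the goal test)
    elaborate(minikb, concept) -> a successor minikb, or None when the
                                  semantic matcher finds no overlap
    score(minikb, concept)     -> degree of match; higher = more relevant)."""
    frontier = [(0.0, 0, question_minikb)]  # (priority, tiebreak, state)
    counter = 1                             # unique tiebreak: the heap never
                                            # has to compare two minikbs
    while frontier:
        _, _, minikb = heapq.heappop(frontier)
        answer = answered(minikb, query)
        if answer is not None:
            return answer, minikb           # reached a goal state
        for concept in kb_concepts:         # try to create operators
            child = elaborate(minikb, concept)
            if child is None:               # no semantic match: no operator
                continue
            heapq.heappush(frontier, (-score(minikb, concept), counter, child))
            counter += 1
    return None, None                       # no minikb answers the question
```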
1.3 Project Halo
The research of this dissertation was performed in the context of Project Halo.
A goal of Project Halo is to develop a knowledge-based question answering
system capable of answering novel (previously unseen) questions posed by untrained
non-experts using knowledge bases built by subject matter experts in a variety of
domains [114]. This goal creates a substantial challenge for automated question
answering systems: to be able to answer questions that are formulated without
regard for (or even familiarity with) the knowledge base that is expected to answer
them. A question answering system has successfully addressed this challenge if it can be
coupled with any of a variety of knowledge bases – each with its own independently
built ontology – and it can answer questions without requiring users to reformulate
the questions for each knowledge base that is used.
The ASKME prototype described in this dissertation is part of a larger
AURA system developed by a team of researchers to achieve the goals of Project
Halo. The AURA system enables subject matter experts to create knowledge bases
using concept maps, equations and tables – all of which are converted to computa-
tional logic [25]. In addition, the AURA system enables a different set of users, who
have limited domain expertise or familiarity with the knowledge base, to pose ques-
tions, like those found in Advanced Placement(AP) exams, and to receive coherent
answers and explanations.
The AURA system consists of three main components: (1) the knowledge
capture system; (2) the ASKME system; and (3) inference engines and explanation
generation.
SRI implemented the “Knowledge Capture” portions of the system and in-
tegrated the components into AURA.
ASKME was primarily developed at the University of Texas at Austin with
significant help from Peter Clark (Boeing Phantom Works). The CPL interpreter
used in ASKME was primarily developed by Boeing Phantom Works. The domain-
neutral ontology used in ASKME, the Component Library, has been under develop-
ment by the Knowledge Systems Group for the past ten years. I have been a major
contributor to the Component Library. The set of supported question types was
identified by myself, and I also designed and implemented the question mediator.
The integration of the various components of ASKME was done by Peter Clark
(Boeing Phantom Works) and myself.
The University of Texas at Austin also contributed the KM inference engine,
an equation solver, a conversion system for units of measurement, and an expla-
nation generator. KM was originally developed by Peter Clark (Boeing Phantom
Works) and Bruce Porter. The equation solver and conversion system for units of
measurement were developed by Gordon Novak. Explanation generation support
was primarily developed by Ken Barker.
Bonnie John of Carnegie Mellon University designed the user interface for
all parts of AURA.
BBN Technologies helped in the evaluation of the AURA system.
Participating in Project Halo provides a platform to study the difficulties
faced by users in employing unfamiliar knowledge bases. First, the AURA system
makes available the components for performing knowledge formulation, reasoning,
and explanation generation. The availability of these components lets me study the
thesis without having to build them myself. Second, the prototype’s performance in
achieving the goals of Project Halo is rigorously evaluated by an independent eval-
uation team that did not participate in the design and development of the AURA
system[33, 72]. Participating in this evaluation gives me access to data (e.g., knowl-
edge bases, question formulations, and user logs) I can use to study aspects of
ASKME.
1.4 Summary of Evaluation and Results
I assessed the baseline performance of ASKME on the task of answering questions
like those found in the AP exam. The question-answering task involved users posing
AP-like questions to the system to retrieve solutions from the knowledge base. The
Project Halo team chose the AP test as an evaluation criterion because it is a
widely accepted standard for testing whether a person has understood the content
of a given subject. The team chose the domains of college level biology, chemistry,
and physics because they are fundamental, hard sciences and they stress different
kinds of representations[33].
The evaluation consists of successive experiments to test if ASKME can help
novice users employ unfamiliar knowledge bases for problem solving. The initial
experiment measures ASKME’s level of performance under ideal conditions, where
the knowledge base is built and used by the same knowledge engineers. Succes-
sive experiments measure ASKME’s level of performance under increasingly realis-
tic conditions, where, ultimately, the final experiment measures ASKME’s level of
performance under conditions where the knowledge base is independently built by
subject matter experts and the users of the knowledge base are a different group of
novice users who are unfamiliar with the knowledge base.
The first experiment established the level of performance for ASKME oper-
ating under ideal conditions. These ideal conditions are similar to the way knowl-
edge base question-answering systems have traditionally been used: the people who
built the knowledge base were the ones who used it. Additionally, the set of AP-like
questions used in this evaluation was known to the knowledge engineers building the
knowledge base. The set of knowledge bases used in the experiment covered por-
tions of a science textbook in biology, chemistry, and physics. Although the ideal
conditions avoid the problem of novice users employing unfamiliar knowledge bases
for problem solving, the experiment provides a baseline for judging the contribution
of ASKME under less ideal, but typical conditions. Results from the experiment
show that ASKME increases the number of correctly answered questions by identifying
relevant information in the knowledge base for problem solving.
The second experiment presented a brittleness study to provide a fair measure
of ASKME’s ability to answer AP-like questions. The setup for this experiment is
similar to the previous one except that it uses a different set of AP-like questions that
were not made available to the knowledge engineers when they built the knowledge
base. The purpose of the brittleness study is to identify the major failures preventing
ASKME from correctly answering questions. Results from the study indicate that
failures to answer the questions correctly arise from gaps in the knowledge base,
bugs in the implementation, or unsupported question types.
I claim that ASKME’s algorithms take advantage of features of the generic
knowledge representation language, but are independent of the specific content and
organization in the knowledge bases. The third experiment tested this claim by
measuring ASKME’s performance at answering questions using knowledge bases
authored by different users. The experiment tests whether ASKME can continue
to answer questions when the original knowledge base is exchanged for one with
different content and organization. Results from the experiment show that ASKME
continues to answer questions correctly despite the change in knowledge base.
The fourth experiment evaluated my conjecture that ASKME is able to an-
swer questions that are formulated by users who do not know the ontology of the
knowledge base to which they are posed. The experiment was conducted by an inde-
pendent evaluation team[72] that did not participate in the design and development
of ASKME. Nine undergraduates were recruited to participate in this experiment. These undergraduates had little experience in knowledge representation and no familiarity with the knowledge base being queried. They underwent a four-hour training session that taught the basics of using ASKME to pose and receive answers to AP-like questions. The answers and explanations returned by ASKME were then graded by independent graders with experience at grading AP exams. Results from the experiment show that novice users can achieve correctness scores that are comparable to (and in some cases surpass) those of the knowledge engineers who built the knowledge base. The experiment shows that, with ASKME, deep familiarity with the knowledge base does not provide a big advantage, and that novice users can effectively use ASKME to interact with unfamiliar knowledge bases for the task of answering AP-like exam questions.
The fifth experiment evaluated ASKME’s level of performance when operat-
ing under realistic conditions, where the knowledge bases are independently built by
subject matter experts and the users of the knowledge base are novice users who are
unfamiliar with the knowledge base. The experiment was conducted by the same
independent evaluation team as in the previous experiment. In this experiment, a
group of subject matter experts with little training in knowledge representation and reasoning was tasked to create a set of knowledge bases for the three science domains: biology, chemistry, and physics. After the knowledge bases were built, a different group of novice users was then tasked to query the knowledge bases using ASKME
to answer a set of AP-like questions. As a control, knowledge engineers were also
tasked to use ASKME to attempt question answering with the independently-built
knowledge bases. The answers and explanations returned by ASKME were then
graded by independent graders with experience at grading AP exams. Results from
the experiment continue to show that novice users can achieve correctness scores that are comparable to (and in some cases surpass) those of the knowledge engineers who built the knowledge base.
1.5 Organization of Dissertation
The remainder of this dissertation is organized as follows:
1. Chapter 2 discusses the technical challenges facing ASKME.
2. Chapter 3 surveys related work on overcoming the difficulty in using knowledge
base systems.
3. Chapter 4 presents my approach to answering questions formulated by users un-
familiar with the knowledge base.
4. Chapter 5 describes an evaluation of ASKME’s ability to answer questions
posed by users unfamiliar with the knowledge base.
5. Chapter 6 summarizes the dissertation and lists possible areas for future re-
search.
Chapter 2
Technical Challenges
2.1 Introduction
ASKME is designed to help users avoid the difficulty of using unfamiliar knowledge
bases. This chapter describes three technical challenges faced by ASKME.
ASKME’s first technical challenge is to parse questions posed in natural lan-
guage and create a correct meaning representation of the text. Producing formal
representations from natural language poses a challenge because natural language
permits an enormous amount of expressive variation and much of what is commu-
nicated in language is not explicitly stated. A sample of the linguistic processing
challenges includes handling prepositional phrase attachment, pronoun resolution,
reference resolution, compound noun interpretation, word sense disambiguation, se-
mantic role labeling, proper noun interpretation, units of measure, comparatives,
negation, and ellipsis.
ASKME must also resolve representational differences between questions for-
mulated by the user and representations in the knowledge base. Such representa-
tional differences arise for several reasons: the knowledge bases may be built and used by different people; they may be built with an expressive ontology that permits the same meaning to be represented in multiple ways; and they may be large and complex, covering broad domains.
The third technical challenge of ASKME is to perform relevance reasoning.
First, it is well known that traditional problem solving methods such as deductive
reasoning and heuristic classification do not scale well on large knowledge bases.
Therefore, relevance reasoning is necessary for ASKME to work well with knowledge
bases covering broad and complex domains. Second, a knowledge base may return
different answers to a question depending on how it is used or how the question is formulated. Thus, heuristics are needed to select a relevant answer.
2.2 Linguistic Processing Challenges
Linguistic challenges are concerned with parsing the original sentences and creating
a correct meaning representation of the text. All the usual linguistic challenges
fall into this category. These include tasks like prepositional phrase attachment,
pronoun resolution, reference resolution, compound noun interpretation, word sense
disambiguation, semantic role labeling, proper noun interpretation, units of mea-
sure, comparatives, negation, and ellipsis. I briefly discuss four types of linguistic challenges (attachment preferences, co-reference resolution, ellipsis, and interrogatives) using the following question (these examples are inspired by an earlier report by Clark et al.[27]):

“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity stops the piece of fruit briefly before it falls back down. Calculate the acceleration at a height of 2 m during the movement upwards. Is the acceleration greater than 2 m/s^2?”
Attachment preferences
One of the first steps in natural language processing (NLP) is to parse the input
sentences to determine the syntactic structure of the text. Resolving attachment
ambiguities poses a pervasive problem in parsing natural language. A syntactic
parser often encounters phrases that can be attached to two or more different nodes
in the parse tree, and must decide which one is correct. Depending on how the
nodes are attached, there can be significant differences in meaning. For example,
a standard NLP system may return the following two syntactic parses for the first sentence of the example question.
Attachment preference #1
“A banana vp[is thrown pp[into the air] pp[with an initial velocity of
20 m/s]]. Gravity stops the piece of fruit briefly before it falls back down. Calculate
the acceleration at a height of 2 m during the movement upwards. Is the acceleration
greater than 2 m/s^2?”
Attachment preference #2
“A banana is thrown into np[the air pp[with an initial velocity of 20
m/s]]. Gravity stops the piece of fruit briefly before it falls back down. Calculate the
acceleration at a height of 2 m during the movement upwards. Is the acceleration
greater than 2 m/s^2?”
The two parses have different interpretations, depending on whether “initial velocity”
is attached to “air”, or to “throw”. Picking the correct parse automatically in a
reliable and accurate manner is challenging and requires pragmatic and contextual
knowledge to infer that “velocity” is more commonly associated with “throw” in the
context of physics problems.
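
To make this decision concrete, the following sketch (in Python; the preference table, its scores, and the function names are invented for this illustration and are not part of ASKME) ranks candidate attachment sites with a hand-built co-occurrence table standing in for the pragmatic knowledge described above:

# Illustrative attachment-preference sketch. The table below is a stand-in
# for pragmatic knowledge: in physics problems, "with a velocity" modifies
# the throwing event rather than the air.
PREFERENCES = {
    ("throw", "with", "velocity"): 0.9,
    ("air", "with", "velocity"): 0.1,
}

def attachment_score(head, preposition, noun):
    """Return a preference score for attaching pp[preposition noun] to head."""
    return PREFERENCES.get((head, preposition, noun), 0.5)

def pick_attachment(candidates):
    """Choose the candidate (head, preposition, noun) with the highest score."""
    return max(candidates, key=lambda c: attachment_score(*c))

print(pick_attachment([("throw", "with", "velocity"),
                       ("air", "with", "velocity")]))
# ('throw', 'with', 'velocity'), i.e., attachment preference #1

A real system would derive such preferences from corpus statistics or from constraints in the knowledge base rather than from a fixed table.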
Co-reference resolution
An important subtask in natural language processing is to determine if two ex-
pressions in natural language refer to the same entity in the world. This task is
commonly known as co-reference resolution.
“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity
stops the piece of fruit briefly before it falls back down. Calculate the acceleration
at a height of 2 m during the movement upwards. Is the acceleration greater than 2
m/s^2?”
An example of co-reference resolution is the task of resolving what “it” refers to in
our example question. Does “it” refer to “gravity” or “the piece of fruit”? Performing
this task requires commonsense knowledge that “gravity” does not fall and that a
“piece of fruit” can fall. Similarly, commonsense knowledge is necessary to realize
the intended co-reference between “banana” and “piece of fruit”.
“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity
stops the piece of fruit briefly before it falls back down. Calculate the acceleration
at a height of 2 m during the movement upwards. Is the acceleration greater than 2 m/s^2?”
Another example of co-reference resolution is to resolve references for “the acceler-
ation” and “the movement” in the example question. One approach is to simply
associate both references with the Throw event; however, this is not ideal because a Throw
consists of two subevents (e.g., Propel and Let-Go-Of) happening sequentially:
first an object is propelled, then the object is released. The ideal resolution is to
associate “the acceleration” and “the movement” with the Propel event that is
a subevent of Throw. To perform this successfully, NLP systems must infer the
presence of a Propel event after the Throw. This type of co-reference resolution
is also sometimes referred to as “indirect anaphora resolution”. The challenge here
is not only having commonsense knowledge that a Propel follows a Throw, but
also the mechanisms to apply commonsense to language interpretation to resolve
such references.
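
The following sketch (in Python; the vocabulary sets and the single selectional restriction are invented for this illustration and are not ASKME's mechanism) shows how a commonsense constraint of this kind can filter candidate antecedents for a pronoun:

# Illustrative pronoun-resolution sketch: candidates incompatible with the
# verb's commonsense selectional restriction are filtered out.
CAN_FALL = {"banana", "piece of fruit"}   # physical objects can fall
CANNOT_FALL = {"gravity"}                 # forces do not fall

def resolve_pronoun(candidates, verb):
    """Return the candidate antecedents compatible with the verb."""
    if verb == "fall":
        return [c for c in candidates if c in CAN_FALL]
    return list(candidates)

# "Gravity stops the piece of fruit briefly before it falls back down."
print(resolve_pronoun(["gravity", "piece of fruit"], "fall"))
# ['piece of fruit']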
Ellipsis
Much of what is communicated in language is not explicitly stated, which makes
extracting content from natural language difficult. Such incomplete utterances range
from sentences that fail to include all requisite semantic information to syntactically
incomplete sentence fragments.
“A banana is thrown into the air [travelling upwards] with an initial
velocity of 20 m/s. Gravity stops the piece of fruit briefly before it falls back down.
Calculate the acceleration at a height of 2 m [above ground, whose default
height is 0 m] during the movement upwards. Is the acceleration greater than 2
m/s^2?”
Our example question contains two missing arguments that need to be identified.
The NLP system has to recognize “thrown into the air” to mean “travelling up-
wards” and that “heights” are by default above the ground (which also has a de-
fault height of 0 m). Humans perform well at this task because they can leverage a
vast amount of commonsense knowledge to help resolve ambiguities and fill in the
missing arguments. This is not the case for computers. Automatically identifying these missing arguments requires a mixture of linguistic knowledge and common-
sense. Precisely how this should be done is a difficult problem for natural language
systems.
Question types
ASKME’s design goal is to be capable of answering a wide variety of questions. This
is a very challenging goal, considering that the expressiveness of language admits a
wide variety of questions differing in form and function.
“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity
stops the piece of fruit briefly before it falls back down. Calculate the acceleration
at a height of 2 m during the movement upwards. Is the acceleration
greater than 2 m/s^2?”
Although questions are typically posed using interrogative sentences containing cues
such as question marks or wh-words (e.g., who, what, when, where, why, and how),
it is not sufficient to simply use these criteria to identify and categorize questions.
Some expressions, such as “Would you pass the ball?”, have the grammatical form
of questions but actually function as requests for action, not for answers. Further,
questions may be stated without using wh-words. For example, questions may
be phrased as imperative sentences expressing commands such as “Calculate the
acceleration at a height of 2 m.” as opposed to “What is the acceleration at a
height of 2 m?”, or they may be posed as a comparative such as “Is the acceleration
greater than 2 m/s^2?”. Therefore, systems that depend heavily on the use of question
marks or wh-words for clues may have difficulty handling a wide variety of question
types.
Without doing a good job of analyzing the types of questions and the answers
expected, it is difficult to build a system to process questions and identify relevant
information in the knowledge base for answering them.
2.3 Resolving Representational Differences
Representational differences (e.g., the same meaning represented differently) be-
tween questions and the knowledge base are common in Project Halo because:
1. the knowledge bases are built by subject-matter experts and are intended to
be used by a different group of non-experts.
2. the knowledge bases are built by extending an expressive ontology, which per-
mits the same content to be expressed in multiple ways.
3. the knowledge bases are large and complex, covering broad domains.
Resolving representational differences is important for achieving good ques-
tion answering performance in Project Halo. I describe three types of representa-
tional differences that are commonly encountered in Project Halo:
1. granularity differences
2. viewpoint differences
3. modeling differences
Granularity differences
Consider the biology question “How many chromosomes are in a cell?”. The ques-
tion can be answered by looking up the Human-Cell concept and returning the
number of chromosomes. However, the assertion that the human cell has 46 chro-
mosomes might be encoded in the knowledge base as either:
Simpler Representation:
Human-Cell -has-part-> (46 Chromosome)

or

Richer Representation:
Human-Cell -has-part-> Nucleus -has-part-> DNA -has-part-> (46 Chromosome)
Question #1: An entity contains chloroplast and cell walls. What is this entity?
Question #2: What is the agent of an event consuming carbon dioxide and producing oxygen?

Figure 2.1: Two questions answered using different viewpoints of the Plant concept shown in Figure 2.2.
Without resolving granularity differences, the automated reasoner answers the question correctly using the simpler representation, but fails using the richer representation.
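
One way to tolerate this particular difference is to treat has-part as transitive when answering part-counting questions. The sketch below (in Python; the dictionary encoding and helper name are mine, not the actual reasoner) returns 46 for both encodings by following has-part chains:

# Illustrative granularity-tolerant lookup: count target parts reachable
# through chains of has-part links, multiplying cardinalities along the way.
SIMPLE_KB = {"Human-Cell": {"has-part": [("Chromosome", 46)]}}

RICH_KB = {
    "Human-Cell": {"has-part": [("Nucleus", 1)]},
    "Nucleus":    {"has-part": [("DNA", 1)]},
    "DNA":        {"has-part": [("Chromosome", 46)]},
}

def count_parts(kb, frame, target):
    """Count instances of target reachable from frame via has-part."""
    total = 0
    for part, n in kb.get(frame, {}).get("has-part", []):
        total += n if part == target else n * count_parts(kb, part, target)
    return total

print(count_parts(SIMPLE_KB, "Human-Cell", "Chromosome"))  # 46
print(count_parts(RICH_KB, "Human-Cell", "Chromosome"))    # 46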
Viewpoint differences
I describe two types of viewpoint differences.
First, a concept can have different viewpoints. For example, the Plant
concept (shown in Figure 2.2(a)) can be viewed as a container for multiple or-
ganelles such as the Chloroplast, Vacuole, and Ribosomes, enclosed inside
a Cell-Wall. This same concept can also be viewed as an organism that con-
verts light and carbon dioxide into chemical energy, carbohydrates, and oxygen.
Although both views describe the same concept, their representations include dif-
ferent features. Recognizing different viewpoints is necessary to identify portions of
a concept relevant to answering questions.
Figure 2.1 lists two questions answered using different views of the Plant
concept. To answer the first question “An entity contains chloroplast and cell walls.
What is this entity?”, ASKME has to identify the container viewpoint of Plant
shown in Figure 2.2(b). To answer the second question “What is the agent of an
event consuming carbon dioxide and producing oxygen?”, ASKME has to identify
the consumer/producer viewpoint of Plant shown in Figure 2.2(c).
(a) Representation for the Plant concept
(b) The parts of the graph highlighted in bold show the viewpoint describing Plant as containing different organelles
(c) The parts of the graph highlighted in bold show the viewpoint describing Plant as a consumer-producer. The plant consumes light and CO2 to produce O2, carbohydrates, and chemical energy

Figure 2.2: The Plant concept and its two viewpoints
A second type of viewpoint difference arises when information to answer ques-
tions is represented contextually. The frame-based representation used in Project
Halo provides a convenient representation for contextual information within the
<frame slot value> paradigm[2]. This representation minimizes the proliferation
of frames corresponding to concepts that are important in very limited contexts,
without requiring additional rules and reasoning mechanisms. To achieve good
performance using knowledge bases created in Project Halo, it is necessary to use
information represented contextually. Consider the question “What is the state of
water (H2O)?”. There can be multiple answers to the question depending on the different contexts in which H2O is referred to. For example, the state of water can be a gaseous substance when it is the result of the reaction C2H4 + 3O2 −→ 2CO2 + 2H2O, or a liquid substance when it is the result of the reaction CaOH2 + 2HI −→ CaI2 + 2H2O. Figure 2.3 lists other examples in physics and biology where the use of contextual information yields different answers.
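
The sketch below (in Python; the frame and slot names are invented for this illustration, not the Halo knowledge bases) conveys the idea of context-sensitive lookup: the state of H2O is read from the enclosing reaction frame when one is given, and falls back to the context-free value otherwise.

# Illustrative context-sensitive slot lookup in a <frame slot value> style.
KB = {
    "H2O": {"state": None},  # no context-free state value
    "Ethylene-Combustion":  {"result": {"H2O": {"state": "gas"}}},
    "CaOH2-HI-Reaction":    {"result": {"H2O": {"state": "liquid"}}},
}

def state_of(substance, context=None):
    """Prefer the value asserted within the given context, else the global one."""
    if context is not None:
        contextual = KB[context]["result"].get(substance, {})
        if contextual.get("state") is not None:
            return contextual["state"]
    return KB[substance]["state"]

print(state_of("H2O"))                        # None
print(state_of("H2O", "Ethylene-Combustion")) # gas
print(state_of("H2O", "CaOH2-HI-Reaction"))   # liquid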
Modeling differences
The users in Project Halo make many modeling decisions during knowledge forma-
tion and question formulation. There are many opportunities for modeling differ-
ences because the representations are built using an expressive ontology to cover
broad domains. Resolving modeling differences is difficult, as it requires the users of
the knowledge base to be aware of the modeling decisions made during knowledge
acquisition. I give two examples of modeling differences.
Figure 2.4 shows the first example, a physics question whose representation differs from the expected representation in the knowledge base.
The question representation contains two modeling differences with the knowl-
edge base. First, the question was described as a single Move event, while the
knowledge base contains a richer representation, where the Move is caused by an
Exert-Force. Second, the query in the question is on the force slot, while the necessary equations to answer the question are stored on the net-force slot in the knowledge base.
Biology example
Question: Is it true that interphase occurs before mitosis?
Answer: No. But it is true in the context of Mitotic-Cell-Cycle. In a mitotic cell cycle, interphase and mitosis are subevents, and interphase occurs before mitosis.

Chemistry example
Question: What is the state of water (H2O)?
Answer: Typically H2O does not have a state value.
H2O is gaseous in the context of the reaction C2H4 + 3O2 −→ 2CO2 + 2H2O
H2O is a liquid in the context of the reaction CaOH2 + 2HI −→ CaI2 + 2H2O

Physics example
Question: What is the acceleration of a move?
Answer: Acceleration typically does not have a value.
In the context of a Free-Fall, acceleration is 9.8 m/s^2
In the context of an Up-Down-Free-Fall, acceleration is −9.8 m/s^2
In the context of a Motion-with-Force, acceleration = force/mass

Figure 2.3: Questions having multiple answers using different information contexts in the knowledge base.
The question goes unanswered when these modeling differences between the question and the knowledge base remain unresolved.
In a second example of modeling differences, experts creating knowledge
bases often introduce bias and assumptions in their representations. Questions go unanswered when assumptions expected by the knowledge base are not stated in
the question. In Figure 2.5, the sentences rendered in italics introduce commonly
assumed values for the initial x position, initial y speed, and final y position necessary
to infer an answer on a typical physics knowledge base.
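
A scenario-elaboration step can repair this second kind of difference by installing commonly assumed values into slots the user left unbound, turning a version-1 question into a version-2 question (see Figure 2.5 below). The sketch that follows (in Python; the slot names and default values are illustrative assumptions, not the actual Halo representation) leaves stated values untouched:

# Illustrative assumption installation: unbound slots receive commonly
# assumed defaults; slots the user stated are left alone.
PROJECTILE_DEFAULTS = {
    "initial-y-speed":    (0.0, "m/s"),
    "initial-x-position": (0.0, "m"),
    "final-y-position":   (0.0, "m"),
}

def elaborate(question_frame):
    """Fill unbound slots with domain defaults."""
    for slot, default in PROJECTILE_DEFAULTS.items():
        question_frame.setdefault(slot, default)
    return question_frame

throw = {"vertical-distance": (30.0, "m"), "horizontal-distance": (100.0, "m")}
print(elaborate(throw))
# the two stated distances are kept; the three defaults above are added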
2.4 Relevance Reasoning
I describe two scenarios in which relevance reasoning is useful. The first scenario is
for yielding good performance.
Knowledge base systems typically suffer from poor performance as the size
of the knowledge base grows. It is well known that traditional inference methods
such as deductive reasoning and heuristic classification do not scale well on large
knowledge bases. During the Halo pilot, the chemistry question-answering system
developed using the CYC knowledge base took approximately 27 hours to answer
the chemistry question set[44].
In Project Halo, the systems are intended for a wide variety of users in
a production environment. Performance is important, as these systems have to
be sufficiently responsive to captivate and engage the user. The AURA system
is designed to return an answer to a majority of questions within 60 seconds[33].
Therefore, relevance reasoning is important to ensure that the knowledge relevant
to answering questions is found quickly and correctly. This is especially important
for large knowledge bases covering broad and complex domains.
Figure 2.4: The question representation contains two modeling differences with the knowledge base. First, the question was described as a single Move event, while the knowledge base contains a richer representation, where the Move is caused by an Exert-Force. Second, the query in the question is on the force slot, while the necessary equations to answer the question are stored on the net-force slot in the knowledge base.
Question (Version 1)
A water balloon is thrown.
The vertical distance of the throw is 30 meters.
The horizontal distance of the throw is 100 meters.
What is the duration of the throw?

Question (Version 2)
A water balloon is thrown.
The vertical distance of the throw is 30 meters.
The horizontal distance of the throw is 100 meters.
The initial y speed of the throw is 0 m/s.
The initial x position of the throw is 0 meters.
The final y position of the throw is 0 meters.
What is the duration of the throw?

Figure 2.5: This question highlights the need to automatically install unstated assumptions commonly present in questions. The sentences in version 2 that are rendered in italics introduce commonly assumed values for the initial y speed, initial x position, and final y position.
The second scenario in which relevance reasoning is useful is for returning
relevant answers. A knowledge base may return multiple answers to a question
because different subsets of the knowledge base may be used to answer the question.
Some subsets may return incorrect answers, making the task of finding appropriate
subsets to infer the correct answer challenging. Consider the question:
“A parachutist’s friend pushes him down from an airplane with an initial
velocity of 2 m/s. How far did he fall after 10 seconds?”.
In Figure 2.6, different answers to the question were returned using the same
knowledge base. One of them is incorrect because a subset of the knowledge base
assumed acceleration to be 9.8 m/s^2 instead of -9.8 m/s^2.
Relevance reasoning may also be used to return relevant answers to open-
ended questions. Such questions are common in the biology and chemistry domains.
Question
A parachutist's friend pushes him down from an airplane with an initial velocity of 2 m/s. How far did he fall after 10 seconds?

Derived answer #1 (Incorrect)

Answer
h = 470 m

Explanation
free-fall : An event in which an object is
in free fall at 9.8 m/s^2 (i.e. the typically
assumed acceleration due to Earth's gravity).

Assumptions:
* We assume upward positive.

Given:
* a = 9.8 m/s^2
* theta2 = -90.0 deg
* v2 = -2 m/s
* t = 10 s
* u = -2 m/s
* g = 9.8 m/s^2
* h = u * t + ((1 / 2) * g) * t^2
* s = h

Solving for s ...
s = h
h = u * t + ((1 / 2) * g) * t^2
.:. h = 470 m
.:. s = 470 m

Therefore, the distance of the fall (h) = 470 m

Derived answer #2 (Correct)

Answer
h = -510 m

Explanation
up-down-free-fall : A move of an object in a
gravity field of 9.8 m/s^2. The object in
free fall may have an initial upward velocity
(by virtue of being thrown, fired, etc),
followed by an apex velocity of 0, followed
by a downward velocity.

Given:
* a = -9.8 m/s^2
* theta2 = -90.0 deg
* v2 = -2 m/s
* t = 10 s
* u = -2 m/s
* g = -9.8 m/s^2
* h = u * t + ((1 / 2) * g) * t^2
* s = h

Solving for s ...
s = h
h = u * t + ((1 / 2) * g) * t^2
.:. h = -510 m
.:. s = -510 m

Therefore, the distance of the fall (h) = -510 m

Figure 2.6: This example highlights the difficulty in finding relevant information in the knowledge base to return correct answers. Two answers to the question are shown. Both answers were returned by the problem solver using the same knowledge base. One of them is incorrect because it assumed acceleration to be 9.8 m/s^2 instead of -9.8 m/s^2.
Question
What is the function of lysosomes?

Answer
A lysosome plays the role of a container.

Desired Answer
A lysosome plays the role of a container, ..., and is an instrument enclosing Digestive-Enzymes.

Figure 2.7: Using a typical biology knowledge base, the problem solver describes Lysosomes as playing the role of a container. Ideally, the returned answer should describe the specific container role played by Lysosomes (rendered in italics).
There can be different answers depending on the amount of detail expected by the
user. These expectations are often assumed and seldom captured in the question
formulation. The example listed in Figure 2.7 illustrates the problem for the question
“What is the function of lysosomes?”. A naive answer to the question would be to
simply describe lysosomes as containers, which is an accurate but incomplete
answer in this context. Ideally, the answer should also describe the container role
played by lysosomes as an instrument enclosing digestive enzymes (rendered in italics
in Figure 2.7).
2.5 Summary
I study ASKME as part of Project Halo where the builders and the users of the
knowledge base are different. Project Halo’s vision is to develop a computer program
capable of answering novel questions in a broad range of scientific disciplines[9, 114].
The types of questions targeted in Project Halo are unlikely to be answered using search
or database lookup because the answers to these questions are not stored for simple
retrieval in the knowledge base. Answering these questions requires the user to
interact with the knowledge base for problem solving. To perform this interaction
successfully, a user has to learn the terms and relations in the knowledge base,
the valid constructions for those terms and relations, and must identify the most
appropriate representation in the knowledge base for problem solving. Therefore,
users face a steep learning curve to become proficient at using a knowledge base for
problem solving.
Project Halo’s success requires that ASKME help users overcome the diffi-
culty of using unfamiliar knowledge bases for problem solving. ASKME’s role is to
process questions as they are stated in English and identify relevant information in
the knowledge base for problem solving. This requires that ASKME overcome three
technical challenges.
The first technical challenge is to parse the original sentences and create
a correct meaning representation of the text. All the usual linguistic challenges
arise. These include tasks like prepositional phrase attachment, pronoun resolution,
reference resolution, compound noun interpretation, word sense disambiguation,
semantic role labeling, proper noun interpretation, quotations, parentheticals, units
of measure, comparatives, negation, temporal reference, and ellipses.
The second technical challenge is to resolve representational differences when
the same piece of knowledge is represented differently. There are a variety of rea-
sons why such representational differences are common in Project Halo. First, the
knowledge bases are built and used by different people. Second, the knowledge bases
are built by extending an expressive ontology, which permits the same content to
be expressed in multiple ways. Third, the knowledge bases are large and complex,
covering broad domains. Thus, resolving representational differences is necessary to
achieve good question answering performance in Project Halo.
The third technical challenge is to perform relevance reasoning. In Project
Halo, the systems are to be used by a wide variety of users to query large, complex
knowledge bases that cover broad domains. These knowledge bases perform poorly
with traditional problem solving methods and may even return different answers to a
question when different subsets of the knowledge base are used for reasoning. Some
of these answers may be incorrect or lack the details desired in a good answer. Thus,
relevance reasoning is necessary to help the problem solver focus on the relevant
portions of the knowledge base so as to yield good runtime performance and answers.
Chapter 3
Related Work
Figure 3.1: Different generations of knowledge-based systems and their contributions to question-answering
An important goal of research on knowledge-based systems is building sys-
tems that allow novice users to pose questions and receive answers, using knowledge
bases encoded by others. Work in knowledge based systems can be categorised into
four generations of systems, as shown in Figure 3.1.
Knowledge Based Systems | Challenges addressed
BASEBALL (1961), LUNAR (1972) | Answered questions posed by novice users by extracting information stored in databases.
MYCIN (1970s), XCON/R1 (1978) | Answered questions by carrying out reasoning with domain knowledge acquired from subject matter experts with the help of knowledge engineers.
Cyc (1984 - Present), ThoughtTreasure (1998), OpenMind (2002) | Large repositories of commonsense knowledge that can be used for question answering.
Systems built during HPKB (1998) | Demonstrated systems that can be quickly built and usefully applied to question-answering tasks by knowledge engineers.
Systems built during RKF (2002) | Demonstrated systems that SMEs can use to author knowledge bases that can be usefully applied to question-answering tasks with little help from KEs.
Project Halo (2003 - Present) | Aims to demonstrate systems that SMEs can use to author knowledge bases that can answer questions posed by a different set of users unfamiliar with knowledge representation or the knowledge base being queried.

Table 3.1: Summary of notable knowledge based systems and the challenges addressed by them
The earliest systems responded to questions using answers explicitly stored in databases. The next generation of
systems added inference rules to reason with domain expertise, and later genera-
tions involved efforts in scaling up to larger knowledge bases by adding commonsense
knowledge and facilitating domain experts in authoring the knowledge themselves.
More recently, efforts such as Project Halo have focused on mitigating the brittleness
in knowledge base interaction caused by the arm’s-length separation between the
experts who author the knowledge bases and the novice users posing questions to
them. Table 3.1 gives an overview of the different challenges addressed by notable
knowledge based systems of the past.
Natural language interfaces to databases. Early question-answering systems
focused on helping novice users access information stored in databases. The task of
extracting relevant information from a database could be difficult for novice users,
as it required familiarity with the contents of the database and other programming
tricks necessary for extracting the data. Typical natural language interfaces to
databases work by either pattern matching keywords in a user’s question with terms
in the database or by translating the question into a particular set of commands
to navigate the information in the database. Two effective and notable natural
language frontends to databases were the BASEBALL[51] and LUNAR[118] sys-
tems. These systems demonstrated good performance in handling questions posed
by novice users to access information stored in databases, without requiring users
to learn the structure of the database or a specialised language for querying it. The
LUNAR system answered questions about the geological analysis of rocks returned
by the Apollo moon missions. It demonstrated particularly good performance at a
lunar science convention in 1971, answering questions posed by geologists untrained
on the system. A good summary describing database accessor systems is found in
Androutsopoulos et al.[4].
Reasoning with domain expertise. The utility of early question answering
systems was limited to answering questions whose answers were explicitly stored in
the database. Thus, questions that required reasoning with domain expertise could not be answered by these database accessor systems. The next generation of question
answering systems, called expert systems, answered questions by reasoning with the
domain knowledge and heuristics used by real experts. Expert systems mimic the
problem solving behavior of experts and are typically used to solve problems that
do not have a single correct solution that can be encoded in a conventional algo-
rithm or stored in a database. The knowledge base in an expert system is narrowly
focused and engineered to perform well on a class of questions predetermined at
design time. Acquiring the domain expertise is difficult and domain experts work
alongside knowledge engineers to encode “rules of thumb” on how to evaluate and
solve pre-determined problems. Novice users interact with the system using a user
interface that restricts the questions that can be posed to the system. These sys-
tems predetermine the logical forms for the restricted set of questions at design time.
When users pose questions via the user interface, the required logical forms are then
generated and used to derive the answers to the queries. The MYCIN (for diagnos-
ing blood diseases and recommending antibiotics)[17] and XCON/R1 (for computer
equipment order processing)[71] systems are classic examples of successful expert
systems.
Knowledge Acquisition. Building an expert system is generally expensive and
tedious, as it requires knowledge engineers to work closely with domain experts. The
next generation of systems tried to address the problem by reducing the cost and
complexity in building knowledge bases. For example, the systems built during the
Defense Advanced Research Projects Agency (DARPA)’s High-Performance Knowl-
edge Base (HPKB)[32] project demonstrated tools to quickly build large, broadly
applicable knowledge based systems by reusing content authored by different knowl-
edge engineers¹. Further reducing the time and effort in building large knowledge
based applications will require subject matter experts (SME) to enter knowledge
directly. This was a goal of DARPA’s Rapid Knowledge Formation (RKF)[109]
project and it demonstrated systems SMEs can use to quickly develop knowledge
bases applicable to question answering tasks. In the HPKB and RKF projects, the
functional performance of the developed systems was evaluated by measuring how
well the knowledge bases answered a set of unseen questions. The questions were
posed by the same users authoring the knowledge base, using a set of pre-determined
domain specific question templates. The completed question templates were then
used to generate detailed logical forms that were in turn used by the reasoner to
answer the question. More details on the HPKB and RKF evaluations are described
in Schrag et al.[102] and Clark et al.[28], respectively.
Question-Answering. Arguably, the performance of the systems built during
the HPKB and RKF projects would have suffered if a different set of users, unfa-
miliar with knowledge representation or the knowledge base being queried, posed
the questions. The separation between the builders and the users of the knowledge
bases continues to be a source of brittleness in the interaction between knowledge
bases authored by experts and their novice users. Systems such as ISAAC²[86] and
MECHO[18, 19] have previously demonstrated that systems could correctly answer
certain types of high school level physics questions as originally stated in the exams.
Both systems, however, are likely to perform poorly at answering questions like
those found on an AP exam using knowledge bases authored by other experts. The MECHO and ISAAC systems are brittle because they were engineered to work well for narrow domains and cannot be easily adapted to query other knowledge bases.

¹ In a later effort similar to HPKB, the Project Halo pilot phase also demonstrated systems that were collaboratively built by different knowledge engineers. The competing systems attempted previously unseen Advanced Placement (AP) exam questions in the chemistry domain and attained AP scores similar to the mean human score. These systems demonstrated that knowledge based systems can answer unseen complex questions using complex knowledge bases that are quickly built in a short amount of time (4 months).

² The ISAAC system was later extended to include diagrams as an additional modality for stating questions[87].
Commonsense Knowledge. Newell and Ernst[82] observed that one reason that
knowledge based systems are brittle is their lack of general commonsense knowledge
about the world. That is, as knowledge engineers begin to debug why questions
or problem solving tasks fail, they find the need to add commonsense, rather than
domain-specific, knowledge. For instance, one source of brittleness faced by the ISAAC system is the difficulty of inferring unstated assumptions from a question. To answer questions about projectile motion, for example, it is often
necessary to assume that air resistance can be ignored, that the projectile is near
Earth and that the only gravitational force acting on the projectile is due to Earth.
Thus, commonsense knowledge about the world is necessary to achieve good perfor-
mance for a variety of knowledge based applications. Several projects are formalizing
large commonsense knowledge bases to improve the performance of knowledge based
systems. These efforts include CYC[69], ThoughtTreasure[78], and the OpenMind
project[105]. The challenge here is not only building a large commonsense knowl-
edge base, but also identifying the mechanisms necessary for applying commonsense
to various problem solving tasks to improve overall system performance.
ASKME. Question answering tasks have become increasingly challenging in re-
cent years. Annual competitions for information extraction and information retrieval
have “begun to require direct answers and their explanations instead of returning
factoids or extracting relevant passages from a text” [24, 70]. Answering and ex-
plaining the solutions to these questions will require the use of domain knowledge
and deeper reasoning capabilities commonly found in knowledge based applications.
Our work advances the state of the art in knowledge based question answering by
mitigating the brittleness due to the arm's-length separation between the builders
and the novice users of the knowledge base.
Chapter 4
Approach
ASKME helps users with different levels of domain expertise use unfamiliar knowl-
edge bases for problem solving. ASKME is intended to work well in a variety of
domains and knowledge bases, each having its own ontology. ASKME is designed
to answer questions as they are posed in natural language. To make the task eas-
ier for computers, ASKME requires users to formulate their questions in restricted
English. The four major pieces of ASKME are: (a) a version of restricted English (Computer Processable Language), (b) a domain-neutral ontology (the Component Library), (c) mechanisms to handle a closed set of well known question types, and (d) a program, called the question mediator, that identifies relevant information in the knowledge base for problem solving. Taken together, these four pieces function (metaphorically) as a funnel that distills the original question into a simpler form that can be more easily processed by the computer for problem solving.
4.0.1 Restricted English
Producing formal representations from natural language is very difficult because
natural language permits an enormous amount of expressive variation. This ex-
pressiveness allows different writers to use a variety of styles and grammatical con-
structions to express the same meaning. Although it would be an easy task for
human readers to reconcile different representations that have the same meaning,
it is not so easy for computers. This difficulty stems from the fact that much of
what is communicated in language is not explicitly stated and computers perform
poorly at extracting content from the language. Humans perform well at this task
because they can leverage a tremendous amount of commonsense knowledge to per-
form a variety of linguistic processing tasks (e.g., resolve ambiguities, determine
anaphoric reference, fill in ellipses). In contrast, machines continue to lack com-
monsense knowledge as well as the tools to usefully apply commonsense to guide
the language understanding process.
For question answering, the Project Halo team chose to support a simpler
version of English to avoid difficult problems in natural language processing. Con-
sequently, ASKME places special restrictions on grammar, style, and vocabulary
usage that are based on well established writing principles. These restrictions have
the effect of improving the consistency and readability of text by countering the
tendency of writers to use unusual or overly specialized language constructions that
are difficult for the computer to process. For convenience, Table 4.1 lists verbatim (as originally produced by Kittredge[61]) some of the known properties of restricted
English that are important to making natural language input easier to process by
computers.
4.0.2 Domain-Neutral Ontology
A key step in interacting with knowledge bases is to select the appropriate entries in a
knowledge base for problem solving. A history of economic importance has endowed
English with an enormous vocabulary (e.g., the Oxford English Dictionary contains
over 300,000 words). In contrast, knowledge bases typically contain a handful to a few hundred concepts relevant to a domain.
Property
1. restricted lexicon (and possibly including special words not used elsewhere in the language)
2. a relatively small number of distinct lexical classes (e.g., nouns or nominal phrases denoting <body part>) which occur frequently in the major sentence patterns
3. restricted sentence syntax (e.g., some sentence patterns found in literature seem to be rare in scientific or technical writing)
4. deviant sentence syntax that are unusual in the standard language
5. restricted word co-occurrence patterns which reflect domain semantics (e.g., verbs take limited classes of subjects and objects; nouns have sharp word-class restrictions on their modifiers)
6. restricted text grammar (e.g., stock market summaries typically begin with statements of the overall market trend, followed by statements about sectors of the market that support and go against the trend, followed by salient examples of stocks which support or counter the trend)
7. different frequency of occurrence of words and syntax patterns from the norm for the whole language – each controlled language has its own statistical profile, which can be used to help set up preferred interpretations for new texts.

Table 4.1: Some of the known properties of restricted English that are important to making natural language input easier to process by computers (originally produced by Kittredge[61]; reproduced here for convenience).
This selection task involves mapping the many ways of expressing something
in English into meaningful representations in the knowledge base. Automating this
selection task is complicated because what is written in English often does not fit
into the target ontology of the knowledge base. Some of the reasons include (1)
an incomplete mapping from English words to terms and relations in the knowl-
edge base, (2) a literal interpretation that violates constraints in the ontology, or
(3) the knowledge cannot be expressed in the underlying knowledge representation
language.
ASKME uses a domain-neutral ontology to overcome the technical difficulty
of choosing appropriate entries in the knowledge base to create relevant meaning
representations. Its design is influenced by experience in controlled languages and
lexicography in which small vocabularies have been found to be easy to learn and
are sufficient to express a wide variety of knowledge without using uncommon terms
unfamiliar to many users[1, 96, 112, 116].
For example, although the Longman Dictionary of Contemporary English
contains 45,000 entries and 65,000 word senses, all the definitions in the dictionary
are defined by a vocabulary of about 2000 words[95].
ASKME’s domain-neutral ontology must be satisfactory in three areas: cov-
erage, access, and semantics. First, the ontology has to have broad coverage because
ASKME is intended to be domain independent. To accomplish this, we need not
enumerate all the terms in the English vocabulary and install one-to-one mappings
between words and entries in the knowledge base. Indeed, doing so would just make
the knowledge base huge and difficult to use. Instead, we desire a small set of well
known terms and relations that are commonly employed by everyday users. From
this small set of terms and relations, a wide variety of meaning can be represented
via specialization and composition to achieve good coverage.
Second, the ontology has to be intuitive and accessible because its intended
users are accustomed to expressing their questions with English. Ideally, the knowl-
edge base should not contain so many entries that a distinct mapping cannot be
identified. The senses for a word should be easily identifiable and highly distin-
guishable from alternative choices, to avoid having multiple concepts that are closely
related but differ only in subtle ways. Such concepts will greatly burden the
mapping process and yield poor results.
Third, the terms of the ontology should be grounded with deep semantics for
reasoning with an automated reasoner. Each concept should be richly axiomatized
to encode the meaning of the component as well as how it interacts with other
components. These axioms should also be general enough to be reusable for encoding
detailed domain knowledge, and they should be written in such a way that they are
consistent with the axioms of other concepts during composition.
4.0.3 Question Taxonomy
My long-term research goal is to create systems capable of answering a wide vari-
ety of questions. This is a very challenging goal, considering that the expressiveness of language admits a wide variety of questions that differ in form and function.
Although questions are typically posed using interrogative sentences contain-
ing cues such as question marks or wh-words (e.g., who, what, when, where, why,
and how), it is not sufficient to simply use these criteria to identify and categorize
questions. First, some expressions, such as “Would you pass the ball?”, have the
grammatical form of questions but actually function as requests for action, not for
answers. Second, questions may be stated without using wh-words. For example,
questions may be phrased as imperative sentences expressing commands such as
“Calculate the acceleration at a height of 2 m.” as opposed to “What is the accel-
eration at a height of 2 m?”, or they may be posed as a comparative such as “Is
the acceleration greater than 2 m/s^2?”. Therefore, systems that depend heavily on
the use of question marks or wh-words for clues may have difficulty handling a wide
variety of question types.
A good taxonomy of types of questions should meet three criteria. First, it
has to have broad coverage because ASKME is intended for use in a variety of do-
mains. Ideally, ASKME achieves broad coverage using a small set of question types
that are commonly used in everyday communication. From this small set of ques-
tion types, users can effectively state their question intent (via multiple questions
or through composition) to retrieve answers from the knowledge base.
Second, the set of supported questions has to be intuitive and accessible to
everyday users who are accustomed to expressing their questions with English.
Third, the set of supported questions should be grounded with deep seman-
tics for reasoning with an automated reasoner. These semantics should be well
understood and easy to implement. Each question type should be richly axioma-
tized to encode the meaning of the question, as well as how it interacts with the
problem solver and the knowledge base. These axioms should also be general enough
to be reusable for a variety of domains and different knowledge bases.
4.0.4 Question Mediator
The question mediator’s goal is to synthesize a “minikb” containing information rele-
vant to inferring an answer. Given the initial minikb containing only the information
(triples) originally stated in the question, the question mediator incrementally ex-
tends the minikb with frames (domain concepts) drawn from the knowledge base.
The frames include both domain assertions and inference methods. The mediator
succeeds if it constructs a minikb that is sufficient to answer the question.
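
The following sketch (in Python) conveys the overall shape of this process; it is a simplification in which answerable and candidate_frames are trivial stand-ins for the reasoner and the relevance heuristics, not the actual implementation:

# Illustrative minikb construction: grow a set of triples from the question
# until the query becomes answerable, drawing in one frame at a time.
def answerable(minikb, query):
    """Stand-in: the query is answerable once its slot appears in the minikb."""
    return any(triple[1] == query for triple in minikb)

def candidate_frames(minikb, kb):
    """Stand-in: frames sharing at least one term with the minikb."""
    terms = {t for triple in minikb for t in triple}
    return [f for f, triples in kb.items()
            if any(term in terms for triple in triples for term in triple)]

def mediate(question_triples, kb, query, max_steps=50):
    """Extend a minikb with frames from the kb until it answers the query."""
    minikb = set(question_triples)
    for _ in range(max_steps):
        if answerable(minikb, query):
            return minikb
        frames = [f for f in candidate_frames(minikb, kb)
                  if not kb[f] <= minikb]   # skip frames already drawn in
        if not frames:
            break
        minikb |= kb[frames[0]]             # draw in one frame's triples
    return None                             # no sufficient minikb found

kb = {"Move":        {("Move", "caused-by", "Exert-Force")},
      "Exert-Force": {("Exert-Force", "net-force", "mass * acceleration")}}
question = {("Move", "object", "bicycle"), ("Move", "distance", "10 m")}
print(mediate(question, kb, query="net-force") is not None)  # True

A real mediator would rank the candidate frames by relevance rather than taking the first one, as discussed below.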
For ASKME to be useful for novice users interacting with unfamiliar knowl-
edge bases, it has to work well in a variety of domains and different knowledge
bases. The design of the question mediator must, therefore, satisfy two conditions. First, because there are many opportunities for the same meaning to be represented differently when knowledge bases are built and used by different groups of users, the question mediator must have mechanisms to resolve representational differences that may exist between question formulations and the knowledge base. Second, because different subsets of the knowledge
base may be used to answer a question, the question mediator may have to ex-
plore a large portion of the knowledge base before identifying the relevant piece of
knowledge for problem solving. For ASKME to perform well on knowledge bases
of varying sizes, the question mediator should give priority to information in the
knowledge base that is highly relevant to answering the question.
4.1 The ASKME Prototype
I have built a prototype to study whether the ASKME approach helps users interact
with unfamiliar knowledge bases. The ASKME prototype consists of a variety of
systems that have been deployed in both production and academic settings. The
prototype uses a version of restricted English called Computer Processable Lan-
guage(CPL) and an upper ontology called the Component Library(CLib). The
question types supported by ASKME are from a well studied question taxonomy.
I designed and implemented a program that performs the question mediation task.
The knowledge bases used by the system are conventional, consisting of a set of
frames in an inheritance hierarchy[3, 8, 15, 52, 69]. Each frame represents a domain
concept, such as Eukaryotic-Cell in biology, Metathesis-Reaction in chem-
istry, and Fall-from-Rest in physics. Frames encode declarative assertions with
associated inference methods, such as rules for automatic classification [15] and com-
putation methods such as, in physics, an equation to compute velocity from distance
and time. Domain experts build the knowledge bases by extending the concepts in
the Component Library[8].
The intended users of the knowledge base are novices in the subject matter,
who will use the knowledge base to answer questions. ASKME users formulate their
questions using Computer Processable Language (CPL) [30]. These questions are
then interpreted and represented in a logical form (described with general terms and relations from the knowledge base) for processing by the problem solving system.
ASKME consists of several components including the CPL interpreter from Boeing
Phantom Works[30], additional word sense disambiguation and semantic role label-
ing technologies developed in our research group [38, 39, 40, 121], and a prototype
Scenario Elaborator that uses heuristics to identify the applicable assumptions in
a question [88]. Using the question formulation generated by CPL, the question
mediator identifies relevant information from the knowledge base for problem solv-
ing. The derived answer is then explained with domain specific terminology using
explanation generation support provided by KM[29] and the CLib[8]. Figure 4.1
shows the prototype system's end-to-end processing of a sample question.
4.1.1 Computer Processable Language
ASKME creates formal representations from restricted English input using a con-
trolled language interpreter, called Computer Processable Language (CPL), that
was originally developed by Boeing Phantom Works[26, 30, 31]. The CPL inter-
preter consists of a syntactic parser called SAPIR[55], a logical form (LF) genera-
tor, an initial logic generator, and subsequent processing modules. CPL currently
“deals with a subset of linguistic phenomena, including nominalizations, passives,
plurals, prepositional phrases, relative clauses, direct anaphora, and a limited form
of conjunction”[30, 31]. For our purposes, Boeing further “extended CPL to handle
chemical equations, interrogatives, indirect anaphora, comparatives, variables, and
physical quantities”[31].
Figure 4.1: A graphical table of contents for the detailed example given in Figures 4.2 and 4.3. ASKME answers the user's question in three steps: interpreting the question using the CPL processor, selecting domain knowledge to answer it using the question mediator, and generating an explanation.
Figure 4.2: In panel 1, a physics question is posed to the system in simplified English. The system interprets the question as shown in panel 2. The scenario and query of the question is interpreted as a Move event on an Object having mass 80 kg. The initial and final velocity of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same object of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question-mark. ASKME's processing continues in Figure 4.3.
Figure 4.3: The continuation of the example from Figure 4.2. In panel 3, the question mediator draws in information from the knowledge base. The final answer and explanation are shown in panel 4.
Example CPL Sentences
A man picks up a large box from a table.
The man carries the box across the room.
The man is sweeping the powder with a broom.
Two vehicles drive past the factory's hangar doors.
The narrator is walking past racks of equipment.
The narrator is standing beside a railing beside a stormwater outfall.

Table 4.2: Example CPL sentences, taken verbatim from Clark et al.[30] and reproduced here for convenience.
the form:
subject + verb + complements + adjuncts
“where complements are obligatory elements required to complete the sentence, and
adjuncts are optional modifiers”[30]. The CPL grammar does not allow pronouns
and requires users to use definite references[30]. Table 4.2 lists some examples of CPL sentences.
Users follow a set of guidelines while writing CPL sentences. Some of the
guidelines are stylistic recommendations to reduce ambiguity, while others are firm
constraints on vocabulary and grammar. Table 4.3 lists some example guidelines
taken verbatim from Clark et al.[30]. The full list of guidelines along with examples
is given in the CPL user’s guide[31, 110].
CPL sentences are converted into logic in three main steps, namely pars-
ing, generation of an intermediate logical form (LF), and conversion of the LF to
statements in the KM knowledge representation language[30].
CPL performs parsing using SAPIR[55], a mature, bottom-up, broad cover-
age chart parser[30, 31]. SAPIR generates an intermediate logical form (LF) that
is a simplified and normalized tree structure with logic-type elements[30]. The LF
Guideline
1. Keep sentences as short and simple as possible.
2. Use just one clause per sentence.
3. Assume the computer has no common sense. State the obvious in the question.
4. Identify and describe the objects, events, and their properties involved in the question.
5. Use "a" to introduce an item, and "the" to refer back to it.
6. Use "first" and "second" to distinguish two of the same kind of items.
7. Use "there is a ..." if needed to introduce an object.
8. Do not mix groups and group members in a scenario.
9. Avoid using pronouns; instead refer using a name ("the block", "the table").
10. Begin a question with the words "what is", "what are", "how many", "how much", or "is it true".
11. Restate a multiple choice question as a set of simple questions.
12. Ask for just one value in a single question.
13. Set up a question by talking about one specific object.
14. Use a question or statement in place of a command.
15. Always include a unit of measure after a numerical value.

Table 4.3: Users follow a set of guidelines while writing CPL sentences. Some of the guidelines are stylistic recommendations to reduce ambiguity, while others are firm constraints on vocabulary and grammar. These guidelines are taken verbatim from Clark et al.[30].
is generated by rules parallel to the grammar rules, and contains variables for noun
phrases and additional expressions for other sentence constituents[30]. The resulting
LF is used to generate ground KM assertions.
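To make this step concrete, the following Python sketch (illustrative only; the actual SAPIR logical forms and KM syntax are considerably richer, and all names here are hypothetical) shows how a simplified LF for a sentence such as "A man carries the box" might be flattened into ground assertions:

    # Toy illustration, not the actual SAPIR/KM machinery: flatten a
    # simplified logical form into ground triples for a KM-style KB.
    lf = ("carry", {"agent": "man", "object": "box"})   # hypothetical LF

    def lf_to_triples(lf):
        verb, roles = lf
        counter = [0]
        def new_instance(word):
            counter[0] += 1
            return "_%s%d" % (word.capitalize(), counter[0])
        event = new_instance(verb)
        triples = [(event, "instance-of", verb.capitalize())]
        for role, noun in roles.items():
            filler = new_instance(noun)   # a fresh instance per noun phrase
            triples.append((filler, "instance-of", noun.capitalize()))
            triples.append((event, role, filler))
        return triples

    print(lf_to_triples(lf))
    # [('_Carry1', 'instance-of', 'Carry'), ('_Man2', 'instance-of', 'Man'),
    #  ('_Carry1', 'agent', '_Man2'), ('_Box3', 'instance-of', 'Box'),
    #  ('_Carry1', 'object', '_Box3')]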
Because we are working in restricted domains, word sense disambiguation and semantic role labeling are considerably easier than in a broad-coverage application, as the choice of senses is constrained to the concepts in the knowledge
base[31]. The logic generator first performs a straightforward transformation of the
LF to first-order logic syntax. Subsequent processing modules then perform word
sense disambiguation, semantic role labeling, co-reference resolution, and some lim-
ited metonymic and other transformations[26, 30, 31]. WordNet is used to help expand the vocabulary of words recognized by CPL. To disambiguate a word, first its WordNet senses are found, and then CPL looks to see if any are mapped directly to a concept in the knowledge base (each concept in the knowledge base has a list of associated WordNet synsets). If one or more are found, the most likely sense is
selected based on corpus frequency statistics. If not, CPL searches up WordNet’s
hypernym tree until a synset that is mapped to a concept in the knowledge base is
found. In this way, specific words like “bicycle” and “cliff” can be used by the user,
although they are not explicitly stated in the knowledge base, as they are mapped
through this process to more general concepts in the knowledge base.
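A minimal sketch of this hypernym-climbing lookup, under assumed toy data (the HYPERNYMS and KB_CONCEPT tables below are hypothetical stand-ins for WordNet and the synset-to-concept mappings; the real CPL additionally ranks senses by corpus frequency):

    # Hypothetical WordNet-style interface: each synset has hypernym (more
    # general) synsets; some synsets are mapped to knowledge-base concepts.
    HYPERNYMS = {"bicycle.n.01": ["wheeled_vehicle.n.01"],
                 "wheeled_vehicle.n.01": ["vehicle.n.01"],
                 "vehicle.n.01": ["conveyance.n.03"]}
    KB_CONCEPT = {"vehicle.n.01": "Vehicle"}  # synset -> KB concept mapping

    def disambiguate(synsets):
        """Climb the hypernym tree until a synset mapped to a KB concept is
        found. `synsets` lists the WordNet senses of the input word, assumed
        ordered by corpus frequency (most likely sense first)."""
        frontier = list(synsets)
        while frontier:
            # Prefer a direct mapping at the current level of generality.
            for s in frontier:
                if s in KB_CONCEPT:
                    return KB_CONCEPT[s]
            # Otherwise move one level up the hypernym tree.
            frontier = [h for s in frontier for h in HYPERNYMS.get(s, [])]
        return None

    print(disambiguate(["bicycle.n.01"]))  # -> "Vehicle"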
While having a controlled language makes interpretation more reliable, it also
introduces the challenge of having the users learn to use it. To facilitate this, Boeing
integrated two additional components into CPL, namely, an advice system that
detects CPL errors and provides reformulation advice, and an interpretation display
system that presents the system’s understanding of the question (both as English
paraphrases and graphically) so the user can check that the system understood
correctly[31].
Original question: A ball is thrown from a cliff. The horizontal velocity of the object is 20 m/s. The height of the cliff is 125 m.

CPL (controlled English) → Logic → Question-Answering

Paraphrase of system's understanding: A ball is the object of a throwing. A cliff is the origin of the throwing. The velocity is equal to 20 meter-per-second(s). The velocity is horizontal. The cliff has a height of 125 m.

Rewriting advice
Figure 4.4: The user poses questions in restricted English. CPL tries to understand the question. The CPL interpreter provides reformulation advice if it detects CPL errors. Otherwise, it presents the system's understanding of the question both as an English paraphrase and graphically for the user to check if CPL understood the question correctly. This figure is courtesy of Peter Clark at Boeing Phantom Works.
CPL detects grammatical violations when grammar rules outside the scope
of CPL, but within the scope of full English, are used in a formulation. If there
are CPL errors, the system highlights them and offers rewriting advice (the advice messages are canned text, not an automatic rewrite of the user's actual input) to help the user fix the mistakes. If the user's reformulation is valid CPL, the system creates
a logical interpretation, and shows its understanding back to the user in two ways.
One, as a set of English paraphrases of the logic, i.e., the question is “round-tripped”
from the user’s CPL, to logic, to a machine generated English paraphrase. Two,
as a graph, whose nodes and edges represent terms and relations in the knowledge
base. The user can then inspect the logical form and edit the graph or reformulate
the original sentences to repair the CPL interpretation. The CPL interpretation
display system allows the user to validate if the system has understood the question
correctly. Both the graph and paraphrases are intuitive visualizations allowing users
to identify interpretation errors without becoming familiar with the content of the
knowledge base being queried.
A more complete description of the CPL interpreter can be found in [26, 30,
31].
4.1.2 The Component Library
The Component Library (CLib)[8] is used as the upper ontology in ASKME. It has
been under development for about ten years by the Knowledge System Group at
the University of Texas at Austin. I am a member of the group and one of the main
contributors to the Component Library.
The Component Library consists of a set of generic Event, Entity,
and Role concepts (so called “components”) and a language for combining them.
The design of the component library emphasizes coverage (allowing users to encode
knowledge in a variety of domains), access (components satisfying user expectations
can be found easily), and semantics (components are general enough to be used in
a variety of contexts, but specific enough to express non-trivial knowledge)[8].
Coverage: The components in the CLib are diverse enough that they can be used
to compose a wide variety of domain knowledge. The CLib achieves coverage with a
small set of concepts (a few hundred) and a closed set of relations (about 80). The
concepts are inspired by English lexical resources (such as dictionaries, thesauri,
and English word lists) and are organized into an inheritance hierarchy in which the
top level concepts are: Entities, Events, and Roles. Unlike traditional approaches
that achieve coverage through the enumeration of a large number of concepts (e.g.,
Cyc[69]), coverage in the CLib is primarily achieved through composition. Compo-
sition consists of specifying relationships between instantiated components so that
additional implications can be computed.
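As a hypothetical illustration (in the spirit of the physics example used later in this chapter, not taken from the CLib itself), a composition is just a set of instantiated components plus the relations asserted between them:

    # Illustrative only: composing generic components into domain knowledge.
    instances = {
        "Exert-Force1": {"instance-of": "Exert-Force", "object": "Bicycle1"},
        "Move1": {"instance-of": "Move", "object": "Bicycle1",
                  "distance": "10 m"},
    }
    relations = [("Exert-Force1", "causes", "Move1")]
    # The axioms attached to Exert-Force and Move then license additional
    # implications, e.g. that the net-force of Exert-Force1 determines the
    # acceleration of Move1.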
Access: To make the library intuitive and accessible to users accustomed to ex-
pressing knowledge with natural language, the set of general terms and relations in
the CLib is heavily influenced by English language usage. The set of concepts is in-
spired by English lexical resources (such as dictionaries, thesauri, and English word
lists), and the concepts are carefully selected to be highly distinct so as to minimize confusion due
to subtle differences.
Semantics: Each concept in the CLib is also richly axiomatized to encode the
meaning of the component as well as how it interacts with other components. These
axioms are general enough to be reusable for encoding detailed domain knowledge,
and are written in such a way that they are consistent with the axioms of other
components during composition.
4.1.3 Supported questions
ASKME supports a total of eight question types. To determine the question types
to support, I worked with subject matter experts to identify the most common
question types and their required processing for a corpus of AP-like exams. The
question types supported by ASKME are also heavily influenced by the Graesser-
Person question taxonomy[49].
The Graesser-Person taxonomy classifies questions by the complexity in-
volved in producing a good answer to a question. The taxonomy was developed
by studying questions posed by individuals in a variety of settings[20]. There are
16 categories in the taxonomy (see Table 4.4) and they are scaled by depth, which
is defined as the amount of information and complexity involved in producing a
good answer to the question. In their analysis, Graesser and Person differentiated among simple shallow questions (categories 1-4), intermediate questions (categories 5-8), and complex deep questions (categories 9-16). The Graesser-Person scale of
depth has been validated to correlate significantly with both Mosenthal’s[77] scale
of question depth and the original Bloom’s[14] taxonomy of cognitive difficulty[50].
Ideally, ASKME would support all 16 question categories in the Graesser-
Person taxonomy. I did not do so for two reasons. First, in my analysis, as well as
that discussed in the literature, deep questions are not common. In a study con-
ducted on “a corpus of multiple choice questions on psychology in college textbooks
and a Graduate Record Examination practice book, it was found that only 23%
of the questions were classified as deep questions according to the Graesser-Person
taxonomy, and 21% were classified as deep questions according to the Mosenthal
scale”[50]. Second, although it is relatively easy to implement the required mech-
anisms to generate good answers for simple and intermediate questions, this is not
the case for deep questions. Answering deep questions requires sophisticated prob-
lem solvers which entail serious engineering work. Therefore, for practical reasons,
I chose to only implement the set of simple and intermediate question types in the
Graesser-Person taxonomy.
4.2 The Design of the Question Mediator
I describe the design of the question mediator in three parts. First, I describe a
basic search controller to systematically search the knowledge base for information
to extend the question formulation for automated reasoning. This search controller
is described using the five components of state-space search: states, goal test, goal
state, operators, and the control strategy. I then describe extensions to the search
controller to resolve representational differences that may exist between question
formulations and the knowledge base. Finally, I describe heuristics used to perform
relevance reasoning.
Depth | Question category | Abstract specification | Example
1 | Verification | Is a fact true? Did an event occur? | Is the answer 5?
2 | Disjunctive | Is X or Y the case? Is X, Y, or Z the case? | Is gender or female the variable?
3 | Concept completion | Who? What? What is the referent of a noun argument slot? | Who ran this experiment?
4 | Feature specification | What qualitative attributes does entity X have? | What are the properties of a bar graph?
5 | Quantification | What is the value of a quantitative variable? How many? | How many degrees of freedom are on this variable?
6 | Definition | What does X mean? | What is a t test?
7 | Example | What is an example label or instance of the category? | What is an example of a factorial design?
8 | Comparison | How is X similar to Y? How is X different from Y? | What is the difference between a t test and an F test?
9 | Interpretation | What concept or claim can be inferred from a static or active pattern of data? | What is happening in this graph?
10 | Causal antecedent | What state or event causally led to an event or state? | How did this experiment fail?
11 | Causal consequence | What are the consequences of an event or state? | What happens when this level decreases?
12 | Goal orientation | What are the motives or goals behind an agent's action? | Why did you put decision latency on the y-axis?
13 | Instrumental/procedural | What instrument or plan allows an agent to accomplish a goal? | How do you present the stimulus on each trial?
14 | Enablement | What object or resource allows an agent to perform an action? | What device allows you to measure stress?
15 | Expectational | Why did some expected event not occur? | Why isn't there an interaction?
16 | Judgmental | What value does the answerer place on an idea or advice? | What do you think of this operational definition?

Table 4.4: For convenience, we reproduce verbatim the question categories, abstract specification, and associated examples from the Graesser-Person question classification scheme[49].
4.2.1 A search controller to find information to answer questions
To ground the description of the question mediator’s search controller, consider the
following physics question:
“A cyclist must stop her bike in 10 m. She is traveling at a velocity of 17
m/s. The combined mass of the cyclist and the bicycle is 80 kg. What is the force
required to stop the bike in this distance?”
For this sample question, relevant information must be combined from the
Motion-under-force and Motion-with-constant-acceleration concepts
for the reasoner to infer an answer. Figure 4.5 shows the question and the knowledge
necessary to answer it. The search graph explored by the question mediator to
answer the question is shown in Figure 4.6.
States
Each state in the state-space contains a representation of knowledge drawn from
the user’s question formulation and from the knowledge base that is intended to
answer it. I call each state a “minikb”. The initial state contains only the question
formulation. The other states are elaborations of the initial state, constructed by
operators in the state space that elaborate the minikbs.
Figure 4.7 shows the minikbs for states 1, 2, and 5 in the search graph of
Figure 4.6. State 1 is the initial state in the search graph and it contains the
original minikb for a user’s question. State 2 results from elaborating State 1 using
the Motion-with-constant-acceleration concept in the knowledge base. This
elaboration introduced an equation to calculate the acceleration of the Move event.
Using this equation, the reasoner calculated the acceleration of the Move event
to be -14.45 m/s². Further elaborating state 2 using the Motion-under-force
concept results in state 5, which introduces other equations. In this case, state 5
!"#$%&'
()*$'
+,--'
./.0,12*$1)%.&3'
4/,12*$1)%.&3'
56$7&28)7%$' %,9-$-'
)"#$%&'
/$&2:)7%$'
;<'=>'
?' <'+@-'
AB'+@-'
A<'+'
)"#$%&'
C.-&,/%$'
!"#$%&'
()0)/2D.&E2%)/-&,/&2,%%$1$7,0)/'()0)/29/C$72:)7%$'
+,--'
./.0,12*$1)%.&3'
4/,12*$1)%.&3'
56$7&28)7%$' %,9-$-'
)"#$%&'
,%%$1$7,0)/'/$&2:)7%$'
;<'=>'
2AAFGH' AIJIF'+@-K' <'+@-'
AB'+@-'
A<'+'
)"#$%&'
C.-&,/%$'
!"#$%&'(!L '4/,12*$1)%.&3K'M'./.0,12*$1)%.&3K'N'OK'P',%%$1$7,0)/'P'C.-&,/%$Q'L '/$&2:)7%$'M'+,--'P',%%$1$7,0)/'
)#*(%&'+ ,-'-./+
)#*(%&'+
R'%3%1.-&'+9-&'-&)S'E$7'".=$'./'A<'+J''
TE$'.-'&7,*$1./>',&','*$1)%.&3'):'AB'+@-J''UE$'%)+"./$C'+,--'):'&E$'%3%1.-&',/C'".%3%1$'.-';<'=>J''
VE,&'.-'&E$':)7%$'7$W9.7$C'&)'-&)S'&E$'".=$'./''&E.-'C.-&,/%$?'
!"#$%&'(!
Figure 4.5: On the left-hand side, the scenario and query of the question is represented as a Move event on an Object having mass 80 kg. The initial and final velocities of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same as that of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question-mark. To answer the question, the question mediator has to draw in information from the knowledge base. As shown on the right-hand side, information from the Motion-under-force and Motion-with-constant-acceleration concepts is used to elaborate the question for the reasoner to compute the answer.
Figure 4.6: Search graph created by the question mediator to answer the question introduced in Figure 4.5. Each state in our state-space tree contains a minikb. The initial state in the tree represents the original minikb for the question. The minikbs in other states are elaborations of the original minikb. Each operator in the tree describes how a concept in the knowledge base can be applied to elaborate a minikb to produce another.
contains the necessary equations to derive the net-force causing the Move event
to be -1156 Newtons.
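Working through the numbers of the example for concreteness (this is ordinary kinematics plus Newton's second law; the knowledge base encodes these same equations in the two concepts named above):

    # Worked computation for the cyclist example.
    mass = 80.0         # kg
    v_initial = 17.0    # m/s
    v_final = 0.0       # m/s
    distance = 10.0     # m

    # From Motion-with-constant-acceleration:
    #   v_final**2 = v_initial**2 + 2 * acceleration * distance
    acceleration = (v_final**2 - v_initial**2) / (2 * distance)  # -14.45 m/s^2

    # From Motion-under-force: net_force = mass * acceleration
    net_force = mass * acceleration                              # -1156.0 N
    print(acceleration, net_force)

The negative sign reflects that the force opposes the direction of motion.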
Goal Test and the Goal State
The eight question types supported by ASKME are represented as query triples posed to the minikb. Complex questions can be expressed by chaining multiple query triples.
Table 4.5 summarizes how each question type can be represented as query triples.
The goal test determines if a state answers the question. The goal test is
implemented as a projection operator, which is a common solution for matching
queries with documents in semantic information retrieval systems[46, 53, 57, 106].
The projection operator maps a query triple to a state’s minikb if it is isomorphic to
a subgraph in the minikb, and when every concept and relation in the query triple
subsumes its corresponding concept and relation in the minikb.
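The following Python sketch shows the shape of this projection test over triples, with a toy SUBSUMES table standing in for the taxonomy (the real matcher operates over full graphs and checks subsumption on relations as well as concepts):

    # Toy subsumption table: concept -> set of concepts it subsumes.
    SUBSUMES = {"Move": {"Move", "Motion-with-constant-acceleration"},
                "Event": {"Event", "Move", "Motion-with-constant-acceleration"}}

    def subsumes(general, specific):
        return specific in SUBSUMES.get(general, {general})

    def project(query_triple, minikb):
        """Map a (head, relation, value) query onto a minikb of ground
        triples. The query matches a triple when its head subsumes the
        triple's head and the relations agree; a '?' in the value position
        retrieves the filler."""
        head, rel, val = query_triple
        for (h, r, v) in minikb:
            if subsumes(head, h) and rel == r and (val == "?" or val == v):
                return v
        return None

    minikb = [("Motion-with-constant-acceleration", "acceleration",
               "-14.45 m/s^2")]
    print(project(("Move", "acceleration", "?"), minikb))  # -> "-14.45 m/s^2"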
!"#$%&'
()*$'
+,--'
./.0,12*$1)%.&3'
4/,12*$1)%.&3'
56$7&28)7%$'
%,9-$-'
)"#$%&'
/$&2:)7%$'
;<'=>'
?'
<'+
@-'
AB'+
@-'
A<'+
'
)"#$%&'
C.-&,/%$'
!"#$%&'
()0)/2D
.&E2%)/-&,/&2,%%$1$7,0)/'
+,--'
./.0,12*$1)%.&3' 4/,12*$1)%.&3'
56$7&28)7%$'
%,9-$-'
)"#$%&'
,%%$1$7,0)/'
/$&2:)7%$'
;<'=>'
?'
2AFGFH'+
@-I'
<'+
@-'
AB'+
@-'
A<'+
'
)"#$%&'
C.-&,/%$'
!"#$%&'
()0)/2D
.&E2%)/-&,/&2,%%$1$7,0)/'
()0)/29/C$72:)7%$'
+,--'
./.0,12*$1)%.&3' 4/,12*$1)%.&3'
56$7&28)7%$'
%,9-$-'
)"#$%&'
,%%$1$7,0)/'
/$&2:)7%$'
;<'=>'
2AAHJK'
2AFGFH'+
@-I'
<'+
@-'
AB'+
@-'
A<'+
'
)"#$%&'
C.-&,/%$'
A'
I'
H'
!""#$%&'#()(*+&,-.)(/"0(11$-.1)
$.20-3+%(3)'4)5
-,-.6+.3(067-0%()
L!'/$&2:)7%$'M'+
,--'N',%%$1$7,0)/'
$.20-3+%(3)'4)5
-,-.68
$296%-.12&.26&%%(#(0&,-.'
L!'4/,12*$1)%.&3
I'M'./.0,12*$1)%.&3
I'O'PI'N',%%$1$7,0)/'N'C.-&,/%$Q'
!""#$%&'#()(*+&,-.)(/"0(11$-.1)
!""#$%&'#()(*+&,-.)(/"0(11$-.1)
$.20-3+%(3)'4)5
-,-.68
$296%-.12&.26&%%(#(0&,-.'
L!'4/,12*$1)%.&3
I'M'./.0,12*$1)%.&3
I'O'PI'N',%%$1$7,0)/'N'C.-&,/%$Q'
Fig
ure
4.7:
The
min
ikbs
for
stat
es1,
2,an
d5
inth
ese
arch
grap
hin
Fig
ure
4.6.
Stat
e1,
the
init
ials
tate
inth
ese
arch
grap
h,co
ntai
nsth
eor
igin
alm
inik
bfo
rth
equ
esti
on.
Stat
e2
cont
ains
the
min
ikb
crea
ted
byel
abor
atin
gth
em
inik
bin
Stat
e1
usin
gth
eM
otio
n-w
ith-c
onst
ant-a
cceleratio
nco
ncep
t.T
his
elab
orat
ion
intr
oduc
edan
equa
tion
toca
lcul
ate
the
acce
lera
tion
ofth
eM
ove
even
t.U
sing
this
equa
tion
,the
acce
lera
tion
ofth
eM
ove
was
com
pute
dto
be14
.45
m/s
2.
Furt
her
elab
orat
ing
the
min
ikb
inSt
ate
2us
ing
theM
otio
n-u
nder-f
orce
conc
ept
resu
lts
inth
em
inik
bin
stat
e5.
Thi
sm
inik
bco
ntai
nsth
eeq
uati
ons
toco
mpu
teth
ene
t-fo
rce
caus
ing
the
Move
tobe
-115
6N
ewto
ns.
64
Question type | Query triple
What is the r of X? | (X r ?)
Is it true that the r of X is Y? | (X r Y)
Is it true that X is greater than Y? | (X <operator> Y)
What is the relationship between X and Y? | (X ? Y)
How many r of X? | (X rhowmany ?)
How many r of X is Y? | (X rhowmany Y)
What is an example of X? | (? instance-of Class)
What is the X? | (X instance-of ?)
Describe X? | (X instance-of Class)
What is the similarity between X and Y? | (X similar-to Y)
What is the difference between X and Y? | (X different-to Y)
Table 4.5: Summary of how each question type is represented as query triples.
I implement the projection operator using a semantic matcher [80, 94, 119].
The matcher determines if a query can be projected onto a state’s minikb by finding
a mapping between the query triple and triples in the state’s minikb.
In our example the goal test is satisfied when the projection operator maps
the query onto the state’s minikb to compute the net-force causing the Move event.
A state in the search graph containing such a minikb is known as a goal state.
Consider the minikbs for states 2 and 5 in Figure 4.7. State 2 fails the goal
test because it does not return an answer to the query – its minikb lacks the necessary
axioms and inference methods to return a value for the net-force of the Exert-
Force event. In contrast, the minikb in state 5 contains the necessary equations
to derive an answer. Thus, state 5 satisfies the goal test and is a goal state, as it
returns the value, i.e., -1156 Newtons, for the net-force causing the Move.
Operators
Operators elaborate a state to produce another. I next describe how operators are
created and applied.
Creating Operators Operators are created from the set of concepts in the knowl-
edge base. Each concept contains declarative assertions with associated inference
methods (e.g., rules for automatic classification) and computation methods (e.g.,
mathematical equations). Concepts in the knowledge base are represented as con-
cept graphs and the search controller uses a semantic matcher [80, 94, 119] to de-
termine how information in a concept relates to a state. A semantic matcher takes
two representations (encoded in a form similar to conceptual graphs [106]) and uses
taxonomic knowledge to find the largest connected subgraph in one representation
that is isomorphic to a subgraph in the other. The output of semantic matching
forms an operator relating two states.
Figure 4.8 shows in bold the common features between state 1 and the
Motion-with-constant-acceleration concept found by the semantic matcher.
In this case, Move matches Motion-with-constant-acceleration because
the former subsumes the latter, and the distance of the Move in the minikb
matches the distance of the Motion-with-constant-acceleration concept.
These common features become Operator A relating state 1 to state 2 in the search
graph shown in Figure 4.6.
Applying Operators Applying an operator to a state creates a successor state.
The successor state includes new information introduced by a concept from which
the operator was created. The new information is merged with the parent state
by joining [106] the representations on their overlapping features found by seman-
tic matching. The successor state contains additional information and inference
methods introduced by the operator’s concept.
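The following sketch shows the shape of this match-then-join step, using sets of triples as a stand-in for concept graphs (the real semantic matcher finds the largest connected isomorphic subgraph under subsumption and renames the concept's instances according to the mapping; here the "match" is simply the shared triples):

    def match(state, concept):
        """Toy stand-in for semantic matching: the overlapping triples."""
        return state & concept

    def apply_operator(state, concept, mapping):
        """Join the concept onto the state over the matched features.
        Because the toy triples are already aligned, the join is a set
        union; the real system first renames instances per the mapping."""
        assert mapping, "an operator exists only if some features overlap"
        return state | concept

    state1 = {("Move1", "distance", "10 m"), ("Move1", "object", "Object1")}
    concept = {("Move1", "distance", "10 m"),
               ("Move1", "acceleration", "?a"),
               ("Move1", "equation", "vf^2 = vi^2 + 2*?a*distance")}

    operator_a = match(state1, concept)     # the common features
    state2 = apply_operator(state1, concept, operator_a)
    print(len(state1), len(operator_a), len(state2))  # 2 1 4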
Figure 4.9 shows the successor state that results from applying Operator A
to state 1 (Figure 4.6). It was created by the match shown in Figure 4.8. This
successor state becomes state 2 in the search graph (Figure 4.6).
Figure 4.8: The semantic matcher identified the matching features (highlighted in bold) between state 1 on the left-hand side and the Motion-with-constant-acceleration concept on the right-hand side. In this case, the result of semantic matching becomes Operator A relating state 1 to state 2 in the search graph shown in Figure 4.6.
Figure 4.9: The elaborated minikb after applying Operator A in the search graph (Figure 4.6). Panel 1 shows the minikb of state 1 and panel 2 shows the Motion-with-constant-acceleration concept. The overlapping features found by semantic matching between the minikb in state 1 and Motion-with-constant-acceleration are highlighted in bold in panels 1 and 2. These overlapping features are joined to form the minikb in panel 3. The new piece of knowledge introduced by the concept Motion-with-constant-acceleration is highlighted in bold in panel 3. This minikb forms State 2 of the search graph (Figure 4.6).
Control Strategy
The question mediator expands the search graph in a breadth-first manner. Breadth-first expansion ensures that the minikb returned by the question mediator contains only concepts necessary to answer the question, avoiding an overly complex minikb that may be inefficient to execute or may yield incoherent explanations.
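A sketch of the resulting control loop (the helper names are hypothetical, mirroring the pseudocode given later in Figures 4.17-4.21; the time bound anticipates the discussion under "Completeness and Termination"):

    from collections import deque
    import time

    def answer_question(question_minikb, kb, goal_test, expand,
                        time_bound=300):
        """Breadth-first search over elaborated minikbs.
        `expand(state, kb)` yields successor minikbs (one per relevant
        concept); `goal_test(state)` returns an answer or None. A sketch of
        the control strategy only; relevance ordering and operator
        filtering are elided."""
        start = time.time()
        frontier = deque([question_minikb])
        while frontier and time.time() - start < time_bound:
            state = frontier.popleft()      # FIFO queue => breadth-first
            answer = goal_test(state)
            if answer is not None:
                return answer, state        # the answering minikb
            frontier.extend(expand(state, kb))
        return None, None                   # no answer within the bound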
4.2.2 Resolving Representational Differences
When knowledge bases and question formulations are built using an expressive on-
tology, the same meaning can be represented differently. Such representational dif-
ferences are especially common when the knowledge bases are built and used by
different groups of users, and for large knowledge bases covering complex domains.
For the question mediator to work well, it has to resolve representational differences
that may exist between question formulations and the knowledge base. Towards this
end, I extend the question mediator to resolve representational differences during
operator creation and in the goal test.
Resolving Differences when Creating Operators
Consider the question:
Question:
A cell has 46 chromosomes. What is the cell?
Question formulation:
Cell? --has-part--> (46 Chromosome)
The assertion that the human cell has 46 chromosomes might be encoded in
the knowledge base as either:
Simpler Representation:
Human-Cell --has-part--> (46 Chromosome)
or
Richer Representation:
Human-Cell --has-part--> Nucleus --has-part--> DNA --has-part--> (46 Chromosome)
With the simpler representation, there are no representational differences
between the question formulation and the knowledge base. Therefore, answering the
question is trivial because the semantic matcher can easily find the correspondence
between the question formulation and the information in the Human-Cell concept.
In contrast, it is necessary to resolve representational differences when the richer
representation is used. This is because semantic matching fails to match:
Question formulation:
Cell? --has-part--> (46 Chromosome)
with
Richer Representation:
Human-Cell --has-part--> Nucleus --has-part--> DNA --has-part--> (46 Chromosome)
To resolve such representational differences, I use a semantic matcher that is
enhanced with a set of transformation rules. These transformation rules resolve rep-
resentational differences when two knowledge structures having the same meaning
are represented differently [120].
Each transformation rule used in the flexible semantic matcher is an instance
of the pattern “transfers through” [69] which has the following form:
C1 --r1--> C2 --r2--> C3   ⇒   C1 --r1--> C3
where Ci is a concept and rj is a relation. A transformation rule is applicable to
a representation if its antecedent subsumes a subgraph of the representation. The
flexible semantic matcher applies a rule by joining [106] the rule’s consequent with
the subgraph that matched the rule’s antecedent. Applying a transformation rule
will replace one relation in an encoding with a sequence of semantically similar
relations. The transformation rules are iteratively applied to transform the input
graphs to include additional triples. The flexible semantic matcher stops and returns
a result when no additional correspondence is found between the graphs or when a
depth bound is reached. This forms the transitive closure of a transitive relation such as has-part.
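A sketch of iterated rule application over triples, using has-part transitivity as the running example (the real matcher matches rule antecedents with subsumption rather than string equality, and handles general r1/r2 pairs):

    def transfer_through(triples, rel1, rel2, depth_bound=5):
        """Iteratively apply  C1 -rel1-> C2 -rel2-> C3  =>  C1 -rel1-> C3.
        With rel1 == rel2 == "has-part" this computes the transitive
        closure of has-part, up to the depth bound."""
        triples = set(triples)
        for _ in range(depth_bound):
            new = {(c1, r1, c3)
                   for (c1, r1, c2) in triples if r1 == rel1
                   for (c2b, r2, c3) in triples if r2 == rel2 and c2b == c2}
            if new <= triples:      # fixpoint: nothing left to add
                break
            triples |= new
        return triples

    rich = {("Human-Cell", "has-part", "Nucleus"),
            ("Nucleus", "has-part", "DNA"),
            ("DNA", "has-part", "46 Chromosome")}
    closed = transfer_through(rich, "has-part", "has-part")
    print(("Human-Cell", "has-part", "46 Chromosome") in closed)  # True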
Figure 4.10 shows the application of transformation rules to include addi-
tional relations between the Human-Cell and Chromosome entities in the
richer representation. The transformed version of the richer representations allows
the semantic matcher to find the correspondence with the question by resolving
the granularity mismatch between the question “How many chromosomes are in
a person’s cell?” and the richer representation. This allows the question to be
answered because the necessary information is found and applied by the question
mediator. Additional details on the list of transformation rules and the flexible
semantic matcher are described in [121].
Resolving Differences in the Goal Test
For every query triple in a question formulation, the basic goal test retrieves the
values for a single slot of a given frame. Consider the question:
Question:
What is the function of lysosomes?
Question formulation:
Lysosome --plays--> Role?
Figure 4.10: The application of a transformation rule to include an additional relation between Human-Cell and Chromosome in the richer representation. This particular rule encodes the transitivity of the has-part relation.
The basic goal test returns a shallow answer describing lysosomes simply
as containers. Ideally, the returned answer would also describe the specific role played by lysosomes as an instrument enclosing digestive enzymes (rendered in italics below).
Basic goal test’s answer
Lysosomes play the role of a container.
Ideal answer
Lysosomes play the role of a container, ..., and are an instrument enclosing digestive enzymes.
The basic goal test may return uninteresting or incorrect answers for two
reasons. First, it is challenging for users unfamiliar with the knowledge base to
determine which frame-slots are relevant to the answer because there can be a variety
of related frame-slots answering a question. Second, additional details expected in
the question may not be returned by the basic goal test if granularity differences
exist between the query triple and the knowledge base being queried. I use two
techniques to improve the basic goal test to return additional details expected in
the question. Both techniques use knowledge represented in the knowledge base to
transform the original query. This has the advantage of being domain and ontology
independent.
The first technique, a type of query expansion[35], is to include related queries
to return additional details expected in the question. Semantically related relations
in the knowledge base are cataloged into a variety of dimensions[2, 8, 66] that
convey similar kinds of information. For example, the structural properties of an
Entity can be retrieved by querying the relations in the Structural dimension.
The question mediator includes additional queries from the same dimension if the
original query is a member of the dimension. Using this technique, the goal test to
answer “What is the function of lysosomes?” is expanded to include related queries
on the Agentive roles, e.g., the instrument and agent relations, of the Lysosome
(see Figure 4.11). This in turn retrieves additional details from the minikb to answer
the question.
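A sketch of this dimension-based expansion, with a small hypothetical catalog of dimensions (the actual CLib dimensions and their member relations are richer):

    # Hypothetical catalog: each dimension groups semantically related
    # relations.
    DIMENSIONS = {
        "Agentive": ["plays", "instrument-of", "agent-of"],
        "Structural": ["has-part", "has-basic-structural-unit", "material"],
    }

    def expand_query(query_triple):
        """If the queried relation belongs to a dimension, also query the
        sibling relations in that dimension."""
        head, rel, val = query_triple
        for relations in DIMENSIONS.values():
            if rel in relations:
                return [(head, r, val) for r in relations]
        return [query_triple]

    print(expand_query(("Lysosome", "plays", "?")))
    # [('Lysosome', 'plays', '?'), ('Lysosome', 'instrument-of', '?'),
    #  ('Lysosome', 'agent-of', '?')]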
The second technique to improve the goal test resolves the granularity differ-
ences that may exist between the question’s query triples and the knowledge base.
Consider the question:
Question:
What are the parts of a cell?
Question formulation:
Cell --has-part--> Entity?
The basic goal test will return the <frame slot value> triples highlighted in bold in Figure 4.12(a). Ideally, I would like to return the additional details shown in
Figure 4.12(b) to describe the partonomy of a Cell.
Figure 4.11: The original set of query triples to answer "What is the function of lysosomes?" is expanded to include related queries on the Agentive roles. This, in turn, retrieves additional details from the knowledge base to answer the question.
ASKME achieves the ideal answer by transforming the original query into
additional finer-grained queries on the knowledge base. It creates the additional
queries by applying the transformation rules of the transfer-thru pattern[69, 121] on
the original query triples. For example,
Question:
What are the parts of a cell?
Question formulation: (with finer-grained query triples)
Cell --has-part--> Entity?
Cell --has-part--> Entity? --has-part--> Entity?
Cell --has-part--> Entity? --has-basic-structural-unit--> Entity?
...
(a) Answer returned when additional queries to retrieve finer-grained information are not used
(b) Answer returned when additional queries to retrieve finer-grained information are used

Figure 4.12: Answer differences when additional queries are used/not used to retrieve finer-grained information.
These additional queries retrieve other finer-grained information to answer
the question.
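A sketch of how such finer-grained query chains might be generated mechanically (a hypothetical helper mirroring the transfer-thru pattern; variable names such as ?x0 stand for the intermediate entities):

    def refine_query(query_triple, chain_relations, depth=2):
        """Unfold (Cell, has-part, ?) into query chains such as
        (Cell, has-part, ?x0), (?x0, has-part, ?x1) up to the given depth,
        mirroring the transfer-thru rules run in reverse."""
        head, rel, val = query_triple
        chains = [[query_triple]]
        for d in range(1, depth + 1):
            for r in chain_relations:
                chain = [(head, rel, "?x0")]
                chain += [("?x%d" % i, r, "?x%d" % (i + 1)) for i in range(d)]
                chains.append(chain)
        return chains

    for chain in refine_query(("Cell", "has-part", "?"),
                              ["has-part", "has-basic-structural-unit"],
                              depth=1):
        print(chain)
    # [('Cell', 'has-part', '?')]
    # [('Cell', 'has-part', '?x0'), ('?x0', 'has-part', '?x1')]
    # [('Cell', 'has-part', '?x0'),
    #  ('?x0', 'has-basic-structural-unit', '?x1')]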
4.2.3 Relevance Reasoning
Because states can be elaborated by potentially many concepts in the knowledge
base, the question mediator may explore a large number of states to find a minikb
answering the question. To keep the question mediator scalable on large knowledge bases, I want it to select only relevant concepts. Toward this end, the system performs relevance reasoning with
a set of heuristics that control operator creation and ordering.
Creating Operators.
When creating operators, the question mediator rejects concepts when:
1. No new knowledge is added
2. Knowledge added is unrelated
3. Inconsistent knowledge is added
Figure 4.13 shows a concept that is rejected because it would add only unrelated knowledge. The left-hand side of Figure 4.13 shows state 1, the original minikb for the sample question in Figure 4.5. The right-hand side shows the Circle concept. In this case, there are no common features between the minikb and
the Circle concept, as the two do not have any features (i.e., terms and relations)
in common.
Figure 4.14 shows an example where a concept is rejected because it adds
inconsistent knowledge to a state. The left-hand side of Figure 4.14 shows state 1, the original minikb for the sample question in Figure 4.5. The right-hand side
Figure 4.13: The semantic matcher did not identify any matching features between State 1 (Figure 4.6) on the left-hand side and the Circle concept on the right-hand side. In this case, no operator is created.
shows the Fall-from-rest concept. The semantic matcher finds the overlap
between these two graphs, as highlighted in bold. In this case, the Move event
in state 1 and its properties, object, initial velocity, final velocity, and distance,
align with Fall-from-rest having similar features. Fall-from-rest is rejected
by the question mediator because it introduces a contradiction in which the initial
velocity of the Move has multiple values, 17 m/s and 0 m/s.
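A sketch of this consistency check for single-valued slots (the real system examines the deductive closure of the elaborated minikb; here only duplicate fillers of assumed cardinality-one slots are detected):

    SINGLE_VALUED = {"initial-velocity", "final-velocity", "mass", "distance"}

    def consistent(minikb):
        """Reject an elaboration if any single-valued slot of an instance
        ends up with two distinct values (e.g. initial-velocity of 17 m/s
        from the question and 0 m/s from Fall-from-rest)."""
        seen = {}
        for (head, rel, val) in minikb:
            if rel in SINGLE_VALUED:
                if seen.setdefault((head, rel), val) != val:
                    return False
        return True

    bad = [("Move1", "initial-velocity", "17 m/s"),
           ("Move1", "initial-velocity", "0 m/s")]   # from Fall-from-rest
    print(consistent(bad))  # False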
Control Strategy
The question mediator orders the application of operators to give priority to concepts
in the knowledge base having:
1. knowledge structures directly affecting the query
2. a high degree of similarity
3. the least number of assumptions added
!"##$%&'($&)*+,-'./)0+,
123)/+,
!"##$%&'($&)*+,
4.45"#$6)#'/4+7,
8."#$6)#'/4+7,
!""#$#%!&'()
*+,)-./0)
9,(:*,
'23)/+,
;4*+"./),
<00#4/"2#),)=>"5'.,)?0&)**4'.*,
1!2(!$34#$'"5670)8)5(5&!$34#$'"5670)9)0:!""#$#%!&'();)<5/6!("#=)
123)/+,
@'6),
-!//)
4.45"#$6)#'/4+7,
8."#$6)#'/4+7,
>?#%63@'%"#) "!A/#/)
'BC#"6)
,D)EF)
G)
9,(:*,
AB,(:*,
A9,(,
'23)/+,
;4*+"./),(#63H'%"#)C,
(4.4$D2,
E+"+),A,
Figure 4.14: Information from Fall-from-rest is added to State 1 of Figure 4.6, resulting in an inconsistency in which the initial velocity of the Move has multiple values, 17 m/s and 0 m/s.
Figure 4.15 shows how operators B and C in Figure 4.6 are ordered based
on their concept’s similarity to State 1. Figures 4.15(a) and 4.15(b) show how state
1 matches Motion-under-force and Two-Dimensional-Move respectively.
The match with Motion-under-force is preferred because a larger portion of
the Motion-under-force concept graph matches state 1.
4.2.4 Block diagram and pseudocode
The block diagram in Figure 4.16 shows the various components of the question
mediator.
1. Search controller - The controller incrementally extends a question with infor-
mation in the knowledge base to compute an answer. It controls the process of
finding and trying different subsets of the knowledge base to answer the ques-
tion. Each subset of the knowledge base is a node in the search graph explored
by the search controller. I call a node in the search graph a “minikb”.
!"#"$%&$'()%*")+(,-"$+(./,
012(+/,
!"#"$%&$'()%*")+(,
3455,
67()/%8")+(, +4&5(5,
"12(+/,
$(/%*")+(,
"12(+/,
!""#$#%!&'()
9..:;+41:(,(<&4#"$,(7.)(55;"$5,
*!)(#+,-'%"#).)/!00)1)!""#$#%!&'()
012(+/,
!"=(,
3455,
2(2&!$,3#$'"2+4)
5(!$,3#$'"2+4)
67()/%8")+(, +4&5(5,
"12(+/,
$(/%*")+(,
>?,@A,
B,
6)/70)
89)/70)
86)/)
"12(+/,
:20+!("#)C,
3;$;%@1,
D/4/(,E,
(a) semantic match output for State 1 and Motion-under-force
!"#$%&'()*&#)+,$-#.(/0#)1(23/
456(13/
!"#$%&'()*&#)+,$-#.(/
!"!#$%&'&()%*+!,-.
/"$%&'&()%*+!,-.
!"!#$%&-&()%*+!,-.
#56(13/
/"$%&-&()%*+!,-.
-&$++)%)0$#*".
'&$++)%)0$#*".
722,&1+5,(/(89+:#)/(;2<(**&#)*/
1!/=/
456(13/
-#.(/
2$33.
!"!#$%&()%*+!,-.
/"$%&()%*+!,-.
4')0,&5*0+). +$63)3.
*78)+,.
"),&9*0+).
:;.<=.
>.
;.2?3.
@A.2?3.
@;.2.
#56(13/
B!3,$"+).>/
'&)&$?5/
@3+3(/A/
(b) semantic match output for State 1 and Two-Dimensional-Move
Figure 4.15: Different degrees of match between state 1 in Figure 4.6 and theMotion-under-force and Two-Dimensional-Move concepts in the knowledgebase. The match with Motion-under-force has a higher degree of match be-cause a larger portion of Motion-under-force matches state 1. Thus, OperatorB is preferred over Operator C in Figure 4.6.
79
!"#$%&'
%()*$(++"$'
,-.'(/
*0/*'
12/"
$34'
5(#+'*"6*'
7/"
$3'
890#):
"$'
2/"$3'
#)6;
"$164''
2/"$3+<6*'
=+"9<>+"'
6"?#)@%'?
#*%&"$'
A(:
"'890#):
"$')(
:"+<6*'
)(:"
'
)(:"
B'%#):
<:#*"'
%()%"0
*'?#00<)C'
)(:"
B'2/
"$3'
?#00<)C'
+<6*'(
D'#)6;
"$#>+"'
2/"$<"6'
)(:"
B''2/
"$3'
EF'
2/"$<"6'
EF'
$"6/+*'
!"#
$%&'
()#*
+,-&.(
G"+"H#)%"'
G"#6()
"$'
)(:"
+<6*'
6($*":
')(
:"+<6*'
EF'I)D"$")%"'8)
C<)"
'
E)(;
+":C"'>#6"'
Fig
ure
4.16
:B
lock
diag
ram
ofth
equ
esti
onm
edia
tor.
80
2. Relevance reasoner - Given a list of nodes, the relevance reasoner reorders the
list to give priority to: (a) shallower nodes (to preserve completeness property
of breadth first search), (b) goal nodes, and (c) nodes that contain the fewest
number of assumptions to minimize the number of abductive inferences (which are unsound).
3. NodeExpander - Given a node, NodeExpander creates its successors. Each
successor node is constructed by joining the node’s minikb with a concept from
the knowledge base. NodeExpander rejects concepts that introduce redundant,
inconsistent, or unrelated information.
4. Goal test - The goal test succeeds when query triples can be projected (via
semantic matching) onto a node’s minikb to compute an answer.
5. ExpandQuery - Given the original query, ExpandQuery consults the knowl-
edge base to find a set of related queries.
6. Flexible semantic matcher - The matcher takes two representations (encoded
in a form similar to conceptual graphs [106]) and uses taxonomic knowledge to
find the largest connected subgraph in one representation that is isomorphic
to a subgraph in the other. The matcher is flexible because it is enhanced
with a set of transformation rules that might apply to both representations to
reduce the number of non-matching features between them[121].
The pseudocode for the question mediator is given in Figures 4.17-4.21.
Completeness and Termination
We know that best first search is complete and terminates if the heuristics are
only used to reorder the queue but not prune it, and when the set of states to be
searched is finite. The basic formulation of the question mediator is complete and
The pseudocode for the question mediator is given in Figures 4.17-4.21. The high-level structure of the algorithm is best-first search. This figure gives a standard definition of best-first search by Nilsson[84]. Figures 4.18, 4.19-4.20, and 4.21 show the instantiation of steps 6, 7, and 9 (respectively) to describe the question mediator.
1. The variables OPEN, CLOSED, QUERY, and KB are global variables.
2. Create a search graph, G, consisting solely of the start node, s. Put s on a list called
OPEN.
3. Create a list called CLOSED that is initially empty.
4. LOOP: if OPEN is empty, exit with failure.
5. Select the first node on OPEN, remove it from OPEN, and put it on CLOSED. Call
this node n.
6. If n is a goal node, exit successfully with the solution. ( GoalState-p(n, QUERY ),
see Algorithm 1)
7. Expand node n, generating the set, M, of its successors that are not ancestors of n.
Install these members of M as successors of n in G. ( ExpandNode(n), see Algorithm
4)
8. Establish a pointer to n from those members of M. Add these members of M to OPEN.
9. Reorder the list OPEN, either according to some arbitrary scheme or according to heuristic merit. (ReorderNodeList(OPEN), see Algorithm 11)

10. Go LOOP
Figure 4.17: The pseudocode for the question mediator is given in Figures 4.17-4.21. The high-level structure of the algorithm is best-first search. This figure gives a standard definition of best-first search by Nilsson[84]. Figures 4.18, 4.19-4.20, and 4.21 show the instantiation of steps 6, 7, and 9 (respectively) to describe the question mediator.
The following pseudocode is for step 6.

Algorithm 1 GoalState-p(node, query) (see Section 4.2.1, "Goal Test and the Goal State", and Section 4.2.2, "Resolving Differences in the Goal Test")
Given: node and query
Return: answerquerylist
  (expand the user's query with ones related to it)
  querylist ← ExpandQuery(query)
  (find the subset of those queries that can be answered using the node's minikb)
  answerquerylist ← GetAnswerableQuery(node, querylist)
  return answerquerylist

Algorithm 2 ExpandQuery(query)
Given: query
Return: querylist
  (this function queries the knowledge base for related queries; its workings are described in Section 4.2.2, "Resolving Differences in the Goal Test")

Algorithm 3 GetAnswerableQuery(node, querylist) (see Section 4.2.1, "Goal Test and the Goal State", and Section 4.2.2, "Resolving Differences in the Goal Test")
Given: node and querylist
Return: answerquerylist, the sublist of querylist that can be answered using node's minikb
  answerquerylist ← empty
  for all query in querylist do
    mapping ← FlexibleSemanticMatch(node, query)
    (query is answered if it can be projected onto node, i.e., non-empty mapping)
    if ¬ empty(mapping) then
      answerquerylist ← add query to answerquerylist
    end if
  end for
  return answerquerylist
Figure 4.18: Pseudocode for step 6 of Figure 4.17
The following pseudocode is for step 7.

Algorithm 4 ExpandNode(node) (see Section 4.2.1, "Control Strategy")
Given: node
Return: nodelist
  operatorlist ← CreateOperators(node)
  for all operator in operatorlist do
    newnode ← ApplyOperator(node, operator)
    nodelist ← add newnode to nodelist
  end for
  return nodelist

Algorithm 5 CreateOperators(node) (see Section 4.2.1, "Creating Operators", and Section 4.2.2, "Resolving Differences when Creating Operators")
Given: node
Return: operatorlist, a list of mappings, each describing how a concept in the KB can be related to node
  operatorlist ← empty
  for all concept in KB (global variable) do
    mapping ← FlexibleSemanticMatch(node, concept)
    if ¬ (IrrelevantMapping-p(mapping) ∨ RedundantMapping-p(mapping, node, concept) ∨ InconsistentMapping-p(mapping, node, concept)) then
      operatorlist ← add mapping to operatorlist
    end if
  end for
  return operatorlist

Algorithm 6 ApplyOperator(node, operator) (see Section 4.2.1, "Applying Operators")
Given: node and operator
Return: newnode
  (operator contains concept and the mapping of how it can be joined to node)
  newnode ← the conceptual-graph join of node and concept according to the mapping given by operator
  return newnode
Figure 4.19: Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (Part 1/2)
Algorithm 7 FlexibleSemanticMatch(Graph1, Graph2)
Given: Graph1 and Graph2
Return: mapping
  (finds the largest connected subgraph match between Graph1 and Graph2, minimizing the non-matching features via transformation rules, as described in Yeh et al.[120])

Algorithm 8 IrrelevantMapping-p(mapping) (see Section 4.2.3, "Creating Operators")
Given: mapping
Return: boolean
  (a mapping is irrelevant if no feature of the concept matches the node's minikb)
  return empty(mapping)

Algorithm 9 RedundantMapping-p(mapping, node, concept) (see Section 4.2.3, "Creating Operators")
Given: mapping and node and concept
Return: boolean
  if mapping describes an isomorphic graph match between node and concept then
    return true
  else
    return false
  end if

Algorithm 10 InconsistentMapping-p(mapping, node, concept) (see Section 4.2.3, "Creating Operators")
Given: mapping and node and concept
Return: boolean
  newnode ← ApplyOperator(node, mapping)
  DC ← the deductive closure of newnode
  if DC is inconsistent then
    return true
  else
    return false
  end if
Figure 4.20: Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (Part 2/2)
The following pseudocode is for step 9.

Algorithm 11 ReorderNodeList(nodelist) (see Section 4.2.3)
Given: nodelist
Return: sortednodelist
  return Sort(nodelist :test 'Node-GT)

Algorithm 12 Node-GT(node1, node2) (see Section 4.2.3, "Control Strategy")
Given: node1 and node2
Return: boolean
  node1parent ← parent(node1)
  node2parent ← parent(node2)
  (prefer node1 if it is shallower than node2)
  if ply(node1) < ply(node2) then
    return true
  (else prefer node1 if it satisfies the goal test)
  else if GoalState-p(node1, QUERY) then
    return true
  (else prefer node1 if it adds fewer triples to its parent than node2 does, because each new triple is an abductive inference, which is unsound)
  else if (size(node1) − size(node1parent)) < (size(node2) − size(node2parent)) then
    return true
  end if
  return false
Figure 4.21: Pseudocode for step 9 of Figure 4.17
terminates because it employs breadth first search to systematically explore a finite
set of concepts in the knowledge base.
The question mediator is intended for use in an interactive environment.
Thus, it has to be responsive and exhibit good performance. Ideally, questions
should be answered within 60 seconds. In this setting, the question mediator cannot
be exhaustive in its search. Therefore, my implementation of the question mediator uses a time bound of 5 minutes.
Also, the question mediator employs heuristics to reorder operators on the
same ply of the search graph to give priority to promising ones. This reordering does
not affect the completeness or termination properties of best first search. The ques-
tion mediator also employs heuristics to reject operators that introduce redundant,
unrelated, or contradictory information into the minikb of a node. These operators
cannot lie on the shortest path to a goal (i.e., a node that answers the user’s ques-
tion). Thus, their removal from the search graph does not affect the completeness
property.
4.3 Related work
This section surveys other work related to my design of ASKME. I begin by dis-
cussing various controlled languages developed for different applications. I then
discuss related work on the domain-neutral ontology. Next, I discuss related work
on question taxonomies. Finally, I discuss related work that is similar to my question
mediator.
4.3.1 Restricted English
Controlled languages have been successfully deployed in industry and have been
shown to be useful in many technical settings (e.g., Caterpillar’s Fundamental
English[112], White’s International Language for Servicing and Maintenance[116],
and Perkins Approved Clear English[96]). The aerospace industry has also adopted
a common controlled language (ASD simplified technical English) in writing air-
craft maintenance documentation to facilitate their use by non-native speakers of
English[1]. The commercial success of controlled languages suggests that people
can indeed learn to work with restricted English. Although the majority of prior
work on controlled languages has been devoted to making text easier for people to
understand, there have also been several ongoing projects with computer process-
able languages to improve the computational processing of a text (e.g., Attempto
Controlled English(ACE)[45], Processable ENGlish(PENG)[103], and GINO[13]).
Recently, several controlled natural languages have been proposed for the
Semantic Web to represent formal statements with natural language, facilitating
their use by people with no background in formal methods[104].
Attempto Controlled English(ACE) is an ontology authoring language,
and is a subset of English designed “to provide domain specialists with an expressive
knowledge representation language that is easy to learn, read, and write”[45]. ACE
is characterized by a small number of construction rules that define its syntax, and
a small number of interpretation rules that disambiguate constructs that in full
English might be less clear. While looking like natural English, it can be translated
automatically and unambiguously into logic and every ACE text has a single and
well-defined formal meaning.
Processable ENGlish(PENG) is a controlled language designed for writ-
ing unambiguous and precise specification texts for knowledge representation[103].
For example, it can be used to annotate web pages with machine-processable in-
formation. While easily understood by speakers of the base language, it has the
same formal properties as an underlying formal logic language and so is machine-
processable. PENG covers a strict subset of standard English, and is precisely
defined by a controlled grammar and lexicon.
With some engineering work, ASKME could replace CPL with either ACE
or PENG. This is possible because both ACE and PENG translate natural language
into OWL representations which can be mechanically converted into representations
in the KM knowledge representation language.
4.3.2 Available Ontologies
There are a variety of resources to provide the domain-neutral ontology of ASKME.
Linguistically motivated ontologies such as WordNet[41], SENSUS[64], and the Gen-
eralized Upper model[11] have been developed to support sophisticated natural lan-
guage processing tasks. These ontologies have good coverage and access by including
the variety of concepts, relations, and modifiers common in everyday text. For each
English word, these ontologies give its senses along with their definitions, parts of
speech, subclasses, superclasses, and sibling classes. Although existing linguistically
influenced ontologies provide good coverage and access, they lack deep semantics,
beyond linguistic information and definitions in free text, which limits their use for
computer programs.
Other available ontologies provide deeper semantics, but they lack the cover-
age and access of linguistically influenced versions. Ontolingua[43] has rich seman-
tics, but only for restricted domains. Verbnet[60] is a cross-lingual, broad coverage
lexicon that incorporates both semantic and syntactic information about its con-
tents. However, its coverage is limited to verbs. Although the Cyc knowledge base
contains a large number of concepts representing commonsense knowledge[69], it is
difficult to use and finding the right concept often takes a long time[90].
4.3.3 Question categories
An important first step in building a question-answering system is to identify the
supported question types and the approaches necessary to answer them. A variety of
taxonomies have been proposed by researchers who have studied question-answering
in the fields of artificial intelligence[68, 101], computational linguistics[54, 113], li-
brary science[34, 92], and education[12, 48, 49, 77].
WH-questions
Who, Which, What, When, Where, Why, and How? is a simple classification of questions, commonly learnt in grade school as the standard
way to construct questions in English. Robinson and Rackstraw[98, 99] provide a thorough investigation of wh- words, the forms of questions based on these words,
and the forms of answers to these questions. Many information retrieval systems
use wh- words as the primary criterion for the analysis and logical representation of
questions[56, 65, 76]. These systems typically classify questions lexically into various
wh-word templates to retrieve answers from a set of documents. The simplicity of
classifying questions by their wh- headword is elegant, but considering only a question's form may lead to inadequate answers. To return a good answer, it is important to anticipate what the asker expects from the question.
QUALM
Other works on question taxonomies classify questions by considering both
the question and their expected answers. This line of work recognizes that answering
a question well requires identifying the question asker’s expectations so as to guide
the retrieval process to return relevant information. The QUALM system developed
a conceptual theory attempting to replicate the process by which humans understand
and answer questions[68]. The question taxonomy proposed in QUALM is based on
a theory of memory representation called Conceptual Dependency[100], in which the
essential meanings of actions are extracted and encoded as a smaller set of semantic
primitives. The questions answered by the QUALM system concerned short stories
of a few simple sentences, made up mostly of facts. QUALM is able to answer a
question about stories by consulting scripted knowledge that is constructed during
Type | QUALM | Graesser-Person | ASKME
verification | Did John do anything to keep Mary from leaving? | Is a fact true? Did an event occur? | Is it true that the nucleus is inside the cell?
disjunction | Was John or Mary here? | Is X, Y, or Z the case? |
concept completion | What did John eat? | Who? What? When? Where? What is the referent of a noun argument slot? | What is the duration of the move?
example | | What is an example label or instance of the category? | What is an example of a prokaryotic cell?
feature specification | What color are John's eyes? | What are the properties of X? | What are the parts of a human cell?
quantification | How many people are here? | How much? How many? | A human cell contains chromosomes. How many chromosomes?
definition | | What is X? | What is meiosis? An entity is inside a cell. What is the entity?
comparison | | How is X similar to Y? | What is the similarity between a plant and an animal?
interpretation | | What does X mean? |
causal antecedent | Why did John go to New York? | Why/how did X occur? |
causal consequence | What happened when John left? | What next? What if? |
goal orientation | For what purpose did John take the book? | Why did an agent do X? |
instrument/procedural | How did John go to New York? | How did an agent do X? |
enablement | What did John need to have in order to leave? | What enabled X to occur? |
expectational | Why didn't John go to New York? | Why didn't X occur? |
judgmental | What should John do to keep Mary from leaving? | What do you think of X? |
request | Would you please pass the salt? | |

Table 4.6: Available question categories in QUALM, Graesser-Person, and ASKME. For convenience, we reproduce verbatim the sample questions that were previously described in QUALM[68] and the Graesser-Person question taxonomy[49].
the story understanding. The taxonomy developed in QUALM comprises thirteen
types of questions (causal antecedent and consequent, goal orientation, enablement,
verification, disjunction, instrument/procedure, concept completion, expectation,
judgment, quantification, feature specification, and request)[68].
Graesser-Person question categories
The question taxonomy developed in QUALM forms the basis of a more spe-
cific taxonomy developed by Graesser-Person[49] (see Table 4.6). Graesser-Person’s
question taxonomy was developed by studying questions posed by individuals in a
variety of real world settings[20]. The taxonomy classifies questions according to
the nature of the information being sought that would constitute a good answer to
the question. For all the categories, Graesser conducted human studies to score the
reliability of the taxonomy, and found it to accommodate virtually all inquiries that
occur in a discourse[49].
Our eight question categories, which were derived empirically by analysing
sample AP exams and working with subject matter experts, map well onto categories
1 through 8 in Graesser-Person’s question taxonomy[49].
4.3.4 Question mediator
The mechanism I have described for the question mediator is similar to model
composition[37, 97] and analogical reasoning[58, 62, 67, 108, 111].
Model composition automatically selects information from a knowledge
base to form a minimal model adequate to perform a particular task [37, 97]. The
goal of model composition is to identify the domain knowledge that is pertinent
to a particular task. In model composition, the knowledge-base consists of model
fragments that are hand-crafted by experts. Each model fragment is conditioned on
the set of assumptions that prescribe which domain objects to include in the model,
what viewpoints to impose on them, and other simplifying information. These model
92
fragments are organized in a coarse- to fine-grained manner whereby coarse-grained fragments are domain independent and finer-grained fragments contain details specific to the domain. Each model fragment is carefully created by the
knowledge engineer and is tagged with pre- and post-conditions to specify how dif-
ferent model fragments can be composed together. By attending to the assumptions
accompanying each model fragment, the system ensures that the model it constructs
to solve a problem is consistent. Model composition facilitates reuse by showing that
different problems can be solved by composing different models; whereas, previous
systems would have required adding new axioms to the knowledge base to handle
different problems. A disadvantage to model composition is that it is very expen-
sive to employ knowledge engineers to craft the domain theory and to train users in
using the models.
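To make the mechanism concrete, the following sketch shows one way assumption-conditioned fragments could be selected and composed. The fragment structure, the greedy forward-chaining loop, and the example library are illustrative assumptions, not the actual machinery of the systems cited in [37, 97], which also enforce minimality and consistency of the composed model.

```python
# Illustrative sketch of assumption-driven model composition (hypothetical
# fragment format; not the actual representation used in [37, 97]).
from dataclasses import dataclass, field

@dataclass
class ModelFragment:
    name: str
    assumptions: set = field(default_factory=set)  # must hold before inclusion
    provides: set = field(default_factory=set)     # facts the fragment contributes

def compose_model(task_facts, library):
    """Forward-chain over the library, adding any fragment whose assumptions
    are satisfied by the task facts plus previously added fragments."""
    model, known = [], set(task_facts)
    changed = True
    while changed:
        changed = False
        for frag in library:
            if frag not in model and frag.assumptions <= known:
                model.append(frag)
                known |= frag.provides
                changed = True
    return model

library = [
    ModelFragment("point-mass", {"object"}, {"mass"}),
    ModelFragment("free-fall", {"mass", "negligible-air-resistance"}, {"acceleration=g"}),
]
print([f.name for f in compose_model({"object", "negligible-air-resistance"}, library)])
# -> ['point-mass', 'free-fall']
```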
Analogical reasoning uses a corpus of previous problem formulations and
their solutions to solve new, unseen problems[58, 62, 67, 108, 111]. A problem is
solved by finding the best example of a previous solution, which is then parameter-
ized, and instantiated with information in the new question for a reasoner to infer an
answer. Unlike model composition, analogical reasoning does not require a domain
theory (i.e., a knowledge base containing first principles). An advantage of analog-
ical reasoning approaches is their ability to solve unseen problems by looking up,
generalizing, and instantiating previous solutions. This allows analogical reasoning
approaches to avoid the expensive and complex task of carefully creating domain
theories (as required in traditional model composition approaches) to answer ques-
tions. However, a disadvantage to analogical reasoning approaches is their inability
to explain answers because they lack knowledge on a domain’s first principles.
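As a rough illustration of the retrieve-and-instantiate step, the sketch below finds the most similar stored problem by token overlap. Real analogical reasoners [58, 62, 67, 108, 111] use structural matching rather than this toy Jaccard similarity, and the case format here is hypothetical.

```python
# Toy case-based retrieval: pick the stored problem most similar to the new
# question, then reuse its parameterized solution (hypothetical case format).
def solve_by_analogy(question_tokens, case_base):
    def jaccard(case):
        stored = case["question_tokens"]
        return len(stored & question_tokens) / len(stored | question_tokens)
    best = max(case_base, key=jaccard)
    return best["solution_template"]  # caller instantiates with new bindings

case_base = [
    {"question_tokens": {"duration", "move", "distance", "speed"},
     "solution_template": "t = d / v"},
    {"question_tokens": {"force", "mass", "acceleration"},
     "solution_template": "F = m * a"},
]
print(solve_by_analogy({"duration", "fall", "height", "speed"}, case_base))  # -> t = d / v
```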
Model composition and analogical reasoning approaches to problem solving are at opposite ends of a spectrum with respect to the knowledge they use. In model composition, the knowledge is a domain theory, carefully created (at high cost) by experts. In analogical reasoning, by contrast, there is no domain theory (i.e., no first principles); the knowledge used for problem solving consists of examples of previously answered questions.
My work on the question mediator aims to realize the advantages of both model composition and analogical reasoning. The question mediator uses domain theories, created independently by SMEs, to answer unseen questions posed by novice users. Given a question formulation, the question mediator finds relevant problem-solving information in the SME-authored domain theory to create an adequate model for the reasoner to infer an answer. My work builds upon model composition in that it uses domain theories created by SMEs (which is highly cost-effective), and it also answers unseen questions using those domain theories (i.e., first principles).
4.4 Summary
This chapter discussed the ASKME approach to helping users interact with unfa-
miliar knowledge bases. The ASKME approach functions like a funnel that distills
a user’s intent, stated in natural language, into a suitable logical form that can
be used for problem solving with the underlying knowledge base. The four major
components of ASKME are: (a) a version of restricted English, (b) a general upper
ontology, (c) a set of well known question types, and (d) a software component called
the question mediator that identifies relevant information in the knowledge base for
problem solving.
The chapter also described an instantiation of the ASKME approach. This prototype is built from several existing technologies. The version of restricted English is called CPL and was developed by Boeing Phantom Works[30]. CPL has been deployed in many industrial and academic systems. ASKME also uses The University of Texas at Austin's Component Library (CLib)[8] as the general upper ontology.
Like CPL, the CLib has been successfully used in a variety of AI applications. The
set of well known question types supported by ASKME is a portion of a question
taxonomy that has been well studied and shown to be comprehensive and useful in
a variety of settings.
Finally, I described the design of the question mediator, the software compo-
nent which finds and applies relevant information in a knowledge base for problem
solving. The question mediator was described in three parts. The first is a search controller that systematically searches the knowledge base for information to answer questions. The second is a set of extensions to the search controller that resolve representational differences that may exist between a user's question and the knowledge base; such differences are common when the knowledge base is built by one group of users and used by another. The third is a set of heuristics for relevance reasoning, which is necessary to achieve good performance and scalability on large knowledge bases.
Chapter 5
Evaluation
5.1 Introduction
Knowledge base systems are brittle: they require users to pose questions in terms
of the ontology of the knowledge base. This requirement places a heavy burden on
the users to become deeply familiar with the contents of the knowledge base. As a
result, traditional knowledge base systems have historically been built and used by
the same knowledge engineers.
Ideally, I would like to develop a knowledge base system where the users and the builders of the knowledge base are two different groups. Achieving this ideal requires mitigating the difficulties faced by novice users in using unfamiliar knowledge bases for problem solving.
I have developed ASKME to progress toward this ideal. With ASKME, users can query a variety of unfamiliar knowledge bases by posing questions in simplified English. ASKME then interprets each question, identifies relevant information in the knowledge base, and produces an answer and explanation.
I evaluated ASKME on the task of answering AP-like exam questions on
the domains of college level biology, chemistry, and physics. The Project Halo team
chose the AP test as an evaluation criterion because it is a widely accepted standard
for testing whether a person has understood the content of a given subject. The
team chose the domains of college-level biology, chemistry, and physics because they
are fundamental, hard sciences and they stress different kinds of representations.
The evaluation consisted of a series of experiments to test whether ASKME achieves this ideal. The first measures ASKME's level of performance under ideal condi-
tions where the knowledge base is built and used by the same knowledge engineers.
Subsequent experiments measure ASKME’s level of performance under increasingly
realistic conditions, where, ultimately, in the final experiment, I measure ASKME’s
level of performance under conditions where the knowledge base is independently
built by subject matter experts and the users of the knowledge base are a different
group of novice users who are unfamiliar with the knowledge base.
The first experiment establishes the level of performance for ASKME operat-
ing under ideal conditions. The conditions are ideal in that: (a) the knowledge bases being queried were built by knowledge engineers, and (b) the questions are formu-
lated by the same people who built the knowledge base. These ideal conditions are
similar to the way knowledge based question-answering systems have traditionally
been used; the people who built the knowledge base were the ones who used it. This
experiment provides a baseline for judging the contribution of ASKME under less
ideal, but typical conditions.
The second experiment presents a brittleness study to provide a fair measure
of ASKME’s ability to answer AP-like questions. As in the first experiment, the
builders and the users of the knowledge are the same group of users. However, this
experiment used a set of unseen AP-like questions. ("AP (Advanced Placement) exams" are nationally administered college entry-level tests in the USA.) The purpose of the brittleness
study is to identify the major failures preventing questions from answering correctly
using ASKME.
The third experiment tests whether ASKME can continue to answer ques-
tions when the original knowledge base is exchanged for different ones that differ
in content and organization. This experiment provides a datapoint to judge the
generality of ASKME at answering questions using different knowledge bases.
The fourth experiment evaluates my conjecture that ASKME is able to an-
swer questions that are formulated by users who do not know the ontology of the
knowledge base to which they are posed. I describe an experiment that was con-
ducted by an independent evaluation team[72] that did not participate in the design
and development of ASKME. In the experiment, users with no training in using the
knowledge base were tasked to pose questions using ASKME to answer AP-like exam
questions. The answers and explanations returned by ASKME were then graded by
independent graders with experience at grading AP exams. This experiment pro-
vides a datapoint to judge whether novice users can successfully pose questions using
ASKME to answer questions with unfamiliar knowledge bases.
The fifth experiment evaluated ASKME’s level of performance when oper-
ating under realistic conditions, where the knowledge bases are independently built
by subject matter experts and are used by a different group of novice users who are
unfamiliar with them. The experiment was conducted by an independent evalua-
tion team[72] that did not participate in the design and development of ASKME. In
this experiment, a group of subject matter experts with little training in knowledge representation and reasoning was tasked to create a set of knowledge bases for the three science domains: biology, chemistry, and physics. After the knowledge bases were built, a different group of novice users was tasked to query the knowledge bases using ASKME to answer a set of AP-like questions. As a control, knowledge
engineers were also tasked to use ASKME to attempt question-answering with the
same knowledge bases. The answers and explanations returned by ASKME were
then graded by independent graders with experience at grading AP exams. This
experiment provides a datapoint to judge ASKME’s performance under realistic
conditions where the builders of the knowledge base are subject matter experts and
the users of the knowledge base are novice users.
5.2 Experiment #1: Establishing ASKME’s performance
under ideal conditions
I established ASKME’s level of performance operating under ideal conditions. The
conditions are ideal in that: (a) the knowledge bases being queried were built by
knowledge engineers; and (b) the questions are formulated by the same people who
built the knowledge base, thereby avoiding the difficulty of using unfamiliar knowl-
edge bases. The evaluation seeks to answer the following questions:
1. Can knowledge engineers pose questions using ASKME to answer a variety of
AP-like questions?
2. How often do questions formulated using ASKME contain relevant information
in the knowledge base for problem solving?
3. Does ASKME’s relevance reasoning improve performance, and if so, does it
work without sacrificing coverage?
The set of knowledge bases used in the evaluation (henceforth referred to as "reference knowledge bases") was created by knowledge engineers working in collaboration with subject matter experts. These knowledge bases are intended to cover a syllabus that consists of approximately 50 pages of a college-level science textbook[16,
21, 47] for each of the science domains: biology, chemistry, and physics[72]. Besides the set of domain knowledge bases, the evaluation also uses a significantly larger, multidomain knowledge base created by concatenating the reference knowledge bases for the three domains.
Reference Knowledge-base | Number of Concepts | Number of Tuples
Biology | 139 | 4789
Chemistry | 475 | 22956
Physics | 65 | 5796
Multidomain | 679 | 33541

Table 5.1: Four knowledge bases were used in the evaluation. The knowledge bases for each science domain were created by knowledge engineers working alongside subject matter experts. The significantly larger multidomain knowledge base was created by concatenating the contents of the individual domain knowledge bases.
The sizes of the reference knowledge bases (measured by the number of concepts and tuples) used in the evaluation are listed in Table 5.1.
The question set used in the evaluation covered a portion of an AP exam
and matched the syllabus of the reference knowledge base. These questions were
authored by teachers with experience teaching the AP curriculum in each domain.
The question set consists of 146 biology questions, 86 chemistry questions, and 131
physics questions[23].
The users participating in the question-answering exercise were the same knowledge engineers who built the knowledge base. They posed questions using ASKME to answer the set of AP-like questions. Altogether, the knowledge engineers
created 279 question formulations in biology, 238 question formulations in chemistry,
and 130 question formulations in physics. The number of question formulations
is larger than the number of questions because certain multiple choice questions
required several ASKME formulations to identify the correct option.
Each question formulation is tagged with appropriate answer snippets to
provide a quick, mechanical approach to test if an answer to a question is correct.
My test harness determines whether an answer is correct by comparing the generated
answer with answer snippets authored by subject matter experts. This enables the
test harness to automatically retry a question, causing ASKME to return different
answers, until the question is correctly answered or a time-bound is reached. The
testing harness automatically grades the answers in the question set by assigning 1
point for a correct answer and 0 points for an incorrect one.
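The following sketch shows how such a test harness might work. The `ask` callable standing in for ASKME, the snippet-matching rule, and the retry protocol are assumptions for illustration, not the harness's actual implementation.

```python
import time

def grade_question(ask, formulation, answer_snippets, time_bound_s=300):
    """Retry a formulation until a returned answer contains one of the
    SME-authored answer snippets or the time bound expires.
    Returns (points, retries): 1 point if correct, 0 otherwise."""
    deadline = time.monotonic() + time_bound_s
    attempt = 0
    while time.monotonic() < deadline:
        answer = ask(formulation, retry=attempt)  # each retry may yield a new answer
        if any(s.lower() in answer.lower() for s in answer_snippets):
            return 1, attempt
        attempt += 1
    return 0, attempt

# Usage with a stand-in answering function:
points, retries = grade_question(
    lambda q, retry: "A human cell contains 46 chromosomes",
    "How many chromosomes does a human cell contain?", ["46"])
print(points, retries)  # -> 1 0
```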
I have also built ablated versions of ASKME (called ASKME-W/O-QM and
ASKME-BFS-QM) to answer two questions: (a) How often do questions formulated
using ASKME address relevant information in the knowledge base for problem solv-
ing? and (b) Does the relevance reasoning performed by ASKME improve perfor-
mance, and, if so, does it work without sacrificing coverage? ASKME-W/O-QM is
a version of ASKME that does not extend a question with relevant information in
the knowledge base for problem solving. It does not contain the question media-
tor component and a question will not answer if formulated without reference to
required facts and axioms in the knowledge base. ASKME-BFS-QM is a version of
ASKME that searches the knowledge base exhaustively for information and reason-
ing methods to extend a question for problem solving. It does not contain a relevance
reasoner to prioritize promising solutions from the knowledge base. Table 5.2 sum-
marizes the differences between the regular version of ASKME, ASKME-W/O-QM,
and ASKME-BFS-QM.
The standard time-bound for the question mediator is 5 minutes. To maximize the number of correctly answered questions for experiment #1, I increased the time-bound to 20 minutes.
I use ASKME-W/O-QM to answer the question, “How often do questions
formulated using ASKME address relevant information in the knowledge base for
problem solving?”. The experiment involved using ASKME-W/O-QM to attempt
the set of questions formulated by the knowledge engineers. I then compared the set
of correctly answered questions between ASKME and ASKME-W/O-QM.
System version | Extends questions for problem solving? | Performs relevance reasoning?
ASKME | Yes | Yes
ASKME-W/O-QM | No | Not applicable
ASKME-BFS-QM | Yes | No

Table 5.2: Summary of the differences between the regular version of ASKME, ASKME-W/O-QM, and ASKME-BFS-QM. Relevance reasoning is not applicable to ASKME-W/O-QM because it does not search the knowledge base for information to extend a question for problem solving.
The comparison showed that the set of questions that answered correctly on both ASKME
and ASKME-W/O-QM were formulated with relevant information for problem solv-
ing, while the set of questions that answered with only ASKME were formulated
without relevant information for problem solving.
I conducted two tests using ASKME-BFS-QM to answer the question on
whether relevance reasoning improved ASKME’s performance. Both tests employ
the set of questions formulated by the knowledge engineers. First, I compared the
number of states explored by ASKME and ASKME-BFS-QM while answering ques-
tions using the individual domain knowledge bases. Second, I compared the number
of states explored by both versions of ASKME using the significantly larger multido-
main knowledge base. I can claim that relevance reasoning improves performance if the number of states explored by ASKME is lower than the number explored by ASKME-BFS-QM.
I answer the question of whether relevance reasoning worked without sacrific-
ing coverage by comparing the set of correctly answered questions between ASKME
and ASKME-BFS-QM. I can claim that relevance reasoning does not sacrifice cov-
erage if the set of correctly answered questions in ASKME subsumes the set of
correctly answered questions using ASKME-BFS-QM.
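Concretely, the coverage criterion reduces to a subset check over question identifiers, along the lines of the following sketch (the question IDs are hypothetical):

```python
# Relevance reasoning sacrifices no coverage iff every question answered by
# exhaustive search (ASKME-BFS-QM) is also answered by regular ASKME.
def coverage_preserved(answered_askme, answered_bfs_qm):
    return answered_bfs_qm <= answered_askme  # set subsumption

print(coverage_preserved({"bio-01", "bio-02", "chem-07"}, {"bio-01", "chem-07"}))  # -> True
```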
5.2.1 Results
The evaluation provides evidence concerning the following questions:
Question: What is ASKME’s level of performance when operating under ideal
conditions?
Domain | Total Questions | Correct | Incorrect | Percentage Score
Biology | 279 | 209 | 70 | 75.0%
Chemistry | 238 | 167 | 71 | 70.0%
Physics | 130 | 121 | 9 | 93.0%

Table 5.3: Correctness scores achieved by knowledge engineers posing questions using ASKME on the reference knowledge base
The correctness scores achieved by knowledge engineers on the question set
are listed in Table 5.3. I am gratified to find that knowledge engineers were successful
at using ASKME to answer a variety of AP-like questions. The results show the
ASKME approach (i.e., formulating questions using a restricted English and a set of
well-known question types) to be sufficient to answer a variety of AP-like questions
in various science domains.
Question: How often do questions formulated using ASKME address relevant in-
formation in the knowledge base for problem solving?
Domain | ASKME-W/O-QM | ASKME | Improvement | Answered with xforms applied
Biology | 145/279 | 204/279 | 41% | 47/279
Chemistry | 124/208 | 174/208 | 40% | 0/208
Physics | 44/130 | 119/130 | 170% | 0/130

Table 5.4: Effect of the question mediator and xform rules on answering questions with the reference knowledge-base
Next, I evaluated the contribution of the question mediator. I compared the correctness scores achieved by ASKME and ASKME-W/O-QM, which does not
extend questions with information from the knowledge base for problem solving. If
a question is formulated using ASKME-W/O-QM, without addressing information
in the knowledge base necessary for problem solving, it will fail to answer as the
problem solver will lack the required axioms, facts, and reasoning methods to infer
the correct answer.
The correctness scores for both ASKME and ASKME-W/O-QM are listed
in Table 5.4. The correctness scores using ASKME-W/O-QM are significantly lower
than the unablated version of ASKME. In biology and chemistry, only about half the questions can be answered using ASKME-W/O-QM. In physics, an even larger portion of the questions fail to answer with ASKME-W/O-QM. I believe this is
because many biology and chemistry questions contain domain specific terminology,
which implicitly addresses specific information in the knowledge base. In contrast,
many physics questions are stated canonically with very general terms and relations
that do not state all the facts and reasoning methods necessary to solve the problem.
The differences in correctness scores between ASKME and ASKME-W/O-
QM show that a significant number of questions that were correctly answered by
ASKME were formulated without regard for the knowledge base ontology. As a re-
sult, these questions do not reference relevant information in the knowledge base for
problem solving. Therefore, to achieve a high level of performance at question an-
swering, it is necessary for ASKME to automatically extend questions with relevant
information in the knowledge base for problem solving.
I also measured the contribution of the transformation rules used by ASKME
to resolve representational differences between questions and the knowledge base.
Column 5 in Table 5.4 reports the number of questions for which the answer im-
proved with the application of transformation rules. I found transformation rules to
be especially useful in answering biology questions. This is because many biology questions query the value of specific slots, and many of these slots are related. Thus, to mitigate the brittleness in question answering caused by users querying the wrong slots, transformation rules can be used to include additional related slots that are likely to answer the user's question.
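As an illustration of the idea (the rule table and KB encoding below are hypothetical, not ASKME's actual transformation-rule language), a slot query can be widened to related slots when the queried slot is empty:

```python
# Hypothetical slot-transformation rules: if a query on one slot finds nothing,
# retry with related slots that often hold the value the user is after.
RELATED_SLOTS = {
    "location": ["is-inside", "site", "destination"],
    "duration": ["elapsed-time"],
}

def query_with_xforms(kb, instance, slot):
    """Return (slot, value) for the first candidate slot with a value, else None."""
    for candidate in [slot] + RELATED_SLOTS.get(slot, []):
        value = kb.get((instance, candidate))
        if value is not None:
            return candidate, value
    return None

kb = {("Nucleus01", "is-inside"): "Cell01"}
print(query_with_xforms(kb, "Nucleus01", "location"))  # -> ('is-inside', 'Cell01')
```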
Question: Does ASKME’s relevance reasoning improve performance, and if so,
does it work without sacrificing coverage?
In order to ascertain ASKME’s performance on a variety of knowledge bases
that differ in content and organization, I attempted the questions using ASKME-
BFS-QM, which extends a question by exhaustively trying all possible subsets of
the knowledge base until the question is answered or a time-bound is reached.
Table 5.5 lists the number of states explored by ASKME and ASKME-BFS-
QM when answering the set of questions posed by knowledge engineers. In gen-
eral, ASKME explored fewer states when compared to ASKME-BFS-QM. Thus,
the heuristics employed by ASKME to reject irrelevant information and give pri-
ority to information that is highly related to the question appear to be useful and
necessary to achieving good performance. Table 5.6 lists ASKME’s runtime per-
formance at answering questions in the three domains. I found a majority of questions to exhibit good runtime performance using the regular version of ASKME.
The correctness scores for both versions of ASKME (with and without rel-
evance reasoning) are listed in Table 5.7. The gold-standard in coverage is the set
of questions answered by ASKME-BFS-QM, which exhaustively searched the entire
knowledge base for facts and reasoning methods relevant to the question for problem
solving. We found that on all knowledge bases the questions answered by ASKME-
BFS-QM were also answered by the regular version of ASKME (i.e., with relevance
reasoning). Additionally, the regular version of ASKME also required fewer retries
(a) Domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 5 | 1 | 2 | 17 | 80
Biology | ASKME-BFS-QM | 12 | 1 | 10 | 17 | 195
Chemistry | ASKME | 8.01 | 1 | 1 | 1 | 104
Chemistry | ASKME-BFS-QM | 28.46 | 1 | 1 | 1 | 495
Physics | ASKME | 4 | 2 | 4 | 10 | 58
Physics | ASKME-BFS-QM | 10 | 6 | 13 | 22 | 69

(b) Multi-domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 5 | 1 | 2 | 16 | 185
Biology | ASKME-BFS-QM | 36 | 1 | 10 | 92 | 704
Chemistry | ASKME | 7.56 | 1 | 1 | 1 | 104
Chemistry | ASKME-BFS-QM | 33.72 | 1 | 1 | 1 | 733
Physics | ASKME | 3 | 2 | 4 | 6 | 15
Physics | ASKME-BFS-QM | 97 | 1 | 232 | 235 | 243

Table 5.5: The average, median, 75th percentile, 90th percentile, and maximum number of states explored by both versions of the system (with and without relevance reasoning) for both the domain knowledge bases and the larger multi-domain knowledge base.
(a) Domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 24.13 | 2.37 | 9.44 | 33.83 | 1200.0
Biology | ASKME-BFS-QM | 113.17 | 1.91 | 20.93 | 173.44 | 1200.0
Chemistry | ASKME | 70.43 | 11.21 | 52.41 | 116.75 | 1200.0
Chemistry | ASKME-BFS-QM | 222.73 | 12.71 | 73.38 | 1200.0 | 1200.0
Physics | ASKME | 30.69 | 8.7 | 19.95 | 57.34 | 1200.0
Physics | ASKME-BFS-QM | 50.06 | 18.4 | 42.12 | 72.88 | 1200.0

(b) Multi-domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 28.59 | 6.12 | 28.54 | 62.78 | 1200.0
Biology | ASKME-BFS-QM | 235.89 | 5.07 | 72.86 | 1200.0 | 1200.0
Chemistry | ASKME | 55.37 | 11.49 | 44.66 | 117.64 | 1200.0
Chemistry | ASKME-BFS-QM | 195.07 | 13.16 | 93.07 | 1200.0 | 1200.0
Physics | ASKME | 53.85 | 26.16 | 54.68 | 101.7 | 1200.0
Physics | ASKME-BFS-QM | 636.27 | 1200.0 | 1200.0 | 1200.0 | 1200.0

Table 5.6: The average, median, 75th percentile, 90th percentile, and maximum runtime (in seconds) of both versions of the system (with and without relevance reasoning) for both the domain knowledge bases and the larger multi-domain knowledge base. In some cases, the heuristics used by ASKME contributed to worse runtime performance, but they were necessary to maximize correctness scores.
(a) Domain knowledge-base

Question set | System version | % correct | Average retries required
Biology | ASKME | 68.51 | 0.23
Biology | ASKME-BFS-QM | 68.18 | 0.27
Chemistry | ASKME | 73.11 | 0.36
Chemistry | ASKME-BFS-QM | 64.71 | 0.23
Physics | ASKME | 73.33 | 0.19
Physics | ASKME-BFS-QM | 70.67 | 0.57

(b) Multi-domain knowledge-base

Question set | System version | % correct | Average retries required
Biology | ASKME | 65.91 | 0.19
Biology | ASKME-BFS-QM | 64.29 | 0.26
Chemistry | ASKME | 71.01 | 0.29
Chemistry | ASKME-BFS-QM | 57.56 | 0.18
Physics | ASKME | 70.67 | 0.16
Physics | ASKME-BFS-QM | 56.00 | 0.54

Table 5.7: The correctness scores for versions of the system with and without relevance reasoning. Both versions achieved similar correctness scores on the domain knowledge bases, indicating that relevance reasoning did not sacrifice correctness. The version without relevance reasoning recorded lower correctness scores when answering physics questions with the significantly larger multi-domain knowledge base. This is due to the large number of states explored during blind search and the evaluation setup aborting an attempt after a time-bound is reached. This result highlights the need for the system to select only the most relevant portions of the knowledge base to reason with.
This indicates that the heuristics used in ASKME do not
sacrifice coverage and, in fact, enable ASKME to find the correct answer in fewer
tries.
I was pleasantly surprised to find that ASKME answers additional questions outside the gold-standard. This is due to the significantly larger number of states explored by ASKME-BFS-QM and my experimental setup aborting an attempt after a time-bound is reached.
The lower correctness scores and the higher number of retries required by ASKME-BFS-QM underscore the need for relevance reasoning to focus the search on the most relevant portions of the knowledge base.
5.3 Experiment #2: Brittleness analysis
To provide a fair measure of ASKME’s ability to answer AP-like questions, I con-
ducted a brittleness study with a set of unseen AP-like questions. The purpose of
the brittleness study was as follows:
1. Establish ASKME’s level of performance at answering unseen AP-like ques-
tions using the reference knowledge base.
2. Identify the major failures preventing unseen questions from correctly answer-
ing using ASKME and the reference knowledge base.
The unseen question set was authored by teachers with experience teaching
the AP curriculum for each domain[72]. The question set consisted of 128 questions
in biology, 131 questions in chemistry, and 100 questions in physics. The ques-
tions covered a portion of an AP exam and matched the syllabus of the reference
knowledge base.
The participants in this exercise were the same knowledge engineers who
built the reference knowledge base. Their task was to use ASKME to answer the
set of AP-like questions with the reference knowledge base.
The answers and explanations returned by ASKME were graded by school
teachers or college professors qualified to teach and evaluate AP courses. The answer
and explanation to each question was graded in an AP style and assigned full (2
points), partial (1 point), or no credit (0 points). For full credit, ASKME produced
the correct answer, or sufficient information to allow a naive reader to unambiguously
select the correct answer from the options provided in the question[72].
5.3.1 Results
Domain | Number of Questions | Answer credit: 0 pt | 1 pt | 2 pt | Correctness score: Sum | Percentage
Biology | 128 | 55 | 18 | 58 | 134 | 52.34%
Chemistry | 131 | 93 | 19 | 19 | 57 | 21.76%
Physics | 100 | 63 | 1 | 36 | 73 | 36.5%

Table 5.8: Correctness scores achieved by knowledge engineers on the unseen question set
The correctness scores achieved by the knowledge engineers on the unseen question set are listed in Table 5.8. The correctness scores achieved in this evaluation provide a gold-standard approximation of ASKME's ability to answer a variety of unseen AP-like questions using the reference knowledge bases.
5.3.2 Failure Categories
I conducted a failure analysis to identify the major sources of failure causing questions to receive no points or only partial credit. The rest of this section considers in detail the sources of failure for each domain.
Source | Affected Questions (70) | Notes
KB gap | 35 | Knowledge required is not in the KB
Unsupported question types | 15 | A general QA failure or the question is beyond the reasoning capabilities of QA, e.g., qualitative reasoning or simulation
Question formulation | 8 | Difficulty in formulating the question (includes missing vocabulary problems and unavoidable fidelity violations)
Explanation | 8 | Weakness of answer presentation
Syllabus | 4 | Question requires knowledge from outside the syllabus

Table 5.9: Failure analysis on why biology questions in the unseen question set fail to answer
Biology
Of the 128 questions in biology, 58 questions received full points and another 18
questions received partial credit. I examined 70 questions for sources of failure
causing questions to receive either no points or only partial credit. Table 5.9 suggests
that the majority of questions that failed to receive full credit did so due to gaps in
the knowledge base. However, there are also a significant number of questions that
failed because of unsupported question types that required reasoning mechanisms
such as qualitative reasoning and simulation. I also found a variety of questions that
are difficult to formulate using restricted English.
Chemistry
Of the 131 questions in chemistry, 19 questions received full credit and another 19
questions received partial credit. I examined 112 questions for sources of failure
causing questions to receive either no points or partial credit. Table 5.10 suggests
that the main problem in chemistry is an incomplete knowledge base.
Source | Affected Questions (112) | Notes
KB gap | 88 | Knowledge required is not in the KB
Unsupported question types | 18 | A general QA failure or the question is beyond the reasoning capabilities of QA, e.g., qualitative reasoning and simulation
Question formulation | 2 | Difficulty in formulating the question (includes missing vocabulary problems and unavoidable fidelity violations)
Explanation | 2 | Weakness of answer presentation
System bug | 1 | Problem with the question-asking interface (the question answers correctly if the Answer button is pressed without first hitting the Enter key)
Syllabus | 1 | Question requires knowledge from outside the syllabus

Table 5.10: Failure analysis on why chemistry questions in the unseen question set fail to answer
Although the reference KB is known to be incomplete, it is possible that some affected questions
might still not answer due to bugs and other shortcomings in ASKME.
Physics
Of the 100 questions in physics, ASKME answered 36 questions correctly and one
question received partial credit. I examined 64 questions for sources of failure caus-
ing questions to receive either no points or partial credit. Table 5.11 suggests that
the main problem is missing problem solvers; ASKME lacked mechanisms to qual-
itatively reason about equations. One question was found to be malformed, with an incoherent description. The other 23 incorrect questions were due to gaps in the knowledge base and bugs in ASKME.
Source | Affected Questions (64) | Notes
KB gap | 16 | The required knowledge does not appear (or is not correct) in the reference KB (mostly questions involving strings under tension)
Lacked problem solver | 40 | A general QA failure or the question is beyond the reasoning capabilities of QA, e.g., qualitative reasoning and simulation
System bug | 7 | A bug in QA (in the CPL post-interpreter) prevents QA from recognizing queries involving coefficients of friction; QA is unable to solve equations with multiple unknowns chained from different concepts; a QA problem with collating equation sets prevented use of appropriate KB content
Bad question | 1 | Incoherent question that does not make sense

Table 5.11: Failure analysis on why physics questions in the unseen question set fail to answer
5.3.3 Discussion
The Halo pilot study is the first phase of a projected multi-phase effort by Vulcan
Inc., whose ultimate goal is the creation of a “Digital Aristotle”, an expert tutor
in a wide variety of subjects[9, 44, 114]. The Halo pilot was a six-month effort
intended to assess the state-of-the-art in question answering with an emphasis on
deep reasoning. The Halo pilot phase demonstrated systems that were collabora-
tively built by different knowledge engineers. The effort was structured around the
challenge of answering a variety of AP Chemistry questions that focused on a por-
tion of the AP syllabus. The answers generated by the systems participating in
the Halo pilot were evaluated by graders with advanced expertise according to the
directives of the Educational Testing Service. Three contending systems were built
by Cycorp Inc., SRI International, and Ontoprise. The competing systems attained scores ranging from 38% to 52%, which is comparable to the mean human scores
in AP Chemistry[9, 44].
There are several differences between my question-answering evaluation on
the unseen question set and the Halo pilot. First, the questions in my evaluation are
posed using natural language and a significant number of the questions are posed
without references to the knowledge base for the reasoner to easily infer an answer.
Whereas in the Halo pilot, questions were formulated as logical expressions that
directly addressed information in the knowledge base for problem solving. Second,
my evaluation covered a non-trivial syllabus comprising 50 pages of a science text
book in three different science domains: biology, chemistry, and physics. In contrast,
the Halo pilot focused only on the chemistry domain. Third, the question set used in
my evaluation (consisting of 128 biology questions, 131 chemistry questions, and 100 physics questions) is significantly larger than the 50 questions used in the Halo pilot.
All things considered, ASKME’s performance on the unseen question set compares
favorably to the results achieved by the systems participating in the Halo pilot.
The failure analysis indicates that a majority of questions that failed to answer in the unseen question set did so because of gaps in the knowledge base or unsupported question types that required qualitative reasoning and simulation. Ad-
dressing these limitations is beyond the scope of this dissertation. The results show
ASKME to be effective at answering a variety of AP-like questions. First, I found
the available question types supported by ASKME to cover the set of AP-like ques-
tions used in the study. Second, the observed correctness scores show the available
question types to produce effective answers and explanations to answer the variety
of AP-like questions.
5.4 Experiment #3: Can questions continue to answer
with different knowledge bases?
Authors of knowledge bases make many modeling decisions during knowledge engi-
neering. Users of knowledge bases make assumptions during question formulation.
The result is a mismatch between questions and the representations required to
answer them. One of my claims is that ASKME can bridge the gap between the
formal representations of knowledge bases captured from domain users and the rea-
soners attempting to use those representations to answer questions. Specifically, I
claim that ASKME finds information to answer questions on different knowledge
bases whose content and organization are not known in advance (and may vary
significantly).
I assess this claim by measuring ASKME’s performance at answering ques-
tions using knowledge bases authored by different users. The set of knowledge bases used in this exercise was independently built by subject matter experts (SME-built) to cover the same syllabus as the reference knowledge base in each of the science domains. These subject matter experts were graduates or post-graduate students in biology, chemistry, and physics. They underwent one week of training and were given five weeks to build and test their knowledge bases. The subject matter experts worked independently, with little interaction with other users and with only occasional help when faced with technical difficulties using the knowledge formulation tool. The sizes of the different SME-built knowledge bases are listed in Table 5.12.
The experiment uses the set of question formulations authored by knowledge
engineers in experiment #1. The set of question formulations used in the experi-
ment consists of 308 question formulations in biology, 238 question formulations in
chemistry, and 150 question formulations in physics. Each question formulation is
tagged with appropriate answer snippets which enables my test harness to determine
whether an answer is correct by comparing the generated answer with the answer snippets.
Domain | Knowledge-base | Created by | # Concepts | # Tuples
Biology | Reference | KE | 139 | 4789
Biology | SME #1 | SME | 126 | 4505
Biology | SME #2 | SME | 226 | 8240
Biology | SME #3 | SME | 167 | 6016
Chemistry | Reference | KE | 475 | 22956
Chemistry | SME #1 | SME | 251 | 11830
Chemistry | SME #2 | SME | 282 | 7589
Chemistry | SME #3 | SME | 420 | 12818
Physics | Reference | KE | 65 | 5796
Physics | SME #1 | SME | 20 | 3734
Physics | SME #2 | SME | 8 | 1058
Physics | SME #3 | SME | 15 | 1461

Table 5.12: The knowledge bases used in the study of ASKME's ability to answer questions using a variety of knowledge bases that differ in content and organization. Aside from the reference knowledge bases, which were created by knowledge engineers working closely with subject matter experts, the other knowledge bases were independently authored by subject matter experts.
In the experiment, the test harness will automatically retry a question,
causing ASKME to return different answers, until the question is correctly answered
or a time-bound is reached.
For each knowledge base, in addition to measuring the number of questions
correctly answered when posed using ASKME, I also use ASKME-W/O-QM to iden-
tify the number of questions that directly referenced information in the knowledge
base for the problem solver to infer the correct answer. The differences in correct-
ness scores between ASKME and ASKME-W/O-QM indicates ASKME’s ability to
bridge the gap between the formal representations of knowledge bases captured from
domain users and the reasoners attempting to use those representations to answer
questions.
Domain | Knowledge base | ASKME-W/O-QM (%) | ASKME (%) | Improvement
Biology | SME #1 | 21.42 | 35.71 | 68%
Biology | SME #2 | 21.1 | 36.69 | 74%
Biology | SME #3 | 23.7 | 44.48 | 88%
Chemistry | SME #1 | 34.02 | 52.52 | 54%
Chemistry | SME #2 | 37.82 | 59.66 | 58%
Chemistry | SME #3 | 13.03 | 29.41 | 126%
Physics | SME #1 | 10.67 | 62.67 | 487%
Physics | SME #2 | 10.0 | 25.33 | 153%
Physics | SME #3 | 4.67 | 29.33 | 528%

Table 5.13: Correctness scores when reference question formulations are attempted on different KBs independently built by subject matter experts.
5.4.1 Results
Table 5.13 lists the correctness scores observed in the experiment. I found that the correctness scores achieved with the different SME-built knowledge bases varied significantly and were lower than the correctness scores achieved with the reference knowledge bases reported in Table 5.4. This was expected because the SME-built knowledge bases varied significantly in size and had less coverage than the reference knowledge bases authored by knowledge engineers.
I found ASKME to achieve significantly better scores than ASKME-W/O-
QM in all domains and knowledge bases. This suggests that ASKME is capable of
locating relevant information in a variety of knowledge bases that differ in content
and organization, to extend a question for problem solving. The techniques to
include related and finer-grained queries also improved answers to biology questions.
Thus, the results from the experiment support the claim that ASKME is useful in
finding information to answer questions on different knowledge bases.
5.5 Experiment #4: Can users pose questions using
ASKME to successfully query knowledge bases with
which they are unfamiliar?
I next test whether novice users can use ASKME to successfully interact with un-
familiar knowledge bases. This evaluation was conducted by an independent evalu-
ation team[72] that did not participate in the design and development of ASKME.
In this evaluation I am interested in answering the following questions:
1. How do novice users perform in the evaluation?
2. How does the performance of novice users compare with that of knowledge engineers?
3. Are there significant differences in their levels of performance across domains?
4. Is there any evidence of difficulties faced by novice users in using ASKME to
interact with knowledge bases with which they are unfamiliar?
The evaluation uses the same set of reference knowledge bases that were
created in experiment #1. The question set used in the evaluation is a subset of
the unseen question set that was first introduced in experiment #2. The choice of
questions by the external evaluator “was not random, nor were questions selected
to completely and evenly test the contents of the reference knowledge bases”[72].
Rather, the selection favored questions that were at least partially answerable by
ASKME using the reference knowledge base. The selection was also “calibrated to
avoid having too many questions that were easily answerable (ceiling effect) or too
many questions that were unanswerable (floor effect)”[72]. Altogether, 50 questions
were selected to form the question set used in this evaluation. Additional details on the question selection process are described in [72].
Nine undergraduates were recruited to participate in this evaluation: three
users in biology, three users in chemistry (one of whom did not finish and eventually dropped out of the evaluation), and three users in physics. These under-
graduates had little experience in knowledge representation and had no familiarity
with the knowledge base being queried. They underwent a four-hour training ses-
sion on the basics of using ASKME to pose and receive answers to AP-like questions
using simplified English[72]. After completing the training, the participants were
given 35 hours (over a three week period) to formulate questions in the evaluation
question set[72]. These novice users worked independently and did not discuss their
efforts with other users. They were also not allowed to solicit help except when
faced with technical problems related to the software.
The answers and explanations returned by ASKME were graded by two
independent graders on a scale of 0 to 2 points[72]. Each grader was a high school
AP teacher with at least three years of experience. Graders were asked to assign
full (2 points), partial (1 point), or no credit (0 points). The criteria for full credit
was that ASKME “produced the correct answer, or sufficient information to allow a
naive reader to unambiguously select the correct answer from the options provided
in the question”[72]. In cases where graders differed significantly, “a third grader
examined the question and graded the answer, otherwise the scores assigned by both
graders were averaged”[72].
5.5.1 Results
Question: How do novice users perform in the evaluation?
Table 5.14 lists the correctness scores observed in the evaluation, and Figure 5.1 shows the distribution of answer credit scores for both the novice users and the knowledge engineers.
!"#$%&'(&%)*+'
,''-.%#/0"#'
1'
2'
31'
32'
41'
42'
51'
52'
1'6+' 172'6+' 3'6+' 372'6+' 4'6+'
89'
:*0;0<=>?#%&>3'
:*0;0<=>?#%&>4'
:*0;0<=>?#%&>5'
(a) biology
!"#$%&'(&%)*+'
,'
-'
.,'
.-'
/,'
/-'
,'0+' ,1-'0+' .'0+' .1-'0+' /'0+'
23'
45%6*#+&789#%&8.'
45%6*#+&789#%&8/'
:'';<%#=>"#'
(b) chemistry
!"#$%&'(&%)*+'
,'
-'
.,'
.-'
/,'
/-'
0,'
0-'
,'1+' ,2-'1+' .'1+' .2-'1+' /'1+'
34'
567#*(#89#%&8.'
567#*(#89#%&8/'
567#*(#89#%&80'
:'';<%#=>"#'
(c) physics
Figure 5.1: Answer credit distributions among novice users and the knowledge en-gineer in experiment #4
Biology domain: KE 76.0%, Biology-User-1 74.5%, Biology-User-2 79.5%, Biology-User-3 71.5%
Chemistry domain: KE 42.0%, Chemistry-User-1 50.5%, Chemistry-User-2 48.5%, Chemistry-User-3 N.A.
Physics domain: KE 69.5%, Physics-User-1 33.0%, Physics-User-2 34.5%, Physics-User-3 52.5%

Table 5.14: Answer credit scores observed in experiment #4's question-answering evaluation
The results show that novice users can achieve scores that are comparable to those achieved by the knowledge engineers in the biology and chemistry domains. This result suggests that novice users with four hours of training in using ASKME can effectively use the system to interact with unfamiliar knowledge bases.
I was pleasantly surprised to find instances in biology and chemistry where
novice users achieved results that were better than those achieved by the knowledge
engineers who built the knowledge base. We believe that “instances where biology
and chemistry novice users achieve better scores than knowledge engineers was be-
cause they had more time and were instructed to try as many question formulations
as they needed to, in order to elicit from the system answers and explanations that
addressed the original English question”[72]. Novice users were also told that if
ASKME could not answer a question correctly, they should ask simpler questions
and try to probe the extent of ASKME’s knowledge about the topic. In contrast,
the knowledge engineers’ attempts at question formulation focused on achieving a
high fidelity with the original questions (i.e., generic and faithful), and “did not nec-
essarily spend much effort “backing off” to ask increasingly simple questions about
the topic”[72].
The phenomenon of novice users achieving better results than knowledge engineers was not detected in physics because question answering there depends more on rules and reasoning than on facts. There were few questions in the evaluation question set that could be answered with partial answers, so novice users were not able
to elicit significant amounts of useful information using simple queries. Questions
in physics also required deep domain knowledge to fill in necessary assumptions
expected by the domain models in the knowledge base.
The results show ASKME to be usable by novice users to interact with
unfamiliar knowledge bases for the task of answering AP-like exam questions.
Question: Are there significant differences in performance among novice users and
the knowledge engineer?
I answer this question by testing if there are significant differences in the
answer credit scores among novice users and the knowledge engineers.
Biology: An ANOVA (analysis of variance) found no significant differences in the distribution of answer credit scores among the three novice users and the knowledge engineer (F(3, 196) = 0.53, p > 0.10). By analysing the CI (confidence interval) for the difference between the mean of the novice users and the mean of the knowledge engineer, I found with 95% certainty
that the knowledge engineer has no more than 12.3% advantage at employing the
knowledge base for answering biology questions. This result shows that deep famil-
iarity with the knowledge base is not a critical advantage in achieving good results
for the biology domain, and that novice users are capable of achieving comparable
results to knowledge engineers.
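For readers who want to reproduce this style of analysis, the following sketch shows one way the ANOVA and the confidence-interval bound could be computed with scipy. The per-question credit scores are randomly generated placeholders (the real data are not reproduced here), and the normal-approximation CI is my assumption, since the exact CI construction is not specified above.

```python
# Sketch of the biology significance test: a one-way ANOVA over per-question
# answer credits (50 questions x 4 users -> F(3, 196)), plus a 95% upper bound
# on the KE-vs-novice mean advantage. Placeholder data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ke, u1, u2, u3 = (rng.choice([0, 1, 2], size=50) for _ in range(4))

f_stat, p_val = stats.f_oneway(ke, u1, u2, u3)

novices = np.concatenate([u1, u2, u3])
diff = ke.mean() - novices.mean()
se = np.sqrt(ke.var(ddof=1) / ke.size + novices.var(ddof=1) / novices.size)
upper = diff + 1.96 * se  # normal-approximation 95% upper bound on the advantage
print(f"F={f_stat:.2f}, p={p_val:.3f}, KE advantage <= {100 * upper / 2:.1f}%")
# /2 converts credit points (max 2 per question) into a percentage.
```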
Chemistry: An ANOVA found no significant differences in the distribution of answer credit scores between the two novice users and the knowledge engineer (F(2, 147) = 0.29, p > 0.10). By analysing the CI for the difference between the mean of the novice users and the mean of the knowledge engineer, I found with 95% certainty that the novice users have no more than 19.7% advantage at employing the knowledge base
for answering chemistry questions. This result shows that deep familiarity with the
knowledge base is not a critical advantage in achieving good results for the chemistry
domain, and that novice users are capable of achieving results that surpass those
achieved by the knowledge engineers.
Physics: An ANOVA found significant differences in the distribution of answer credit scores among the three novice users and the knowledge engineer (F(3, 196) = 8.02, p < 0.05). A priori contrast tests (Bonferroni-corrected T-tests) show the
distributions to be significantly different between the following pairs of users:
KE > Physics-User-1, (t(98) = 4.31, p < 0.05)
KE > Physics-User-2, (t(98) = 4.26, p < 0.05)
There is no significant difference between the knowledge engineer and Physics-User-
3, or among Physics-User-1, Physics-User-2, and Physics-User-3. The significant differences in the correctness scores between the knowledge engineer and Physics-User-1 and Physics-User-2 suggest that deep familiarity with the domain or the physics knowledge base is necessary to achieve good results. This is because many
questions in the physics domain required the user to specify key facts about the
problem (e.g., making explicit the acceleration of a move when there is negligible
air resistance) that might not be obvious to someone not trained in physics or
someone who is unfamiliar with the knowledge base being queried.
Question: Do novice users use ASKME equally well across domains?
An ANOVA found significant differences in the distribution of answer credit scores across the three domains (F(2, 147) = 14.94, p < 0.05). Contrast tests (Bonferroni-corrected T-tests) show that the distributions are significantly different be-
tween the biology and chemistry domains (t(98) = 3.89, p < 0.05), and between
the biology and physics domains (t(98) = 6.10, p < 0.05). There is no significant
difference between the chemistry and physics domains (t(98) = 1.28, p > 0.1).
I believe the differences between biology and chemistry and between biology and physics are due to the nature of biology questions, which rely heavily on querying facts and less on performing complex reasoning. These questions involve simple
inference and often return sufficient information to satisfy partial or full credit. Ad-
ditionally, I have also observed users posing follow-up questions to elicit additional
details to further improve their answers. This is in contrast to physics questions,
where a majority of questions involved equation solving. Such questions are brittle
in the sense that ASKME requires all facts expected by the applicable equation to
be explicitly stated to compute an answer; otherwise, no solution is returned. This explanation accounts for why few physics questions received partial credit and why the differences in correctness scores between the novice users and the knowledge engineer are so large for the physics domain.
Question: Were there noticeable difficulties faced by novice users in using ASKME?
How can the individual novice users achieve better results?
I first examined the distribution of answer credit scores among the different
users. The questions answered by the novice users varied in distribution. There
were many instances where a question was answered by a particular user but not
others. The results suggest that users could potentially achieve higher levels of
performance if they cooperated on the question-answering task (e.g., by sharing
information on best practices). For example, if I aggregated the highest scores for
each question achieved among the three biology users, the resulting score is 93.5%,
which is significantly higher than the scores achieved by individual novice users, the
average achieved by the novice users, or the knowledge engineer (see Table 5.15).
Calculation method | Biology | Chemistry | Physics
aggregated (union) | 93.5 | 57.5 | 64.5
averaged | 75.17 | 49.5 | 40.0

Table 5.15: Aggregated and average correctness scores achieved by novice users in experiment #4
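The aggregated score in Table 5.15 can be computed by taking, for each question, the best credit any user earned. The sketch below shows the calculation on placeholder credits (0, 1, or 2 points per question), not the actual evaluation data.

```python
# Aggregated (union) vs. averaged correctness over per-user credit lists
# (placeholder data; each question is worth a maximum of 2 points).
def aggregated_and_averaged(per_user_credits):
    n = len(per_user_credits[0])
    best = [max(user[q] for user in per_user_credits) for q in range(n)]
    aggregated = 100 * sum(best) / (2 * n)
    averaged = sum(100 * sum(u) / (2 * n) for u in per_user_credits) / len(per_user_credits)
    return aggregated, averaged

print(aggregated_and_averaged([[2, 0, 1], [0, 2, 1], [1, 1, 2]]))  # -> (100.0, 55.55...)
```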
User (ranked by score) | Score | #Formulations (Sum / Avg / Median) | #Minutes (Sum / Avg / Median)
Bio-User-2 | 79.5% | 870 / 17.4 / 15 | 678 / 13.56 / 8
Bio-User-1 | 74.5% | 453 / 9.06 / 7 | 253 / 5.06 / 4
Bio-User-3 | 71.5% | 311 / 6.22 / 4 | 1013 / 20.3 / 7.5
Chem-User-1 | 50.5% | 576 / 11.52 / 11 | 950 / 19.0 / 13.5
Chem-User-2 | 48.5% | 504 / 10.08 / 6.5 | 754 / 15.08 / 4.5
Phys-User-3 | 52.5% | 884 / 17.68 / 13.5 | 901 / 18.02 / 14
Phys-User-2 | 34.5% | 1121 / 22.42 / 19 | 1610 / 32.2 / 24
Phys-User-1 | 33.0% | 787 / 15.74 / 15 | 1044 / 20.08 / 19.5

Table 5.16: Data on the number of formulations and time spent per question by novice users in experiment #4
Although users were able to create a correct formulation of the question in
many cases, they were rarely able to do so in their first attempt. I next examine
the amount of effort spent by users at getting questions to answer using ASKME.
Besides correctness scores, I have found four other metrics useful in quantifying the
difficulties faced by the user: the mean number of question attempts; the median
number of question attempts; the time spent by the user formulating a question
before getting a satisfactory answer; and the time spent by the user formulating a
question before giving up. Table 5.16 shows these statistics for the different novice
users participating in the evaluation.
Among the three domains, it appears that physics users spent significantly more effort formulating questions that answered when compared to biology and chemistry users. One explanation is that a majority of physics questions were complex "story problems" which required more effort to rephrase using restricted English. Another explanation is that physics questions often contain implicit informa-
tion and because ASKME cannot automatically identify and install assumptions,
users have to invest additional effort to manually identify and state implicit facts
during question formulation.
I also investigated whether there were significant differences in the number
of attempts and time spent per question among users in each domain.
Biology: An ANOVA found significant differences in the number of formulations attempted (F(2, 147) = 22.05, p < 0.05). A priori contrast tests (Bonferroni-corrected T-tests) show the distributions for formulations attempted are significantly
different for the following pairs of users:
Biology-User-2 > Biology-User-1, (t(98) = 4.31, p < 0.05)
Biology-User-2 > Biology-User-3, (t(98) = 5.87, p < 0.05)
An ANOVA found significant differences in time spent per question among the three novice users in biology (F(2, 147) = 6.61, p < 0.05). A priori contrast tests
(Bonferroni corrected T-tests) show the distributions for time spent per question
are significantly different for the following pairs of users:
Biology-User-2 > Biology-User-1, (t(98) = 4.29, p < 0.05)
Biology-User-3 > Biology-User-1, (t(98) = 3.18, p < 0.05)
At face value, I can claim that additional efforts by biology users in terms of the
number of formulations and time spent per question translate into higher correctness
scores. However, an earlier ANOVA did not find the differences in the correctness
scores among the novice users to be significant. Thus, I am gratified to report that, even though there were significant differences in the amount of effort put forth by the biology users, the resulting correctness scores were similar to one another and comparable to the knowledge engineer's. This result suggests that ASKME can be used effectively
by novice users to query unfamiliar biology knowledge bases after only four hours
of training in using the system.
Chemistry: A T-Test found no significant differences in the number of formulations
attempted between the two novice users in chemistry (t(98) = 0.78, p > 0.1). A T-
Test also found no significant differences in time spent per question between the
two novice users (t(98) = 1.01, p > 0.1). An earlier ANOVA in experiment #4
found no significant differences in the distribution of answer credit scores between
the novice users and the knowledge engineer in the chemistry domain. Since there are no significant differences in the number of formulations attempted, the time spent per question, or the achieved answer credit scores, I conclude that ASKME can be used effectively by novice users to query unfamiliar chemistry knowledge bases after four hours of training in using the system.
Physics: An ANOVA found significant differences in the number of formulations
attempted among the three novice users in physics (F (2, 147) = 3.66, p < 0.05).
An ANOVA also found significant differences in time spent per question among the
three novice users (F (2, 147) = 8.01, p < 0.05). A priori contrast tests (Bonferroni
corrected T-tests) show the average number of formulations attempted to be
significantly different between:
Physics-User-2 > Physics-User-1, (t(98) = 2.65, p < 0.05)
A priori contrast tests (Bonferroni corrected T-tests) show that the average time
spent per question is significantly different between the following pairs of users:
Physics-User-2 > Physics-User-1, (t(98) = 2.91, p < 0.05)
Physics-User-2 > Physics-User-3, (t(98) = 3.53, p < 0.05)
In biology and chemistry, I found that users who spent more effort trying different
question formulations achieved better scores. Oddly enough, this was not the case
in physics. The top physics user achieved significantly better scores with fewer
formulations. In my efforts to understand why, I found that the top performing
physics user had a deep curiosity about the internals of the reference knowledge
base and ASKME. Through his experimentation, he quickly became familiar with
the reference knowledge base and gained a good understanding of how ASKME
worked. I believe his advanced expertise at using ASKME and the reference
knowledge base for problem solving accounts for his level of performance, which far
exceeded that of his peers.
5.6 Experiment #5: Establishing ASKME’s performance
under production conditions
My final evaluation tests ASKME in a setup that mimics the conditions of the
ideal knowledge base system described in Chapter 1. The conditions mimic a
production system because: (a) the knowledge bases were built independently by
subject matter experts without any help from knowledge engineers; and (b) the
questions are formulated by a different group of novice users who possess limited
domain expertise and are unfamiliar with the contents of the knowledge base being
queried. These conditions create a substantial challenge for automated question
answering systems: to be able to answer questions that are formulated without
regard for (or even familiarity with) the knowledge base that is expected to answer
them. This evaluation was conducted by the same independent evaluation team[72]
who were involved in experiment #4. In this evaluation I am interested in answering
the following questions:
1. What is the performance of novice users and knowledge engineers at using
ASKME to query unfamiliar knowledge bases?
2. Are there significant differences between novice users and knowledge engineers?
3. Are there any significant differences in their levels of performance across do-
mains?
As I discuss the results of this evaluation, I will also relate the results to
conclusions from the previous evaluation of whether novice ASKME users can suc-
cessfully query reference knowledge bases with which they are unfamiliar.
The knowledge bases used in this evaluation were created independently
by subject matter experts without any help from knowledge engineers[72]. They are
intended to cover the same syllabus as the reference knowledge bases in each of the
science domains. These subject matter experts were graduates or post-graduate stu-
dents in biology, chemistry, and physics. They underwent one week of training and
were given five weeks to build and test their knowledge bases. The subject matter
experts worked independently and had little interaction with other users. They
received “only occasional help when faced with technical difficulties using the
knowledge formulation tool”[72].
The evaluation uses the same set of questions from experiment #4. Each
domain was assigned two novice users to attempt question-answering. The novice
users underwent a four-hour training session on the basics of using ASKME to
pose questions and receive answers to AP-like questions using simplified English[72].
After completing the training, the participants were given 35 hours (over a three
week period) to formulate questions in the evaluation question set[72]. These novice
users worked independently and did not discuss their efforts with other users. They
were only allowed to solicit help when faced with technical problems related to the
software[72].
As a control, knowledge engineers were also tasked to perform question-
answering on the independently built knowledge bases. These knowledge engineers
were unfamiliar with the knowledge base being queried; however, they enjoyed a
number of advantages over the novice users. First, the knowledge engineers are trained in
knowledge representation and reasoning. Second, the knowledge engineers, having
designed and implemented ASKME, are deeply familiar with using it for question-
answering. Third, the knowledge engineers cooperated in the question-answering
task.
5.6.1 Results
Question: What is the performance of novice users and knowledge engineers at
using ASKME to query unfamiliar knowledge bases?
Biology domain           Chemistry domain           Physics domain
KE               34.5%   KE                 46.0%   KE               54.0%
Biology-User-A   54.0%   Chemistry-User-A   34.5%   Physics-User-A   13.5%
Biology-User-B   47.5%   Chemistry-User-B   55.0%   Physics-User-B   18.0%

Table 5.17: Answer credit scores achieved by knowledge engineers and novice users
on knowledge bases that are independently authored by subject matter experts.
Table 5.17 lists the correctness scores observed in the experiment and Figure
5.2 shows the distribution of answer credit scores among novice users and the knowl-
edge engineers. The results continue to show ASKME’s ability to help novice users
to query unfamiliar knowledge bases. The biology result is surprising, as it appears
to show that novice users can use ASKME to achieve better results than knowledge
engineers. One explanation for this surprising result is that the knowledge engineers
worked quickly to finish the question-answering task, while the novice users were
asked to spend more time exhaustively trying a variety of question formulations to
maximize their scores.

[Figure 5.2: Score distribution among novice users and the knowledge engineer in
experiment #5. Panels (a) biology, (b) chemistry, and (c) physics plot the number
of questions at each answer credit level (0 pt to 2.0 pt) for the knowledge engineer
and the two novice users in that domain.]
Question: Are there significant differences in performance among novice users and
the knowledge engineer?
I answer this question by testing if there are significant differences in the
answer credit scores among novice users and the knowledge engineers.
Biology: An ANOVA found marginally significant differences in performance among
novice users and the knowledge engineers for the biology domain (F (2, 147) =
2.95, 0.05 < p < 0.10). By analyzing the confidence interval for the difference
between the mean of the novice users and the mean of the knowledge engineer, I
found with 95% certainty that the novice users have no more than a 29.1% advantage
at employing the knowledge base for answering biology questions.
Chemistry: An ANOVA also found marginally significant differences in perfor-
mance among novice users and the knowledge engineer for the chemistry domain
(F (2, 147) = 2.48, 0.05 < p < 0.10). By analyzing the confidence interval for the
difference between the mean of the novice users and the mean of the knowledge
engineer, I found with 95% certainty that the knowledge engineer has no more than
a 16.39% advantage at employing the knowledge base for answering chemistry questions.
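A minimal sketch of the confidence interval behind these advantage bounds, assuming
per-question credit scores are available for each group (the scores below are synthetic,
and the evaluation's exact CI procedure may have differed):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    novice = rng.uniform(0.0, 2.0, 100)  # pooled credits of the two novice users
    ke = rng.uniform(0.0, 2.0, 50)       # knowledge engineer's credits

    diff = novice.mean() - ke.mean()
    se = np.sqrt(novice.var(ddof=1) / len(novice) + ke.var(ddof=1) / len(ke))
    df = len(novice) + len(ke) - 2       # simple pooled-df approximation
    half = stats.t.ppf(0.975, df) * se
    print(f"mean difference = {diff:.3f}, "
          f"95% CI = [{diff - half:.3f}, {diff + half:.3f}]")

The upper endpoint of such an interval yields the “no more than X% advantage”
statements above.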
Physics: An ANOVA found significant differences between the knowledge engineer
and the novice users for the physics domain (F (2, 147) = 18.15, p < 0.05). This
result is consistent with my earlier finding that familiarity with the knowledge base
is necessary to achieve good results in physics. A priori contrast tests (Bonferroni
corrected T-tests) show the distributions to be significantly different for the following
pairs of users:
KE > Physics-User-A, (t(98) = 5.34, p < 0.05)
KE > Physics-User-B, (t(98) = 4.50, p < 0.05)
Question: Are there any significant differences in their levels of performance across
domains?
In the experiment, an ANOVA found significant differences in the distribution
of answer credit scores across the three domains (F (2, 147) = 16.08, p < 0.05). Con-
trast tests (Bonferroni corrected T-tests) show that the distributions are significantly
different between biology and physics domains (t(98) = 4.53, p < 0.05), and between
the chemistry and physics domains (t(98) = 5.96, p < 0.05). There is no significant
difference between the biology and chemistry domains (t(98) = 0.81, p > 0.1). This
result is consistent with my earlier result in experiment #4 and supports my obser-
vation that biology is an easier domain, with many opportunities for partial credits,
and that novice users can successfully use ASKME to query biology knowledge bases.
Question: Were there noticeable difficulties faced by novice users in using ASKME?
How can the individual novice users achieve better results?
Calculation method    Biology   Chemistry   Physics
aggregated (union)    67.5      60.5        28.5
averaged              50.75     44.75       15.75

Table 5.18: Aggregated and average correctness scores achieved by novice users in
experiment #5.
Results from the evaluation provide further anecdotal evidence that novice
users could have achieved significantly better results if they had cooperated, or
were given more time for the question-answering task. The questions answered
by the novice users varied in distribution and there were many instances where a
question was answered by a particular user but not others (see Table 5.18). I found
the aggregated scores, calculated by picking the highest achieved score for each
question, to be significantly higher than the scores achieved by individual users
and their average. Thus, I believe if the users were to share information on best
practices, they would be likely to achieve even better results.
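The two calculation methods in Table 5.18 are easy to state precisely. In this
hypothetical two-user sketch, aggregation takes the best credit either user earned
on each question, so it can never fall below the average of the users' totals:

    # Hypothetical per-question credits for two novice users on four questions.
    user_a = [2.0, 0.0, 1.5, 0.0]
    user_b = [0.0, 2.0, 1.0, 0.5]

    aggregated = sum(max(a, b) for a, b in zip(user_a, user_b))  # union score
    averaged = (sum(user_a) + sum(user_b)) / 2
    print("aggregated:", aggregated, "averaged:", averaged)  # 6.0 vs 3.5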
5.7 Summary
I assessed the performance of ASKME on the task of answering questions like those
found in the AP exam. The question-answering task involved users posing questions
to the system to retrieve solutions from the knowledge base to answer a set of AP-like
questions. The evaluation consists of successive experiments to test if ASKME can
help novice users to use unfamiliar knowledge bases for problem solving. The first
experiment measured ASKME’s level of performance under ideal conditions, where
the knowledge base is built and used by the same knowledge engineers. Subsequent
experiments measured ASKME’s level of performance under increasingly realistic
conditions. In the final experiment, I measured ASKME’s level of performance
under conditions where the knowledge base is independently built by subject matter
experts and its users are a different group of novice users unfamiliar with the
knowledge base. Results from the evaluation show that ASKME works well on
different knowledge bases and answers a broad range of questions that were posed
by novice users in a variety of domains.
Chapter 6
Contributions and Future Work
This dissertation has presented an approach to address the difficulty faced by novice
users in using unfamiliar knowledge bases for problem solving. This approach was
implemented in a system, called ASKME, for novice users to answer AP-like exam
questions using unfamiliar knowledge bases originally built by subject matter ex-
perts. This chapter summarizes the ASKME approach’s research contributions as
well as various avenues for future research suggested by the experience of developing
ASKME.
6.1 Contributions
Authors of knowledge bases make many modeling decisions during knowledge en-
gineering. Knowledge base users make assumptions during question formulation.
The result is a mismatch between questions and the representations required to an-
swer them. Systems that reason over formal representations of domain knowledge
often impose strict requirements on the form and content of the representations.
Deviations from the expected form cause many automated reasoners to fail. This
brittleness makes knowledge base systems difficult to build and imposes a steep
learning curve on the user.
6.1.1 Prior art
Early knowledge base systems responded to questions with answers explicitly stored
in databases. These systems work either by pattern matching keywords in a user’s
question with terms in the database or by translating the question into a particular
set of commands to navigate the information in the database. The utility of early
knowledge base systems was limited to answering questions whose answers were
explicitly stored in the database. Thus, questions that required reasoning with
domain expertise could not be answered by these database accessor systems.
The next generation of systems - aptly called expert systems - added in-
ference rules to reason with domain expertise. Expert systems mimic the problem
solving behavior of experts and are typically used to solve problems that do not have
a single correct solution that can be encoded in a conventional algorithm or stored
in a database. The knowledge base in an expert system is narrowly focused and en-
gineered to perform well on specific questions predetermined at design time. Users
interact with expert systems using an interface that restricts the questions that can
be posed to the system. When users pose questions via the user interface, the re-
quired logical forms are then generated and used to answer the queries. Arguably,
the performance of expert systems would have suffered if a different set of users, un-
familiar with knowledge representation or the knowledge base being queried, posed
the questions. The usability of early expert systems has not been evaluated, but
generally these systems were used only by the engineers who built them.
6.1.2 A General Framework for querying unfamiliar knowledge bases
My work advances the state of the art in knowledge based question answering by
addressing the brittleness due to the arm's-length separation between the builders
and the users of the knowledge base. This is especially challenging when the system
is meant to answer questions posed by users who are unfamiliar with knowledge
representation or the content and organization (i.e., the ontology) of the knowledge
bases.
I studied ASKME for the task of answering questions posed by users who
are unfamiliar with the knowledge base. To make the task easier for computers,
ASKME requires users to formulate their questions with a version of restricted
English and a limited number of question types. The ASKME approach works
for the following reasons. First, it has been shown that users can quickly learn
and effectively use simplified English. Second, studies have also found that users
can easily use a small vocabulary to represent a variety of meanings. Third, it has
been shown that a small set of well-known question types is sufficient to answer a
wide variety of questions. Fourth, I have designed a computer program (called the
question mediator) to answer users' questions by identifying relevant information in
the knowledge base for problem solving.
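To illustrate the shape of this design (all names below are hypothetical and do
not reflect ASKME's actual interfaces), the question mediator can be pictured as
a dispatcher from a small, fixed set of question types to knowledge base queries:

    # Toy sketch of a question mediator: a fixed set of question types, each
    # mapped to a different way of consulting the knowledge base.
    class ToyKB:
        facts = {"mitochondrion": "an organelle that produces ATP"}

        def describe(self, term: str) -> str:
            return self.facts.get(term, "no relevant information found")

    def mediate(qtype: str, arg: str, kb: ToyKB) -> str:
        handlers = {"what-is": kb.describe}  # a real system handles more types
        handler = handlers.get(qtype)
        return handler(arg) if handler else "unsupported question type"

    print(mediate("what-is", "mitochondrion", ToyKB()))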
ASKME has two advantages when compared to existing systems. First, an
ASKME user can query a variety of unfamiliar knowledge bases by posing questions
using simplified English. Second, because ASKME does not require users to formu-
late their questions in terms of the underlying knowledge base, the same questions
can continue to be answered on a variety of knowledge bases having different ontologies.
6.1.3 An Application of ASKME for a system answering AP-like
questions
ASKME was studied in the context of Project Halo. A goal of Project Halo is to
develop a knowledge based question answering system capable of answering ques-
tions posed by untrained non-experts, using knowledge bases built by subject matter
experts in a variety of domains [114]. This creates a substantial challenge for auto-
mated question answering systems: answering questions that are formulated without
regard for (or even familiarity with) the knowledge base that is expected to answer
them. A question answering system has successfully addressed this challenge if it
can be coupled with any of a variety of knowledge bases, each with its own inde-
pendently built ontology, and it can answer questions without requiring users to
reformulate the questions for each knowledge base that is used.
I studied ASKME as part of the larger AURA system developed by a team of
researchers to achieve the goals of Project Halo. The AURA system enables SMEs
to create knowledge bases using concept maps, equations and tables, all of which are
converted automatically to computational logic [25]. In addition, the AURA system
enables a different set of users, who have limited domain expertise or familiarity with
the knowledge base, to pose questions, like those found in Advanced Placement (AP)
exams, and to receive coherent answers and explanations.
I assessed ASKME’s performance on the task of answering questions like
those found in the AP exam. The question-answering task involved users posing
questions to the system to retrieve solutions from the knowledge base to answer a
set of AP-like questions. The set of knowledge bases used in the evaluation covered
portions of a college-level science textbook in biology, chemistry, and physics. The
question set used in the evaluation covered a portion of an AP exam and matched the
syllabus of the knowledge base. The Project Halo team chose the AP test as an
evaluation criterion because it is a widely-accepted standard for testing whether a
person has understood the content of a given subject. The team also chose the do-
mains of college level biology, chemistry, and physics because they are fundamental,
hard sciences and they stress different kinds of representations.
The evaluation consists of a series of experiments to test if ASKME can help
novice users to use unfamiliar knowledge bases for problem solving. The initial
experiment measures ASKME’s level of performance under ideal conditions where
the knowledge base is built and used by the same knowledge engineers. Successive
experiments measure ASKME’s level of performance under increasingly realistic
conditions. Ultimately in the final experiment, I measure ASKME’s level of per-
formance under conditions in which the knowledge base is independently built by
subject matter experts and its users are a different group of novice users who are un-
familiar with the knowledge base. Results from various experiments show ASKME
to work well on different knowledge bases and to answer a broad range of questions
that were posed by novice users in a variety of domains.
6.2 Future Work
The implementation of ASKME demonstrates that a system consisting of a restricted
English, a domain-neutral ontology, and a set of mechanisms to handle a small set
of well-known question types can be effective at helping users achieve a high level of
performance at querying unfamiliar knowledge bases. I next discuss some issues that
must be addressed for ASKME to be extended into a production system. Finally, I
describe a use of ASKME beyond the particular application of answering AP-like
questions.
6.2.1 Identifying unstated assumptions
Besides missing knowledge or bad question interpretations, another reason that
ASKME fails to answer questions is its failure to identify relevant information in
the knowledge base for problem solving. This happens when assumptions, expected
by the knowledge base being queried, are not stated by the user or captured by
ASKME processing. These assumptions often relate information in a question to a
problem-solving model or are default values. Examples include relating the height
of a building to the initial-y-position of a Fall from the building or installing 0
meters as the value for the initial-position of a Move starting from rest [85].
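As a sketch of the kind of rule involved (the rule content restates the examples
just given; the code structure itself is hypothetical):

    # If a Move is described as starting from rest and no initial position is
    # given, install 0 meters as the default initial-position, per the example
    # above; an analogous rule would map a building's height to the
    # initial-y-position of a Fall from the building.
    def install_defaults(move: dict) -> dict:
        if move.get("starts-from-rest") and "initial-position" not in move:
            move["initial-position"] = 0.0  # meters
        return move

    print(install_defaults({"starts-from-rest": True}))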
It is a difficult task for novice users to recognize missing assumptions or
to steer ASKME into using specific information in the knowledge base. Doing so
requires detailed knowledge of the internal workings of the system and the contents
of the knowledge base being queried. I plan to improve the question-answering
performance of ASKME by investigating mechanisms for ASKME to perform an
automatic failure analysis when it fails to answer a question [79, 97]. The analysis
will query different problem-solving mechanisms (e.g., the deductive reasoner, the
analogical reasoner, or the equation solver) to identify promising sub-goals to help
users install necessary assumptions in their question formulations and to guide the
problem-solving process.
6.2.2 Sanity Checking
Ideally, ASKME will have the ability to check the sanity of an answer. This can be
in the form of simple heuristics to quickly check if an answer makes sense. More
elaborate approaches involve querying large commonsense knowledge bases [69] or
integrating a domain-specific qualitative reasoner to gather “ballpark” answers [89].
Another approach is to query the Internet for evidence to determine if the derived
answer is reasonable [7].
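A minimal sketch of the simple-heuristics variant, with illustrative bounds that
are assumptions rather than ASKME rules:

    # Range heuristic: flag numeric answers that fall outside a plausible
    # interval for the quantity being asked about.
    PLAUSIBLE = {
        "speed_m_per_s": (0.0, 343.0),  # assume sub-sonic AP-style problems
        "mass_kg": (1e-30, 1e6),
    }

    def sane(quantity: str, value: float) -> bool:
        low, high = PLAUSIBLE[quantity]
        return low <= value <= high

    print(sane("speed_m_per_s", 15.2))   # True: plausible
    print(sane("speed_m_per_s", 9.8e6))  # False: likely a unit or model error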
6.2.3 Debug Tool
Helping users determine why ASKME returned an incorrect answer will enable them
to improve the knowledge base and fix problematic question formulations. I have
developed a prototype debugging facility to help knowledge engineers debug the
causes for incorrect answers. The prototype debugger helps the user to:
1. Observe the pieces of knowledge selected by ASKME
2. Examine why a piece of knowledge is selected or rejected by ASKME
3. Examine how a piece of knowledge is used by ASKME
4. Explore the question representation examined by ASKME
5. Repair the question representation examined by ASKME
Helping novice users debug the cause of incorrect answers will require the
debugger to clearly show the domain and control knowledge used by ASKME.
Towards this end, I plan to enhance the debugger to support the common human
problem solving strategy of decomposing a problem into simpler parts and then
solving these parts individually [83, 91], allowing novice users to specify subgoals
and inspect the results in order to check the inter-
mediate problem solving steps. Ideally, the proposed debug tool will help ASKME
users to gain a deeper understanding of why questions fail to answer correctly and
will reduce the time required to debug an incorrectly answered question.
6.2.4 Episodic Memory and Transfer Learning
Humans make use of past experiences to improve both performance and compe-
tence. The problem-solving experience and assumptions identified in solving other
questions provide guidance when new questions are attempted. I believe leverag-
ing past problem-solving experience can improve scalability and allow ASKME to
answer additional questions that it could not answer previously. A generic episodic
memory has been integrated into ASKME to improve problem solving performance
by learning and transferring control knowledge in solving AP-like physics questions
[108]. A separate AP physics question answering system by [63] demonstrated learn-
ing and transfer of domain knowledge to answer unseen questions by relating them
to previously answered questions. The results from both efforts are encouraging
and merit further investigation on approaches to identify the types of control and
domain knowledge that can be learned and transferred when ASKME attempts a
variety of problems in different domains.
6.2.5 Machine Reading Application
Machine reading automatically constructs a knowledge base by processing English
text [36]. Due to the size and complexity of automatically generated knowledge
bases, it may be difficult or tedious for users to understand the contents and orga-
nization of the learned knowledge well enough to query them. ASKME can be used
to pose and receive answers to questions with automatically generated knowledge
bases. I have integrated ASKME into the machine reading system described in [10].
The results have been encouraging. In the future, I plan to further improve and
evaluate the performance of ASKME in a variety of machine reading applications.
Bibliography
[1] ASD Simplified Technical English. Technical report, Aerospace and Defence
Industries Association of Europe, 2005. Specification ASD-STE100.
[2] Liane Acker and Bruce W. Porter. Extracting viewpoints from knowledge
bases. In National Conference on Artificial Intelligence, pages 547–552, 1994.
[3] Jan S. Aikins. Prototypical knowledge for expert systems. Artificial Intelli-
gence, (20):163–210, 1983.
[4] I. Androutsopoulos, G.D. Ritchie, and P. Thanisch. Natural language inter-
faces to databases – an introduction. Journal of Natural Language Engineer-
ing, 1(1):29–81, 1995.
[5] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common?
Extracting semantics from Wiki content. Lecture Notes in Computer Science,
4519:503, 2007.
[6] S Auer, C Bizer, J Lehmann, G Kobilarov, R Cyganiak, and Z Ives. DBpedia:
A Nucleus for a Web of Open Data. In In Sixth International Semantic Web
Conference, Busan, Korea, pages 11–15. Springer, 2007.
[7] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead,
and Oren Etzioni. Open information extraction from the web. In Proceedings
of the 20th International Joint Conference on Artificial Intelligence, 2007.
[8] Ken Barker, Bruce Porter, and Peter Clark. A library of generic concepts for
composing knowledge bases. In Proceedings of the First International Confer-
ence on Knowledge Capture, 2001.
[9] Ken Barker, Shaw Yi Chaw, James Fan, Bruce Porter, Dan Tecuci, Peter
Yeh, Vinay K. Chaudhri, David Israel, Sunil Mishra, Pedro Romero, and Pe-
ter E. Clark. A question-answering system for AP chemistry: Assessing KR&R
technologies. In Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the Ninth International Conference, 2004.
[10] Ken Barker, Bhalchandra Agashe, Shaw-Yi Chaw, James Fan, Noah Fried-
land, Michael Glass, Jerry Hobbs, Eduard Hovy, David Israel, Doo-Soon Kim,
Rutu Mulkar, Sourabh Patwardhan, Bruce Porter, Dan Tecuci, and Peter Z.
Yeh. Learning by reading: A prototype system, performance baseline and
lessons learned. In Proceedings of the Twenty-Second National Conference on
Artificial Intelligence, 2007.
[11] John A. Bateman, Renate Henschel, and Fabio Rinaldi. Generalized Upper
Model 2.0: documentation. Technical report, GMD/Institut fur Integrierte
Publikations- und Informationssysteme, Darmstadt, Germany, 1995.
[12] I.L. Beck. Questioning the Author: An Approach for Enhancing Student En-
gagement with Text. Order Department, International Reading Association,
800 Barksdale Road, PO Box 8139, Newark, DE 19714-8139, 1997.
[13] Abraham Bernstein and Esther Kaufmann. GINO - A Guided Input Natural
Language Ontology Editor. In Proceedings of the Fifth International Semantic
Web Conference, pages 144–157. Springer, November 2006.
[14] B.S. Bloom et al. Taxonomy of educational objectives, handbook I: Cognitive
domain, 1956.
[15] Ron J. Brachman and James G. Schmolze. An overview of the KL-ONE
knowledge representation system. Cognitive Science, 9:171–216, 1985.
[16] T.L. Brown, T.E. Brown, H.E. LeMay, B.E. Bursten, C. Murphy, and
P. Woodward. Chemistry: The Central Science. Pearson Education Inter-
national, 2008.
[17] B. G. Buchanan and E.H. Shortliffe. Rule-Based Expert Systems: The MYCIN
Experiments of the Stanford Heuristic Programming Project. Addison-Wesley,
Reading, MA, 1984.
[18] Alan Bundy, George Luger, Martha Palmer, and Robert Welham. Mecho:
Year one. In Proceedings of the Second AISB Conference, 1976.
[19] Alan Bundy, L. Byrd, George Luger, Chris Mellish, R. Milne, and Martha
Palmer. Mecho: a program to solve mechanics problems. Technical report,
Department of Artificial Intelligence, Edinburgh University, 1979. Working
paper 50.
[20] J. Burger, C. Cardie, Vinay Chaudhri, R. Gaizauskas, S. Harabagiu, David
Israel, C. Jacquemin, C.Y. Lin, S. Maiorano, G. Miller, et al. Issues, tasks
and program structures to roadmap research in question & answering (Q&A).
Document Understanding Conferences Roadmapping Documents, 2001.
[21] N.A. Campbell and J.B. Reece. Biology. Pearson Education International,
6th edition, 2001.
[22] B. Chandrasekaran. Towards a taxonomy of problem solving types. AI mag-
azine, 4(1):9, 1983.
[23] Vinay Chaudhri. Failure analysis report for the refinement phase. Technical
report, SRI International, 2009.
[24] Vinay Chaudhri and Richard Fikes. Question answering systems: Papers from
the 1999 fall symposium. Technical report, AAAI, 1999. FS-98-04.
[25] Vinay Chaudhri, Bonnie John, Sunil Mishra, John Pacheco, Bruce Porter,
and Aaron Spaulding. Enabling Experts to Build Knowledge-bases from Sci-
ence Textbooks. In Proceedings of the Fourth International Conference on
Knowledge Capture, 2007.
[26] P. Clark and P. Harrison. Boeing's NLP System and the Challenges of Semantic
Representation. In Semantics in Text Processing. STEP 2008 Conference
Proceedings, Venice, Italy. Citeseer, 2008.
[27] Pete Clark and John Thompson. Why is it hard to understand original english
questions? Technical report, Boeing Phantom Works, 2009. Working note 32.
[28] Pete Clark, Ken Barker, Bruce Porter, Vinay Chaudhri, Sunil Mishra, and
Jerome Thomere. Enabling domain experts to convey questions to a machine:
a modified, template-based approach. In Proceedings of the Second Interna-
tional Conference on Knowledge Capture, 2003.
[29] Peter Clark and Bruce Porter. KM - The Knowledge Machine: Ref-
erence manual. Technical report, University of Texas at Austin, 1998.
http://www.cs.utexas.edu/users/mfkb/km.html.
[30] Peter Clark, Phil Harrison, Tom Jenkins, John Thompson, and Rick Wojcik.
Acquiring and using world knowledge using a restricted subset of English.
In Proceedings of the 18th International FLAIRS Conference (FLAIRS’05),
2005.
[31] Peter Clark, Shaw-Yi Chaw, Ken Barker, Vinay Chaudhri, Phil Harrison,
James Fan, Bonnie John, Bruce Porter, Aaron Spaulding, John Thompson,
and Peter Z. Yeh. Capturing and Answering Questions Posed to a Knowledge-
Based System. In Proceedings of the Fourth International Conference on
Knowledge Capture, 2007.
[32] Paul R. Cohen, Robert Schrag, Eric K. Jones, Adam Pease, Albert Lin,
Barbara Starr, David Gunning, and Murray Burke. The DARPA high-
performance knowledge bases project. AI Magazine, 19(4):25–49, 1998.
[33] Halo 2 Evaluation Committee. The large-scale evaluations of Halo 2: Guiding
questions and recommended procedures, 2005.
[34] M. Conner. What a reference librarian should know. The Library Journal, 52
(8):415–418, 1927.
[35] EN Efthimiadis. Query expansion. Annual review of information science and
technology, 31:121–187, 1996.
[36] Oren Etzioni, Michele Banko, and Michael J. Cafarella. Machine reading. In
Proceedings of the Twenty-First National Conference on Artificial Intelligence.
AAAI Press, 2006.
[37] Brian Falkenhainer and Kenneth D. Forbus. Compositional modeling: finding
the right model for the job. Artificial Intelligence, 51(1-3):95–143, 1991.
[38] James Fan and Bruce W. Porter. Interpreting loosely encoded questions. In
Proceedings of the Nineteenth National Conference on Artificial Intelligence.
AAAI Press, 2004.
[39] James Fan, Ken Barker, and Bruce W. Porter. The knowledge required to
interpret noun compounds. In Proceedings of the Eighteenth International
Joint Conference on Artificial Intelligence. Morgan Kaufmann, 2003.
[40] James Fan, Ken Barker, and Bruce Porter. Indirect anaphora resolution as
semantic path search. In Proceedings of the Third International Conference on
Knowledge Capture, pages 153–160, New York, NY, USA, 2005. ACM Press.
[41] C. Fellbaum. WordNet: An Electronical Lexical Database. The MIT Press,
Cambridge, MA, 1998.
[42] D. Fensel, E. Motta, F. Van Harmelen, V.R. Benjamins, M. Crubezy,
S. Decker, M. Gaspari, R. Groenboom, W. Grosso, M. Musen, et al. The
unified problem-solving method development language UPML. Knowledge
and Information Systems, 5(1):83–131, 2003.
[43] Richard Fikes and Adam Farquhar. Distributed repositories of highly expres-
sive reusable ontologies. IEEE Intelligent Systems, 14(2):73–79, 1999.
[44] Noah Friedland, Paul Allen, Gavin Matthews, Michael Witbrock, Jon Cur-
tis, Blake Shepard, Pierluigi Miraglia, Angele Jurgen, Steffen Staab, Eddie
Moench, Henrik Oppermann, Dirk Wenke, David Israel, Vinay Chaudhri,
Bruce Porter, Ken Barker, James Fan, Shaw-Yi Chaw, Peter Yeh, Dan Tecuci,
and Peter Clark. Project Halo: Towards a Digital Aristotle. AI Magazine,
2004.
[45] Norbert E. Fuchs, Uta Schwertel, and Rolf Schwitter. Attempto Controlled
English (ACE) language manual, version 3.0. Technical report, University of
Zurich, 1999.
[46] David Genest and Michel Chein. An experiment in document retrieval using
conceptual graphs. In ICCS ’97: Proceedings of the Fifth International Con-
ference on Conceptual Structures, pages 489–504, London, UK, 1997. Springer-
Verlag.
[47] Douglas C. Giancoli. Physics: Principles with Applications. Prentice Hall, 5th
edition, 1998.
[48] Arthur C. Graesser and N.K. Person. Question asking during tutoring. Amer-
ican Educational Research Journal, 31(1):104, 1994.
[49] Arthur C. Graesser, N. Person, and J. Huber. Mechanisms that generate ques-
tions. Questions and information systems, pages 167–187, 1992.
[50] Arthur C. Graesser, Yasuhiro Ozuru, and Jeremiah Sullins. What is a good
question? In M. McKeown, editor, Festscrift for Isabel Beck. Erlbaum, Mah-
wah, NJ, 2009.
[51] B. F. Green, A. K. Wolf, C. Chomsky, and K. Laughery. Baseball: An auto-
matic question answerer. In B. J. Grosz, K. Sparck Jones, and B. L. Webber,
editors, Natural Language Processing, pages 545–549. Kaufmann, Los Altos,
CA, 1986.
[52] Thomas R. Gruber. A translation approach to portable ontology specifications.
Knowledge Acquisition, 5(2):199–220, 1993.
[53] Nicola Guarino, Claudio Masolo, and Guido Vetere. Ontoseek: Content-based
access to the web. IEEE Intelligent Systems, 14(3):70–80, 1999.
[54] SM Harabagiu, SJ Maiorano, and MA Pasca. Open-domain question answer-
ing techniques. Natural Language Engineering, 1:1–38, 2002.
[55] P. Harrison and M. Maxwell. A new implementation of GPSG. In Proc. 6th
Canadian Conf on AI, pages 78–83, 1986.
[56] E. Hovy, U. Hermjakob, and C.Y. Lin. The use of external knowledge in
factoid QA. NIST SPECIAL PUBLICATION SP, pages 644–652, 2002.
[57] T. W. C. Huibers, Iadh Ounis, and Jean-Pierre Chevallet. Conceptual graph
aboutness. In ICCS ’96: Proceedings of the 4th International Conference on
Conceptual Structures, pages 130–144, London, UK, 1996. Springer-Verlag.
[58] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, San Mateo, CA,
1993.
[59] Doo-Soon Kim and Bruce Porter. KLEO: A Bootstrapping Learning-by-
Reading System. In AAAI’09 Spring Symposium on Learning by Reading
and Learning to Read, 2009.
[60] Karen Kipper. VerbNet: A broad-coverage, comprehensive verb lexicon. PhD
thesis, University of Pennsylvania, 2005.
[61] R.I. Kittredge. Sublanguages and controlled languages. The Oxford Handbook
of Computational Linguistics, pages 430–447, 2003.
[62] Matthew Klenk. Using Analogy to Overcome Brittleness in AI Systems. PhD
thesis, Northwestern University, Evanstown, IL, 2009.
[63] Matthew Klenk and Ken Forbus. Measuring the level of transfer learning by
an AP Physics problem-solver. In Proceedings of the Twenty-Second National
Conference on Artificial Intelligence. AAAI Press, 2007.
[64] Kevin Knight and Steve K. Luk. Building a large-scale knowledge base for
machine translation. In AAAI ’94: Proceedings of the twelfth national confer-
ence on Artificial intelligence (vol. 1), pages 773–778, Menlo Park, CA, USA,
1994. American Association for Artificial Intelligence.
[65] K.L. Kwok, L. Grunfeld, N. Dinstl, and M. Chan. TREC 2001 question-
answering, web and cross language track experiments using PIRCS. In In-
formation Technology: The Tenth Text Retrieval Conference, TREC, NIST
Special Publication 500-250, 2001.
[66] G. Lakoff and M. Johnson. Metaphors We Live By. University of Chicago
Press, Chicago, 1980.
[67] D.B. Leake. Case-based reasoning: Experiences, lessons and future directions.
MIT Press Cambridge, MA, USA, 1996.
[68] Wendy Lehnert. The Process of Question Answering. PhD thesis, Yale Uni-
versity, New Haven, CT, 1977.
[69] Douglas B. Lenat and R. V. Guha. Building Large Knowledge-Based Systems.
Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1989.
[70] Mark T. Maybury. Question answering: An introduction. In New Directions
in Question Answering, pages 3–18. AAAI Press, 2004.
[71] John McDermott. R1: An expert in the computer systems domain. In Proceed-
ings of the First National Conference on Artificial Intelligence, pages 269–271.
AAAI Press, 1980.
[72] David McDonald, Alice Leung, David Getty, and Brett Benyo. Halo Phase II
Evaluation: Final report. Technical report, BBN Technologies, 2009.
[73] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets
rigorously defined common-sense. In Proceedings of Wikipedia and AI work-
shop at the AAAI-08 Conference. Chicago, US, July, volume 12, 2008.
[74] O. Medelyan, D. Milne, C. Legg, and I.H. Witten. Mining meaning from
Wikipedia. International Journal of Human-Computer Studies, 2009.
[75] Cynthia Matuszek, Michael Witbrock, Robert C. Kahlert, John Cabral,
Dave Schneider, Purvesh Shah, and Doug Lenat. Searching for common
sense: Populating Cyc from the web. In Proceedings of the 20th National
Conference on Artificial Intelligence, 2005.
[76] D. Moldovan, S. Harabagiu, M. Pasca, Rada Mihalcea, R. Goodrum, R. Girju,
and Vasile Rus. Lasso: A tool for surfing the answer net. NIST SPECIAL
PUBLICATION SP, pages 175–184, 2000.
[77] P.B. Mosenthal, H. Hall, and L. Green. Understanding the strategies of docu-
ment literacy and their conditions of use. Journal of Educational Psychology,
88(2):314–332, 1996.
[78] Erik T. Mueller. Natural Language Processing with ThoughtTreasure. Signi-
form, New York, USA, 1998.
[79] Ken Murray. Basic Problem Solver Module in AURA. Private communication,
2008.
[80] S H Myaeng and A Lopez-Lopez. Conceptual graph matching: A flexible
algorithm and experiments. Journal of Experimental and Theoretical Artificial
Intelligence, (4):107–126, 1992.
[81] V. Nastase and M. Strube. Decoding Wikipedia categories for knowledge
acquisition. In Proceedings of the AAAI, volume 8, 2008.
[82] Allen Newell and G. Ernst. The search for generality. In Proc. IFIP Congress
65, pages 17–24, 1965.
[83] Allen Newell and Herbert A. Simon. Human Problem Solving. Prentice-Hall,
Englewood Cliffs, NJ, 1972.
[84] NJ Nilsson. Principles of artificial intelligence, Tioga Pub. Co., Palo Alto,
CA, 1980.
[85] Gordon S. Novak. [Halo] Rules. Private communication, 2007.
[86] Gordon S. Novak. Computer understanding of physics problems stated in
natural language. American Journal of Computational Linguistics, 1976.
[87] Gordon S. Novak and William C. Bulko. Understanding natural language
with diagrams. In Proceedings of the Eighth National Conference on Artificial
Intelligence, 1990.
[88] Gordon S. Novak and Won H. Ng. Rule engine. Private communication, 2006.
[89] Praveen Paritosh. The heuristic reasoning manifesto. In Proceedings of the
20th International Workshop on Qualitative Reasoning, 2006.
[90] Aarati Parmar. The representation of actions in KM and Cyc. Technical
report, Stanford University, 2001.
[91] George Polya. How to solve it. Princeton University Press, New Jersey, USA,
1945.
[92] J. Pomerantz. A linguistic analysis of question taxonomies. Journal of the
American Society for Information Science and Technology, 56(7):715–728,
2005.
[93] S.P. Ponzetto and M. Strube. Deriving a large scale taxonomy from Wikipedia.
In Proceedings of the 22nd National Conference on Artificial Intelligence,
pages 1440–1445, 2007.
[94] Jonathan Poole and J. A. Campbell. A novel algorithm for matching con-
ceptual and related graphs. In In G. Ellis et al eds, Conceptual Structures:
Applications, Implementation and Theory, pages 293–307. Springer-Verlag,
LNAI, 1995.
[95] P Procter. Longman Dictionary of Contemporary English, 1978.
[96] Peter J. Pym. Simplified English and machine translation. Technical report,
Perkins Engines UK, 1990.
[97] Jeff Rickel and Bruce Porter. Automated modeling for answering prediction
questions: selecting the time scale and system boundary. In AAAI’94: Pro-
ceedings of the twelfth national conference on Artificial intelligence (vol. 2),
pages 1191–1198, Menlo Park, CA, USA, 1994. American Association for Ar-
tificial Intelligence.
[98] W.P. Robinson and S.J. Rackstraw. A question of answers, volume 1. Rout-
ledge & Kegan Paul Books, 1972.
[99] W.P. Robinson and S.J. Rackstraw. A question of answers, volume 2. Rout-
ledge & Kegan Paul Books, 1972.
[100] Roger C. Schank. Conceptual Dependency: A Theory of Natural Language
Understanding. Cognitive psychology, 3(4):552–631, 1972.
[101] Roger C. Schank. Explanation patterns: Understanding mechanically and cre-
atively. Lawrence Erlbaum Associates, 1986.
[102] Robert Schrag, M. Pool, Vinay Chaudhri, R. C. Kahlert, J. Powers, Paul R.
Cohen, J. Fitzgerald, and Sunil Mishra. Experimental evaluation of subject
matter expert-oriented knowledge base authoring tools. Technical report, In-
formation Extraction and Transport, Inc., August 2002. Proceedings of the
2002 PerMIS Workshop, August 13-15, 2002, NIST Special Publication 990,
pp. 272-279.
[103] Rolf Schwitter. English as a formal specification language. In DEXA ’02: Pro-
ceedings of the 13th International Workshop on Database and Expert Systems
Applications, pages 228–232, Washington, DC, USA, 2002. IEEE Computer
Society.
[104] Rolf Schwitter, Kaarel Kaljurand, Anne Cregan, Catherine Dolbear, and Glen
Hart. A comparison of three controlled natural languages for OWL 1.1. In
OWL: Experiences and Directions (OWLED), 2008.
[105] Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and
Wan Li Zhu. Open mind common sense: Knowledge acquisition from the
general public. In On the Move to Meaningful Internet Systems, 2002 -
DOA/CoopIS/ODBASE 2002 Confederated International Conferences DOA,
CoopIS and ODBASE 2002, pages 1223–1237, London, UK, 2002. Springer-
Verlag.
[106] John F. Sowa. Conceptual Structures: Information Processing in Mind and
Machine. Addison-Wesley, 1984.
[107] F.M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowl-
edge. In Proceedings of the 16th international conference on World Wide Web,
pages 697–706. ACM New York, NY, USA, 2007.
[108] Dan G. Tecuci. A Generic Memory Module for Events. PhD thesis, The
University of Texas at Austin, Austin, TX, 2007.
[109] Teknowledge Corporation. Rapid Knowledge Formation project, 2002.
http://reliant.teknowledge.com/RKF.
[110] John Thompson and Peter Clark. Guide for CPL users, version 7.5. Technical
report, The Boeing Company, 2006.
[111] M. Manuela Magalhaes A. Veloso. Learning by analogical reasoning in general
problem-solving. PhD thesis, Carnegie Mellon University, Pittsburgh, PA,
USA, 1992.
[112] Charles A. Verbeke. Caterpillar fundamental english. Training and Develop-
ment Journal, 27(2):36–40, February 1973.
[113] E.M. Voorhees. The TREC question answering track. Natural Language En-
gineering, 7(04):361–378, 2002.
[114] Vulcan Inc. Project Halo, 2003. http://projecthalo.com.
[115] Daniel S. Weld, Raphael Hoffmann, and Fei Wu. Using Wikipedia to bootstrap
open information extraction. SIGMOD Rec., 37(4):62–68, 2008.
[116] E. N. White. International language for servicing and maintenance. University
of Wales - Institute of Science and technology, 1974.
[117] Wikipedia. Wikipedia, the free encyclopedia, 2009. URL
http://en.wikipedia.org/.
[118] William A. Woods, R.N. Kaplan, and B.N. Webber. The Lunar Sciences Nat-
ural Language Information System: Final Report. Technical report, Bolt
Beranek and Newman Inc., 1972. BBN Report 2378.
[119] M. Montes-y-Gomez, A. Gelbukh, A. Lopez-Lopez, and R. Baeza-Yates. Flex-
ible comparison of conceptual graphs. In Proceedings of the 12th Interna-
tional Conference and Workshop on Database and Expert Systems Applica-
tions, pages 102–111. Springer, 2001.
[120] Peter Z. Yeh, Bruce W. Porter, and Ken Barker. Using transformations to
improve semantic matching. In Proceedings of the Second International Con-
ference on Knowledge Capture, 2003.
[121] Peter Z. Yeh, Bruce W. Porter, and Ken Barker. A unified knowledge based
approach for sense disambiguation and semantic role labeling. In Proceedings
of the Twenty-First National Conference on Artificial Intelligence, 2006.
[122] C. Zirn, V. Nastase, and M. Strube. Distinguishing between instances and
classes in the Wikipedia taxonomy. Lecture Notes in Computer Science, 5021:
376, 2008.
Vita
Shaw Yi Chaw was born in Singapore on December 6, 1977. He graduated from the
National University of Singapore with his Bachelor’s degree in Computer Science in
2002. Shaw Yi began his graduate studies in the Computer Sciences Department of
the University of Texas at Austin in 2003. He successfully graduated with a Ph.D.
degree in 2009. Shaw Yi will join IBM T.J. Watson Research Center to work on a
project whose goal is to beat the human champion in the Jeopardy! quiz show.
Permanent Address: 263, Bishan Street 22
#24-261
Singapore 570263
Singapore
This dissertation was typeset with LaTeX2e by the author.

LaTeX2e is an extension of LaTeX. LaTeX is a collection of macros for TeX. TeX is
a trademark of the American Mathematical Society.