Copyright
by
Shaw Yi Chaw
2009
The Dissertation Committee for Shaw Yi Chaw
certifies that this is the approved version of the following dissertation:
Addressing the Brittleness of Knowledge-Based
Question-Answering
Committee:
Bruce W. Porter, Supervisor
Kenneth J. Barker
Raymond Mooney
Gordon S. Novak Jr.
Art Markman
Addressing the Brittleness of Knowledge-Based
Question-Answering
by
Shaw Yi Chaw, BComp(Hons), MSCS
Dissertation
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
The University of Texas at Austin
December 2009
Dedicated to the memory of Bit Computer Services Pte Ltd, ACE 21, and ACJ
Computers Pte Ltd
Acknowledgments
A Ph.D. dissertation has one name on the front. However, behind this one person,
who takes all the credit, are dozens of other people: cajoling, encouraging, criticizing,
stimulating, and generally making sure that several years of alternating anguish and
excitement are eventually transformed into a dissertation. Many people contributed
to this dissertation, directly or indirectly, either by discussing the ideas contained
within, reading huge chunks, or helping me get the ASKME system to work correctly.
Many thanks are due to Bruce Porter who supervised my research work.
He provided timely encouragement and advice, carefully read and criticized numer-
ous documents and was always prepared to listen to my ideas. I have benefited
tremendously from his unique blend of energy, vision, technical insights, and prac-
tical sensibility. Most importantly, Bruce has that invaluable asset of the good
supervisor: an ever-open door.
A big set of thanks goes to Ken Barker. He was a joy to work with. Ken is also
a world class researcher and my work greatly benefited from discussions with him.
I thank him for providing so much practical advice and motivation, and for making sure
this dissertation saw the light of day.
I would like to thank Professor Gordon Novak, Professor Raymond Mooney,
and Professor Art Markman for taking on the responsibilities of serving on my
dissertation committee and, despite their busy schedules, carefully reading my thesis
for errors in technical details, English grammar, and style.
Several pieces of technology used in studying my thesis were developed prior
to my arrival at the university, or by collaborators outside the university. Particular
thanks here are due to Bruce Porter, Ken Barker, Gordon Novak, Art Souther, Peter
Clark, John Thompson, Phil Harrison, Rick Wojcik, Tom Jenkins, Vinay Chaudhri,
Ken Murray, Sunil Mishra, John Pacheco, James Fan, Peter Yeh, and Dan Tecuci.
I enjoyed the intellectual environment UTCS harbors and appreciate the
interactions with other students and researchers. Many fine folks at UTCS have
taught me, by questioning as well as by example, how to do research.
The Knowledge Systems Group has been a great working environment, and
I particularly appreciate the friendship and many thought-provoking discussions I
have enjoyed with my colleagues: James Fan, Peter Yeh, Dan Tecuci, Michael Glass,
and Doo-Soon Kim.
There are many more friends I have made at UTCS who gave me help and
support but are not mentioned here. I would like to take this opportunity to thank
all of them.
During my time at UTCS, I have had opportunities to know and interact
with many staff members, who have enriched my experience and made it unique. I
would like to thank the friendly administrative staff who helped make things go smoothly:
Gloria Ramirez, Katherine Utz, and Lydia Griff. I would also like to thank the technical
staff of UTCS for providing wonderful computing facilities and software.
I would especially like to thank Stacy Miller. She’s the best! Stacy has boundless
energy and is always positive! She gives me endless moral support, laughs at all my
jokes, and is always helpful. Stacy provided pastoral care and is my moral compass.
Stacy also sold me my first car, a 1980, lemon yellow, Toyota Celica! It was a
reliable, awesome, and fun car to drive around Austin, TX! Driving that car made
me look cooler than I really was! I would also like to acknowledge that Stacy played
a major role in naming my system ASKME (we have the picture to prove it!).
My parents and sister have been a huge source of emotional and financial
support. Despite the fact that I am thousands of miles away from them, my welfare
has always been foremost in their minds. I want to say a big thank you to them for
all the sacrifices they have made over the years in order to see me succeed in my
quest for a world-class research career.
I would like to thank my dad, Chaw Shing Cheong, for all his hard work at
supporting the family, and getting me my first home computer. Besides letting me
do real work, that computer allowed me to play SimCity for days on end. I thank
him for believing in me.
I would especially like to thank my mom, Lek Kim Noi, for introducing me
to my first computer in 1986. My mom taught me how to turn on a computer,
use DOS, play computer games, and, subsequently, how to perform DIY work on
computers. She also greatly supported my curiosity by getting me Internet access at
the earliest opportunity. Without my mom, I would not have spent so many hours
of my life sitting in front of a computer monitor. I greatly appreciate her patience
and dedication in getting me access to computers and software at such a young age.
I would like to thank my sister, Chaw Shaw Lin, for her unwavering support
and practical advice. She has supported me financially time and again. I greatly
appreciate all her help! I would especially like to thank her for generously paying
for my MacBook, doctoral regalia, and car repairs.
I would like to thank my aunt, Chaw Suat King, for letting me access her
bookstore (Computer Book Center Pte Ltd) as a private library. The countless
afternoons and weekends reading technical books and magazines greatly influenced
my decision to study Computer Science.
I would like to thank my uncle, Chaw Kiang, for the many Saturday af-
ternoons we spent together, pounding away at the computer keyboard inside Bit
Computer Services Pte Ltd.
I would like to thank my grandmother, Chng Chian Kim, for her love and
support. She has always been a good friend. Since I was young, my grandmother has
cheered me on from the sidelines, be it at a kindergarten school play or my decision to
attend graduate school in Austin, TX. I miss the wonderful times we spent chatting.
I would like to thank my professors at the National University of Singapore,
especially Khoo Siau-Cheng, for providing me with an excellent undergraduate
education and for initiating me into research.
Support for this research was provided by Vulcan Inc. as part of Project
Halo, SRI International as part of Learning by Reading, IBM Corporation, and by
the Boeing Company.
I regret that my grandfather, Chaw Boon Se, did not live to see where I am
today. I would like to have shared with him these exciting years spent working on
my doctoral degree.
Shaw Yi Chaw
The University of Texas at Austin
December 2009
Addressing the Brittleness of Knowledge-Based
Question-Answering
Publication No.
Shaw Yi Chaw, Ph.D.
The University of Texas at Austin, 2009
Supervisor: Bruce W. Porter
Knowledge base systems are brittle when the users of the knowledge base
are unfamiliar with its content and structure. Querying a knowledge base requires
users to state their questions in precise and complete formal representations that
relate the facts in the question with relevant terms and relations in the underlying
knowledge base. This requirement places a heavy burden on the users to become
deeply familiar with the contents of the knowledge base and prevents novice users
from effectively using the knowledge base for problem solving. As a result, the utility
of knowledge base systems is often restricted to the developers themselves.
The goal of this work is to help users, who may possess little domain exper-
tise, to use unfamiliar knowledge bases for problem solving. Our thesis is that the
difficulty in using unfamiliar knowledge bases can be addressed by an approach that
funnels natural questions, expressed in English, into formal representations appro-
priate for automated reasoning. The approach uses a simplified English controlled
language, a domain-neutral ontology, a set of mechanisms to handle a handful of
well-known question types, and a software component, called the Question Mediator, to
identify relevant information in the knowledge base for problem solving. With our
approach, knowledge base users can use a variety of unfamiliar knowledge bases
by posing their questions in simplified English to retrieve relevant information in
the knowledge base for problem solving.
We studied the thesis in the context of a system called ASKME. We evaluated
ASKME on the task of answering exam questions for college level biology, chemistry,
and physics. The evaluation consists of successive experiments to test if ASKME can
help novice users employ unfamiliar knowledge bases for problem solving. The initial
experiment measures ASKME’s level of performance under ideal conditions, where
the knowledge base is built and used by the same knowledge engineers. Subsequent
experiments measure ASKME’s level of performance under increasingly realistic
conditions. In the final experiment, we measure ASKME’s level of performance
under conditions where the knowledge base is independently built by subject matter
experts and the users of the knowledge base are a group of novices who are unfamiliar
with the knowledge base.
Results from the evaluation show that ASKME works well on different knowl-
edge bases and answers a broad range of questions that were posed by novice users
in a variety of domains.
Contents
Acknowledgments
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 The ASKME Approach
1.3 Project Halo
1.4 Summary of Evaluation and Results
1.5 Organization of Dissertation
Chapter 2 Technical Challenges
2.1 Introduction
2.2 Linguistic processing challenges
2.3 Resolving Representational Differences
2.4 Relevance Reasoning
2.5 Summary
Chapter 3 Related Work
Chapter 4 Approach
4.0.1 Restricted English
4.0.2 Domain-Neutral Ontology
4.0.3 Question Taxonomy
4.0.4 Question Mediator
4.1 The ASKME Prototype
4.1.1 Computer Processable Language
4.1.2 The Component Library
4.1.3 Supported questions
4.2 The Design of the Question Mediator
4.2.1 A search controller to find information to answer questions
4.2.2 Resolving Representational Differences
4.2.3 Relevance Reasoning
4.2.4 Block diagram and pseudocode
4.3 Related work
4.3.1 Restricted English
4.3.2 Available Ontologies
4.3.3 Question categories
4.3.4 Question mediator
4.4 Summary
Chapter 5 Evaluation
5.1 Introduction
5.2 Experiment #1: Establishing ASKME’s performance under ideal conditions
5.2.1 Results
5.3 Experiment #2: Brittleness analysis
5.3.1 Results
5.3.2 Failure Categories
5.3.3 Discussion
5.4 Experiment #3: Can questions continue to answer with different knowledge bases?
5.4.1 Results
5.5 Experiment #4: Can users pose questions using ASKME to successfully query knowledge bases with which they are unfamiliar?
5.5.1 Results
5.6 Experiment #5: Establishing ASKME’s performance under production conditions
5.6.1 Results
5.7 Summary
Chapter 6 Contributions and Future Work
6.1 Contributions
6.1.1 Prior art
6.1.2 A General Framework for querying unfamiliar knowledge bases
6.1.3 An Application of ASKME for a system answering AP-like questions
6.2 Future Work
6.2.1 Identifying unstated assumptions
6.2.2 Sanity Checking
6.2.3 Debug Tool
6.2.4 Episodic Memory and Transfer Learning
6.2.5 Machine Reading Application
Bibliography
Vita
List of Tables
1.1 Differences between two successful types of knowledge-base systems: rule-based expert systems and Wikipedia
3.1 Summary of notable knowledge-based systems and the challenges addressed by them
4.1 Some of the known properties of restricted English that are important to making natural language input easier to process by computers (originally produced by Kittredge[61]; the table is reproduced here for convenience)
4.2 Example CPL sentences, reproduced for convenience verbatim from Clark et al.[30]
4.3 Users follow a set of guidelines while writing CPL sentences. Some of the guidelines are stylistic recommendations to reduce ambiguity, while others are firm constraints on vocabulary and grammar. These guidelines are taken verbatim from Clark et al.[30].
4.4 For convenience, we reproduce verbatim the question categories, abstract specifications, and associated examples from the Graesser-Person question classification scheme[49]
4.5 Summary of how each question type can be represented as query triples
4.6 Available question categories in QUALM, Graesser-Person, and ASKME. For convenience, we reproduce verbatim the sample questions previously described in QUALM[68] and the Graesser-Person question taxonomy[49].
5.1 Four knowledge bases were used in the evaluation. The knowledge bases for each science domain were created by knowledge engineers working alongside subject matter experts. The significantly larger multidomain knowledge base was created by concatenating the contents of the individual domain knowledge bases.
5.2 Summary of the differences between the regular version of ASKME, ASKME-W/O-QM, and ASKME-BFS-QM. Relevance reasoning is not applicable to ASKME-W/O-QM because it does not search the knowledge base for information to extend a question for problem solving.
5.3 Correctness scores achieved by knowledge engineers posing questions using ASKME on the reference knowledge base
5.4 Effect of the question mediator and xform rules on answering questions with the reference knowledge base
5.5 The average, median, 75th percentile, 90th percentile, and maximum number of states explored by both versions of the system (with and without relevance reasoning) for both domain knowledge bases and the larger multi-domain knowledge base.
5.6 The average, median, 75th percentile, 90th percentile, and maximum runtime performance (seconds) of both versions of the system (with and without relevance reasoning) for both domain knowledge bases and the larger multi-domain knowledge base. In some cases, the heuristics used by ASKME contributed to worse runtime performance, but they were necessary to maximize correctness scores.
5.7 The correctness scores for versions of the system with and without relevance reasoning. Both versions achieved similar correctness scores on the domain knowledge bases, indicating that relevance reasoning did not sacrifice correctness. The version of the system without relevance reasoning recorded lower correctness scores when answering physics questions with the significantly larger multi-domain knowledge base. This is due to the large number of states explored during blind search and our evaluation setup aborting an attempt after a time bound is reached. This result highlights the need for the system to select only the most relevant portions of the knowledge base to reason with.
5.8 Correctness scores achieved by knowledge engineers on the unseen question set
5.9 Failure analysis of why biology questions in the unseen question set fail to answer
5.10 Failure analysis of why chemistry questions in the unseen question set fail to answer
5.11 Failure analysis of why physics questions in the unseen question set fail to answer
5.12 The knowledge bases used in the study of ASKME’s ability to answer questions using a variety of knowledge bases that differ in content and organization. Aside from the reference knowledge bases, which were created by knowledge engineers working closely with subject matter experts, the other knowledge bases were independently authored by subject matter experts.
5.13 Correctness scores when reference question formulations are attempted on different KBs independently built by subject matter experts.
5.14 Answer credit scores observed in experiment #4’s question-answering evaluation
5.15 Aggregated and average correctness scores achieved by novice users in experiment #4
5.16 Data on the number of formulations and time spent per question by novice users in experiment #4
5.17 Answer credit scores achieved by knowledge engineers and novice users on knowledge bases independently authored by subject matter experts.
5.18 Aggregated and average correctness scores achieved by novice users in experiment #5
List of Figures
2.1 Two questions answered using different viewpoints of the Plant concept shown in Figure 2.2.
2.2 The Plant concept and its two viewpoints
2.3 Questions having multiple answers using different information contexts in the knowledge base.
2.4 The question representation contains two modeling differences from the knowledge base. First, the question was described as a single Move event, while the knowledge base contains a richer representation, in which the Move is caused by an Exert-Force. Second, the query in the question is on the force slot, while the necessary equations to answer the question are stored on the net-force slot in the knowledge base.
2.5 This question highlights the need to automatically install unstated assumptions commonly present in questions. The sentences in version 2 that are rendered in italics introduce commonly assumed values for the initial y speed, initial x position, and final y position.
2.6 This example highlights the difficulty of finding relevant information in the knowledge base to return correct answers. Two answers to the question are shown. Both answers were returned by the problem solver using the same knowledge base. One of them is incorrect because it assumed the acceleration to be 9.8 m/s^2 instead of -9.8 m/s^2.
2.7 Using a typical biology knowledge base, the problem solver describes Lysosomes as playing the role of a container. Ideally, the returned answer should describe the specific container role played by Lysosomes (rendered in italics).
3.1 Different generations of knowledge-based systems and their contributions to question-answering
4.1 A graphical table of contents for the detailed example given in Figures 4.2 and 4.3. ASKME answers the user’s question in three steps: interpreting the question using the CPL processor, selecting domain knowledge to answer it using the question mediator, and generating an explanation.
4.2 In panel 1, a physics question is posed to the system in simplified English. The system interprets the question as shown in panel 2. The scenario and query of the question are interpreted as a Move event on an Object having mass 80 kg. The initial and final velocities of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same object as that of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question mark. ASKME’s processing continues in Figure 4.3.
4.3 The continuation of the example from Figure 4.2. In panel 3, the question mediator draws in information from the knowledge base. The final answer and explanation are shown in panel 4.
4.4 The user poses questions in restricted English. CPL tries to understand the question. The CPL interpreter provides reformulation advice if it detects CPL errors. Otherwise, it presents the system’s understanding of the question both as an English paraphrase and graphically for the user to check whether CPL understood the question correctly. This figure is courtesy of Peter Clark at Boeing Phantom Works.
4.5 On the left-hand side, the scenario and query of the question are represented as a Move event on an Object having mass 80 kg. The initial and final velocities of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same as that of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question mark. To answer the question, the question mediator has to draw in information from the knowledge base. As shown on the right-hand side, information from the Motion-under-force and Motion-with-constant-acceleration concepts is used to elaborate the question for the reasoner to compute the answer.
4.6 Search graph created by the question mediator to answer the question introduced in Figure 4.5. Each state in our state-space tree contains a minikb. The initial state in the tree represents the original minikb for the question. The minikbs in other states are elaborations of the original minikb. Each operator in the tree describes how a concept in the knowledge base can be applied to elaborate one minikb to produce another.
4.7 The minikbs for states 1, 2, and 5 in the search graph in Figure 4.6. State 1, the initial state in the search graph, contains the original minikb for the question. State 2 contains the minikb created by elaborating the minikb in state 1 using the Motion-with-constant-acceleration concept. This elaboration introduced an equation to calculate the acceleration of the Move event. Using this equation, the acceleration of the Move was computed to be 14.45 m/s^2. Further elaborating the minikb in state 2 using the Motion-under-force concept results in the minikb in state 5. This minikb contains the equations that compute the net-force causing the Move to be -1156 Newtons.
4.8 The semantic matcher identified the matching features (highlighted in bold) between state 1 on the left-hand side and the Motion-with-constant-acceleration concept on the right-hand side. In this case, the result of semantic matching becomes Operator A, relating state 1 to state 2 in the search graph shown in Figure 4.6.
4.9 The elaborated minikb after applying Operator A in the search graph (Figure 4.6). Panel 1 shows the minikb of state 1 and panel 2 shows the Motion-with-constant-acceleration concept. The overlapping features found by semantic matching between the minikb in state 1 and Motion-with-constant-acceleration are highlighted in bold in panels 1 and 2. These overlapping features are joined to form the minikb in panel 3. The new piece of knowledge introduced by the concept Motion-with-constant-acceleration is highlighted in bold in panel 3. This minikb forms state 2 of the search graph (Figure 4.6).
4.10 The application of a transformation rule to include an additional relation between Human-Cell and Chromosome in the richer representation. This particular rule encodes the transitivity of the has-part relation.
4.11 The original set of query triples to answer “What is the function of lysosomes?” is expanded to include related queries on the Agentive roles. This, in turn, retrieves additional details from the knowledge base to answer the question.
4.12 Answer differences when additional queries are used or not used to retrieve finer-grained information.
4.13 The semantic matcher did not identify any matching features between state 1 (Figure 4.6) on the left-hand side and the Circle concept on the right-hand side. In this case, no operator is created.
4.14 Information from Fall-from-rest is added to state 1 of Figure 4.6, resulting in an inconsistency in which the initial velocity of the Move has multiple values: 17 m/s and 0 m/s.
4.15 Different degrees of match between state 1 in Figure 4.6 and the Motion-under-force and Two-Dimensional-Move concepts in the knowledge base. The match with Motion-under-force is stronger because a larger portion of Motion-under-force matches state 1. Thus, Operator B is preferred over Operator C in Figure 4.6.
4.16 Block diagram of the question mediator.
4.17 The pseudocode for the question mediator is given in Figures 4.17-4.21. The high-level structure of the algorithm is best-first search. This figure gives a standard definition of best-first search by Nilsson[84]. Figures 4.18, 4.19-4.20, and 4.21 show the instantiations of steps 6, 7, and 9 (respectively) that describe the question mediator.
4.18 Pseudocode for step 6 of Figure 4.17
4.19 Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (part 1/2)
4.20 Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (part 2/2)
4.21 Pseudocode for step 9 of Figure 4.17
5.1 Answer credit distributions among novice users and the knowledge engineer in experiment #4
5.2 Score distribution among novice users and the knowledge engineer in experiment #5
Chapter 1
Introduction
Knowledge base systems enable non-experts to perform tasks that have tradition-
ally required experts. However, these systems are brittle and expensive to build.
The domain knowledge has to be formalized into logic to support problem solving. This
knowledge acquisition task is difficult and error-prone, particularly for domain ex-
perts who have little training in knowledge representation. Knowledge engineers
usually work closely with domain experts to create the knowledge base. The result-
ing knowledge base is complex and engineered to perform well on specific tasks, but
is ill-suited to more general use.
I would like to reduce the cost of knowledge acquisition and enhance the
utility of the knowledge base. In the ideal system, domain experts create knowledge
bases using familiar notations in their domains. These domain-specific representa-
tions are then converted into computational logic for problem solving. The resulting
knowledge base may then be employed by users with little training to perform a va-
riety of tasks.
Achieving this ideal requires overcoming the brittleness in automated reason-
ing. This brittleness is due to the arms-length separation between the people who
built the knowledge base and the users who will eventually use it. First, the domain
experts building the knowledge base cannot anticipate all the tasks it will be used
for. Thus, they cannot engineer their representations to support problem solving for
these unanticipated tasks. Second, the users of the knowledge base are unfamiliar
with the contents of the knowledge base. Therefore, the users face difficulties in
working with the knowledge base for problem solving.
Learning to use an unfamiliar knowledge base is similar to learning to commu-
nicate in a new language. A non-native speaker visiting a foreign-speaking country
has to learn the vocabulary, the grammar, and the pragmatics of the foreign lan-
guage to communicate effectively. Likewise, a knowledge base user has to learn the
terms and relations in the knowledge base, the valid constructions for those terms
and relations, and identify the most appropriate representation in the knowledge
base for problem solving.
A steep learning curve confronts both expert and novice knowledge base
users, and is especially pronounced for users with limited domain expertise. Knowl-
edge base users who lack domain expertise often convey their intent in long descrip-
tions composed of general terms and relations. For example, a patient who lacks
medical expertise typically describes his ailment to the doctor in long descriptions
using general concepts familiar to himself and the doctor. His description may also
contain irrelevant or even incorrect information. To treat the patient, the doctor
applies his medical expertise to recognize how different ailments fit the described
symptoms. The difficulty faced by the patient to access relevant medical expertise
possessed by the doctor is analogous to the difficulties faced by a knowledge base
user with limited domain expertise. Making a best effort, the user describes his intent
in long descriptions, using only general terms in the knowledge base. However,
his description is unlikely to identify specific information in the knowledge base for
problem solving.
This work describes the design and evaluation of a system, called ASKME (short for “Automated Selection of Knowledge in a Mediated Environment”),
that is intended to bridge the gap between the formal representations of knowledge
bases and the automated reasoners attempting to use those representations to an-
swer questions. The described approach helps users with different levels of domain
expertise to use unfamiliar knowledge bases for problem solving. ASKME’s algo-
rithms take advantage of features of the generic knowledge representation language,
but are independent of the specific content and organization in these knowledge
bases. I note that all of the domain knowledge and inference methods are in the
knowledge bases and that ASKME’s role is to interpret the user’s question and
identify the necessary information in the underlying knowledge base for problem
solving.
The first section describes two successful knowledge base systems and their
design tradeoffs. In an ideal system, the knowledge base is built independently
by subject matter experts, and it can be used by a different group of users who
possess limited expertise in the subject matter for problem solving. This creates
a substantial challenge for automated question answering systems: to be able to
answer questions that are formulated without regard for (or even familiarity with)
the knowledge base that is expected to answer them. This ideal knowledge base system
has yet to be built due to the difficulty faced by users with limited domain expertise
to use unfamiliar knowledge bases for problem solving.
The second section describes in detail the goal of this dissertation. This
work aims to help users, who may possess little domain expertise, to use unfamiliar
knowledge bases for problem solving. My thesis is that the difficulty in using unfa-
miliar knowledge bases can be addressed by an approach that consists of a version of
restricted English, a set of mechanisms to answer a handful of well-known question
types, and a software component to identify relevant information in the knowledge
base for problem solving. With my approach, knowledge base users can use a variety
of unfamiliar knowledge bases by posing their questions in simplified English
to retrieve relevant information in the knowledge base for problem solving.
The third section outlines my approach to studying the thesis. I studied the
thesis for the task of question answering in the context of a system called ASKME,
which is used to answer questions posed by users who are unfamiliar with the knowl-
edge base. I built a prototype version of ASKME to study and evaluate my approach.
The fourth section summarizes the results of my evaluation of ASKME. With
help from an independent evaluation team, I evaluated ASKME on the task of an-
swering AP-like exam questions (AP, or Advanced Placement, exams are nationally administered, college entry-level tests in the USA) in the domains of college level biology, chemistry,
and physics. I conducted successive experiments to test if ASKME can help novice
users solve problems with unfamiliar knowledge bases. The initial experiment mea-
sures ASKME’s level of performance under ideal conditions, where the knowledge
base is built and used by the same knowledge engineers. Successive experiments
measure ASKME’s level of performance under increasingly realistic conditions. In
the final experiment, I measure ASKME’s level of performance under conditions
where the knowledge base is built independently by subject matter experts and the
users of the knowledge base are novice users who are unfamiliar with its contents.
1.1 Background
Knowledge base users face the challenge of creating inputs for an automated rea-
soner to perform automated reasoning. The inputs have to be described using the
vocabulary and grammar of the knowledge base. They also have to identify rele-
vant information in the knowledge base for problem solving. Successfully creating
the inputs requires the user to have domain expertise and a deep understanding of
the knowledge base. This familiarity is difficult to acquire from the builders of the
knowledge base and also difficult to transfer to its intended users.
To overcome the difficulty facing users of unfamiliar knowledge bases, the
designers of knowledge base systems typically make a tradeoff between engineering
a small knowledge base for a narrowly defined, deep reasoning task and widening
the knowledge acquisition bottleneck to quickly create a large knowledge base for
shallow reasoning. Rule-based[17] and Wikipedia[117] systems are two successful
knowledge base systems that approach this tradeoff differently.
Rule-based systems are designed to automate tasks that have traditionally
required experts. However, these systems are expensive to build and they require
domain experts to work alongside knowledge engineers to create the knowledge base.
Acquiring the knowledge from experts to build the knowledge base is a difficult and
error-prone process that often results in knowledge bases that are engineered to
perform well only on specific tasks. The limited nature of rule-based systems has
three disadvantages. First, rule-based systems cannot be easily adapted to perform
unanticipated tasks because they lack the knowledge of “first principles” from which
to reason [69, 82]. Second, because the knowledge base is tailored for a particular
task, that knowledge is not easily reused or extended to perform other tasks within
the same domain [22, 42]. Third, the resulting knowledge bases are often usable
only by the developers themselves due to their complexity and brittleness[17].
By contrast, it is easy to build and use a Wikipedia knowledge base. Wikipedia
uses representations familiar to many users, such as text, images, and other unstruc-
tured elements. While Wikipedia is successfully used by a large user community to
search for relevant documents, it lacks the ability to solve problems at higher levels,
such as performing diagnosis or prediction tasks. For Wikipedia to have such capa-
bilities that rely on automated reasoning would require separate systems to convert
the unstructured content into computational logic [5, 6, 73, 74, 107]. Automating
this task is complex and difficult to do well. The conversion task requires sophis-
ticated natural language and image understanding systems [6, 75, 93, 122]. Once
converted, the generated knowledge bases often contain shallow information[73, 81,
93, 115], errors[74], and are too highly fragmented [10, 59] to be useful for auto-
mated reasoning. Moreover, they are large and complex, making it difficult for users to
become proficient in using them to perform automated reasoning.
Table 1.1 lists the differences between rule-based and Wikipedia systems.
Their differences convey the tension between engineering a knowledge base to au-
tomate a narrowly defined task that requires advanced expertise and widening the
knowledge acquisition bottleneck to quickly create knowledge bases for shallow rea-
soning.
Ideally, a knowledge base system has the following three properties:
• The knowledge base is built independently by domain experts.
• The system achieves an arms-length separation between the builders and the
users of the knowledge base.
• Users who lack domain expertise or familiarity with the knowledge base can
perform automated reasoning with the resulting knowledge base.
This knowledge base system is ideal from several perspectives, e.g., the re-
duced knowledge acquisition costs since domain experts can build the knowledge
base themselves without working alongside knowledge engineers. But more impor-
tantly, for purposes of this dissertation, the ideal system is less brittle and does
not require users to face a steep learning curve. Therefore the ideal knowledge base
system is easier to use and its application is not limited to the developers themselves.
The goal of this dissertation is to enable users who lack domain expertise to
use unfamiliar knowledge bases for problem solving.
                          Rule-based expert systems          Wikipedia

Purpose                   Automate tasks that have           Repository of a wide variety
                          traditionally required experts     of information

Knowledge representation  Structured, formal logic           Unstructured: natural language,
                                                             diagrams, equations, etc.

Inference                 Automated reasoning                Information retrieval techniques

Knowledge acquisition     Difficult                          Easy

Problem solving           Easy to achieve, e.g.,             Difficult to achieve; requires
                          deduction, heuristic               robust natural language and
                          classification, etc.               image understanding systems

Created by                Knowledge engineers working        Subject matter experts working
                          closely with subject matter        independently
                          experts

Used by                   Small user community,              Large user community with
                          typically the developers           different levels of domain
                          themselves                         expertise

Coverage                  Narrow; focused on                 Broad
                          particular tasks

Table 1.1: Differences between two successful types of knowledge-base systems: rule-based expert systems and Wikipedia
1.2 The ASKME Approach
My thesis is that the brittleness problem of knowledge-based question answering
can be addressed:
• Questions can be posed with little regard for the ontology of the knowledge
base
• Knowledge base users can use a variety of unfamiliar knowledge bases for
problem solving
• There can be an arms-length separation between the builders and the users of
the knowledge base.
ASKME’s algorithms take advantage of features of the generic knowledge
representation language, but are independent of the specific content and organization
in these knowledge bases. All of the domain knowledge and inference methods
necessary for problem solving are in the knowledge bases. ASKME’s role is to
interpret the user’s question and identify the necessary information in the underlying
knowledge base for problem solving.
I have built a prototype version of ASKME. The major features of ASKME
collectively function as a funnel (metaphorically) to map questions stated in unre-
stricted English into a formal representation that can be answered by a knowledge
base system. I have designed ASKME to be easy to learn and use, while being
robust enough to answer a variety of questions. I believe my engineering choices
should not burden the user with restrictions on how they can pose questions. With
ASKME, users formulate their questions using a restricted English vocabulary. The
questions are then interpreted by ASKME and represented in a logical form us-
ing general terms and relations from the knowledge base. ASKME then identifies
relevant information in the knowledge base to extend the question formulation for
problem solving.
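As a concrete illustration of this pipeline, consider the physics example developed in Chapter 4: an 80 kg object moving at 17 m/s is brought to rest over a distance of 10 m, and the question asks for the net force. Below is a minimal sketch, written in Python purely for readability, of the kind of logical form involved; the triple encoding and every identifier are illustrative assumptions, not ASKME's actual internal structures.

```python
# Illustrative only: one plausible triple encoding of the Chapter 4 physics
# question. The encoding and identifiers are assumptions for exposition.
scenario = [
    ("Move1",   "instance-of",      "Move"),         # a Move event...
    ("Move1",   "object",           "Object1"),      # ...on an object
    ("Object1", "mass",             "80 kg"),
    ("Move1",   "initial-velocity", "17 m/s"),
    ("Move1",   "final-velocity",   "0 m/s"),
    ("Move1",   "distance",         "10 m"),
    ("Force1",  "instance-of",      "Exert-Force"),  # an Exert-Force event
    ("Force1",  "object",           "Object1"),      # on the same object
    ("Force1",  "causes",           "Move1"),        # that causes the Move
]
query = ("Force1", "net-force", "?")                 # the slot being asked for
```

Every term here (Move, Exert-Force, object, net-force) is a general term from the domain-neutral ontology, which is what allows the same formulation to be posed, unchanged, against knowledge bases the user has never seen.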
The next sections discuss the design of the ASKME components.
Restricted English
Producing formal representations from natural language is very difficult be-
cause natural language permits an enormous amount of expressive variation. Dif-
ferent writers may use a variety of styles and grammatical constructions to express
the same meaning. ASKME uses a controlled language, which avoids difficult prob-
lems such as resolving ambiguity and co-reference in natural language processing.
A controlled language is a form of language with special restrictions on grammar
and style usage that are based on well-established writing principles. Using a con-
trolled language improves the consistency and readability of a text by countering
the tendency of writers to use unusual or overly specialized language constructions.
Controlled language has been successfully deployed in industry and has been shown
to be useful in many technical settings[96, 112, 116]. The commercial success of
controlled language suggests that people can indeed learn to work with restricted
English.
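To give a feel for what "special restrictions on grammar and style" can mean in practice, the toy sketch below checks a sentence against a fixed vocabulary and bans pronouns, which would otherwise force co-reference resolution. It is a hypothetical simplification and not CPL, whose restrictions are far richer; like CPL (Chapter 4), though, it responds to violations with reformulation advice rather than bare rejections.

```python
# A toy controlled-language check; purely hypothetical, not CPL.
ALLOWED_WORDS = {"a", "an", "the", "man", "drives", "car", "to", "store"}
BANNED_PRONOUNS = {"it", "they", "he", "she", "this", "that"}

def check_sentence(sentence):
    """Return reformulation advice for anything outside the restricted English."""
    advice = []
    for word in sentence.lower().rstrip(".?").split():
        if word in BANNED_PRONOUNS:
            advice.append(f"replace the pronoun '{word}' with an explicit noun")
        elif word not in ALLOWED_WORDS:
            advice.append(f"'{word}' is outside the controlled vocabulary")
    return advice

print(check_sentence("A man drives it to the store."))
# -> ["replace the pronoun 'it' with an explicit noun"]
```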
Domain-Neutral Ontology
The English language with its large vocabulary allows users to express things
in subtly different ways. This expressiveness complicates the understanding of text
by computers, which involves mapping the many ways of expressing something in
English, into meaningful representations in the knowledge base. Automating this
task is complicated because what is written in English often does not fit into the
target ontology of the knowledge base. ASKME uses a domain-neutral ontology to
overcome the technical difficulty of choosing appropriate entries in the knowledge
base to create relevant meaning representations. In a variety of applications, domain-
neutral ontologies have been shown to be easily learned and used by novices to express
a wide variety of knowledge through the specialization and composition of these
general terms.
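The toy sketch below suggests how such an ontology supports specialization and composition: a small taxonomy of general terms lets a domain-specific statement be expressed without any domain-specific vocabulary. The taxonomy and names are hypothetical stand-ins for the Component Library described in Chapter 4.

```python
# A toy stand-in for a domain-neutral ontology: child -> parent taxonomy.
GENERAL_ONTOLOGY = {"Entity": None, "Object": "Entity",
                    "Event": None, "Move": "Event", "Exert-Force": "Event"}

def is_a(concept, ancestor):
    """Walk from a concept toward the root, checking for the ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = GENERAL_ONTOLOGY.get(concept)
    return False

# "The car brakes" needs no domain-specific 'Braking' term: it can be
# expressed by specializing the general Move concept and composing it
# with the car that plays its object role.
braking = [("Move1", "instance-of", "Move"), ("Move1", "object", "Car1")]
assert is_a("Move", "Event") and not is_a("Move", "Entity")
```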
Limited question types
The English language also allows a wide variety of question types. It is a
challenging task to enumerate all questions and implement the required process-
ing to correctly answer each question. ASKME requires the user to formulate the
original question in one of several well-known question types to make it easier for
computer processing. The supported questions are the simple and intermediate
difficulty questions in the Graesser-Person question taxonomy[49]. The difficulty
scale in the Graesser-Person taxonomy is defined by the amount and complexity of
content produced to answer a question. The set of simple and intermediate difficulty
questions has been found to be well known and to cover a majority of questions posed by
both students and tutors in a variety of settings.
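Chapter 4 (Table 4.5) summarizes how each supported question type can be represented as query triples. The sketch below illustrates the general idea of such a mapping; the category names are drawn from the Graesser-Person scheme, but the dispatch logic is invented for illustration and is not ASKME's implementation.

```python
# Hypothetical mapping from recognized question types to query triples,
# in the spirit of Table 4.5; the dispatch logic is illustrative only.
def to_query_triples(qtype, subject, slot=None):
    if qtype == "concept completion":   # e.g., "What is the net force?"
        return [(subject, slot, "?")]
    if qtype == "quantification":       # e.g., "How much force is exerted?"
        return [(subject, slot, "?")]   # the queried slot holds a quantity
    if qtype == "definition":           # e.g., "What is a lysosome?"
        return [(subject, "instance-of", "?")]
    raise ValueError(f"unsupported question type: {qtype}")

print(to_query_triples("quantification", "Force1", "net-force"))
# -> [('Force1', 'net-force', '?')]
```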
Question mediator
Given a question stated with general terms and relations that are common
among different knowledge bases, the question mediator’s challenge is to identify
information in a knowledge base to extend the question so as to enable the problem
solver to answer the question. The question mediator synthesizes a mini-knowledge
base (“minikb”) that contains the information needed to infer an answer. Initially,
the minikb contains only the information (triples) in the question. The question
mediator incrementally extends the minikb with frames (domain concepts) drawn
from the knowledge base. The frames include both domain assertions and inference
methods. The question mediator succeeds if it constructs a minikb that is sufficient
to answer the question. My approach to the question mediator consists of searching a
state space. The states are minikbs and operators select content from the knowledge
base being queried and add it to the minikb. I describe the question mediator using
the five components of state-space search: states, goal test, goal state, operators,
and the control strategy.
State: Each state in the state-space tree contains a minikb represented in a form
that is similar to conceptual graphs[106]. The initial state in the tree represents the
original minikb for the question. The minikbs in other states are elaborations of
the original minikb that contain additional information from other concepts in the
knowledge base.
Goal test and goal state: The goal test determines if a state’s minikb answers
the question. A state in the search graph containing such a minikb is known as a
goal state. The basic goal test retrieves and tests if the values to the queries in a
state’s minikb is null. Where applicable, the goal test includes additional queries
to improve the answer by retrieving similar kinds of information or finer-grained
details from the state’s minikb.
Operator: Operators elaborate a state to produce another. Operators are created
from the set of concepts in the knowledge base – these concepts are represented as
concept graphs. A semantic matcher takes a state’s minikb and a concept (encoded
in a form similar to conceptual graphs) and uses taxonomic knowledge to find the
largest connected subgraph that is isomorphic between the two representations.
The output of semantic matching forms an operator relating two states. Where
applicable, the question mediator transforms the candidate graphs to improve the
match between a state’s minikb and the concept. This transformation resolves
representational differences that may exist between a state’s minikb and concepts
in the knowledge base. Applying an operator to a state creates a successor state.
The successor state includes new information introduced by a concept from which
the operator was created. The new information is merged with the parent state
by joining [106] the representations on their common features found by semantic
matching.
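A drastically simplified sketch of this step appears below. Finding the largest connected isomorphic subgraph is a hard combinatorial problem, and the greedy triple-pairing here only hints at the idea of aligning features through taxonomic knowledge; every name in it is hypothetical.

```python
# Greedy, toy approximation of semantic matching over triples. ASKME's
# matcher finds the largest connected isomorphic subgraph; this does not.
TAXONOMY = {"Move": "Event", "Exert-Force": "Event"}  # toy taxonomic knowledge

def aligns(a, b):
    """Two labels align if they are equal or taxonomically related."""
    return a == b or TAXONOMY.get(a) == b or TAXONOMY.get(b) == a

def semantic_match(minikb, concept):
    """Pair triples that share a relation and have aligned endpoints."""
    return [(t1, t2)
            for t1 in minikb for t2 in concept
            if t1[1] == t2[1] and aligns(t1[0], t2[0]) and aligns(t1[2], t2[2])]

# The minikb's Move shares the initial-velocity feature with the concept,
# so the non-empty match below would become an operator in the search graph.
minikb  = [("Move", "initial-velocity", "Velocity")]
concept = [("Move", "initial-velocity", "Velocity"),
           ("Move", "acceleration",     "Acceleration")]
print(semantic_match(minikb, concept))  # one matched pair
```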
Control strategy: The question mediator expands the search graph in a breadth-first
manner to ensure that the minikb returned by the question mediator contains only
concepts necessary to answer the question. To achieve good performance and scal-
ability on large knowledge bases, the search controller applies heuristics to reject
operators that are not useful. It also orders operators on each ply based on their
relevance to answering the question.
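Putting the five components together, the control loop can be sketched as below. This is a minimal, assumption-laden rendering: the goal test, operator application, and relevance scoring are passed in as black boxes, and the priority queue stands in for the ply-by-ply expansion with relevance-ordered operators described above. Chapter 4 presents the actual controller as an instantiation of best-first search, with additional heuristics that reject useless operators.

```python
import heapq

def mediate(question_minikb, query, kb_concepts, answered, elaborate, score):
    """Sketch of the question mediator's search over minikbs (assumptions:
    answered(minikb, query)    -> an answer, or None  (the goal test)
    elaborate(minikb, concept) -> a successor minikb, or None when the
                                  semantic matcher finds no overlap
    score(minikb, concept)     -> degree of match; higher = more relevant)."""
    frontier = [(0.0, 0, question_minikb)]  # (priority, tiebreak, state)
    counter = 1                             # unique tiebreak: the heap never
                                            # has to compare two minikbs
    while frontier:
        _, _, minikb = heapq.heappop(frontier)
        answer = answered(minikb, query)
        if answer is not None:
            return answer, minikb           # reached a goal state
        for concept in kb_concepts:         # try to create operators
            child = elaborate(minikb, concept)
            if child is None:               # no semantic match: no operator
                continue
            heapq.heappush(frontier, (-score(minikb, concept), counter, child))
            counter += 1
    return None, None                       # no minikb answers the question
```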
1.3 Project Halo
The research of this dissertation was performed in the context of Project Halo.
A goal of Project Halo is to develop a knowledge-based question answering
system capable of answering novel (previously unseen) questions posed by untrained
non-experts using knowledge bases built by subject matter experts in a variety of
domains [114]. This goal creates a substantial challenge for automated question
answering systems: to be able to answer questions that are formulated without
regard for (or even familiarity with) the knowledge base that is expected to answer
them. A question answering system has successfully addressed this challenge if it can be
coupled with any of a variety of knowledge bases – each with its own independently
built ontology – and it can answer questions without requiring users to reformulate
the questions for each knowledge base that is used.
The ASKME prototype described in this dissertation is part of a larger
AURA system developed by a team of researchers to achieve the goals of Project
Halo. The AURA system enables subject matter experts to create knowledge bases
using concept maps, equations and tables – all of which are converted to computa-
tional logic [25]. In addition, the AURA system enables a different set of users, who
have limited domain expertise or familiarity with the knowledge base, to pose ques-
tions, like those found in Advanced Placement(AP) exams, and to receive coherent
answers and explanations.
The AURA system consists of three main components: (1) the knowledge
capture system; (2) the ASKME system; and (3) inference engines and explanation
generation.
SRI implemented the “Knowledge Capture” portions of the system and in-
tegrated the components into AURA.
ASKME was primarily developed at the University of Texas at Austin with
significant help from Peter Clark (Boeing Phantom Works). The CPL interpreter
used in ASKME was primarily developed by Boeing Phantom Works. The domain-
neutral ontology used in ASKME, the Component Library, has been under develop-
ment by the Knowledge Systems Group for the past ten years. I have been a major
contributor to the Component Library. The set of supported question types was
identified by myself, and I also designed and implemented the question mediator.
The integration of the various components of ASKME was done by Peter Clark
(Boeing Phantom Works) and myself.
The University of Texas at Austin also contributed the KM inference engine,
an equation solver, a conversion system for units of measurement, and an expla-
nation generator. KM was originally developed by Peter Clark (Boeing Phantom
Works) and Bruce Porter. The equation solver and conversion system for units of
measurement were developed by Gordon Novak. Explanation generation support
was primarily developed by Ken Barker.
Bonnie John of Carnegie Mellon University designed the user interface for
all parts of AURA.
BBN Technologies helped in the evaluation of the AURA system.
Participating in Project Halo provides a platform to study the difficulties
faced by users in employing unfamiliar knowledge bases. First, the AURA system
makes available the components for performing knowledge formulation, reasoning,
and explanation generation. The availability of these components lets me study the
thesis without having to build them myself. Second, the prototype’s performance in
achieving the goals of Project Halo is rigorously evaluated by an independent eval-
uation team that did not participate in the design and development of the AURA
system[33, 72]. Participating in this evaluation gives me access to data (e.g., knowl-
edge bases, question formulations, and user logs) I can use to study aspects of
ASKME.
1.4 Summary of Evaluation and Results
I assessed the baseline performance of ASKME on the task of answering questions
like those found in the AP exam. The question-answering task involved users posing
AP-like questions to the system to retrieve solutions from the knowledge base. The
Project Halo team chose the AP test as an evaluation criterion because it is a
widely accepted standard for testing whether a person has understood the content
of a given subject. The team chose the domains of college level biology, chemistry,
and physics because they are fundamental, hard sciences and they stress different
kinds of representations[33].
The evaluation consists of successive experiments to test if ASKME can help
novice users employ unfamiliar knowledge bases for problem solving. The initial
experiment measures ASKME’s level of performance under ideal conditions, where
the knowledge base is built and used by the same knowledge engineers. Succes-
sive experiments measure ASKME’s level of performance under increasingly realis-
tic conditions, where, ultimately, the final experiment measures ASKME’s level of
performance under conditions where the knowledge base is independently built by
subject matter experts and the users of the knowledge base are a different group of
novice users who are unfamiliar with the knowledge base.
The first experiment established the level of performance for ASKME oper-
ating under ideal conditions. These ideal conditions are similar to the way knowl-
edge base question-answering systems have traditionally been used: the people who
built the knowledge base were the ones who used it. Additionally, the set of AP-like
questions used in this evaluation was known to the knowledge engineers building the
knowledge base. The set of knowledge bases used in the experiment covered por-
tions of a science textbook in biology, chemistry, and physics. Although the ideal
conditions avoid the problem of novice users employing unfamiliar knowledge bases
for problem solving, the experiment provides a baseline for judging the contribution
of ASKME under less ideal, but typical conditions. Results from the experiment
show that ASKME increases the number of correctly answered questions by identifying
relevant information in the knowledge base for problem solving.
The second experiment presented a brittleness study to provide a fair measure
of ASKME’s ability to answer AP-like questions. The setup for this experiment is
similar to the previous one except that it uses a different set of AP-like questions that
were not made available to the knowledge engineers when they built the knowledge
base. The purpose of the brittleness study is to identify the major failures preventing
ASKME from correctly answering questions. Results from the study indicate that
failures to answer the questions correctly arise from gaps in the knowledge base,
bugs in the implementation, or unsupported question types.
I claim that ASKME’s algorithms take advantage of features of the generic
knowledge representation language, but are independent of the specific content and
organization in the knowledge bases. The third experiment tested this claim by
measuring ASKME’s performance at answering questions using knowledge bases
authored by different users. The experiment tests whether ASKME can continue
to answer questions when the original knowledge base is exchanged for one with
different content and organization. Results from the experiment show that ASKME
continues to answer questions correctly despite the change in knowledge base.
The fourth experiment evaluated my conjecture that ASKME is able to an-
swer questions that are formulated by users who do not know the ontology of the
knowledge base to which they are posed. The experiment was conducted by an inde-
pendent evaluation team[72] that did not participate in the design and development
of ASKME. Nine undergraduates were recruited to participate in this experiment. These undergraduates had little experience in knowledge representation and no familiarity with the knowledge base being queried. They underwent a four-hour training session that taught the basics of using ASKME to pose and receive answers to AP-like questions. The answers and explanations returned by ASKME were then graded by independent graders with experience at grading AP exams. Results from the experiment show that novice users can achieve correctness scores that are comparable to (and in some cases surpass) those of the knowledge engineers who built the knowledge base. The experiment shows that, with ASKME, deep familiarity with the knowledge base does not provide a big advantage, and that novice users can effectively use ASKME to interact with unfamiliar knowledge bases for the task of answering AP-like exam questions.
The fifth experiment evaluated ASKME’s level of performance when operat-
ing under realistic conditions, where the knowledge bases are independently built by
subject matter experts and the users of the knowledge base are novice users who are
unfamiliar with the knowledge base. The experiment was conducted by the same
independent evaluation team as in the previous experiment. In this experiment, a
group of subject matter experts with little training in knowledge representation and reasoning was tasked to create a set of knowledge bases for the three science domains: biology, chemistry, and physics. After the knowledge bases were built, a different group of novice users was then tasked to query the knowledge bases using ASKME
to answer a set of AP-like questions. As a control, knowledge engineers were also
tasked to use ASKME to attempt question answering with the independently-built
knowledge bases. The answers and explanations returned by ASKME were then
graded by independent graders with experience at grading AP exams. Results from
the experiment continue to show that novice users can achieve correctness scores that are comparable to (and in some cases surpass) those of the knowledge engineers who built the knowledge base.
1.5 Organization of Dissertation
The remainder of this dissertation is organized as follows:
1. Chapter 2 discusses the technical challenges facing ASKME.
2. Chapter 3 surveys related work on overcoming the difficulty in using knowledge
base systems.
3. Chapter 4 presents my approach to answering questions formulated by users un-
familiar with the knowledge base.
4. Chapter 5 describes an evaluation of ASKME’s ability to answer questions
posed by users unfamiliar with the knowledge base.
5. Chapter 6 summarizes the dissertation and lists possible areas for future re-
search.
Chapter 2
Technical Challenges
2.1 Introduction
ASKME is designed to help users avoid the difficulty of using unfamiliar knowledge
bases. This chapter describes three technical challenges faced by ASKME.
ASKME’s first technical challenge is to parse questions posed in natural lan-
guage and create a correct meaning representation of the text. Producing formal
representations from natural language poses a challenge because natural language
permits an enormous amount of expressive variation and much of what is commu-
nicated in language is not explicitly stated. A sample of the linguistic processing
challenges includes handling prepositional phrase attachment, pronoun resolution,
reference resolution, compound noun interpretation, word sense disambiguation, se-
mantic role labeling, proper noun interpretation, units of measure, comparatives,
negation, and ellipsis.
ASKME must also resolve representational differences between questions for-
mulated by the user and representations in the knowledge base. Such representa-
tional differences arise for several reasons: the knowledge bases may be built and used by different people; they may be built with an expressive ontology that permits the same meaning to be represented in multiple ways; and they may be large and complex, covering broad domains.
The third technical challenge of ASKME is to perform relevance reasoning.
First, it is well known that traditional problem solving methods such as deductive
reasoning and heuristic classification do not scale well on large knowledge bases.
Therefore, relevance reasoning is necessary for ASKME to work well with knowledge
bases covering broad and complex domains. Second, a knowledge base may return
different answers to a question depending on how it is used or how the question is formulated. Thus, heuristics are needed to select a relevant answer.
2.2 Linguistic Processing Challenges
Linguistic challenges are concerned with parsing the original sentences and creating
a correct meaning representation of the text. All the usual linguistic challenges
fall into this category. These include tasks like prepositional phrase attachment,
pronoun resolution, reference resolution, compound noun interpretation, word sense
disambiguation, semantic role labeling, proper noun interpretation, units of mea-
sure, comparatives, negation, and ellipsis. I briefly discuss four types of linguistic challenges (attachment preferences, co-reference resolution, ellipsis, and interrogatives) using the following question (these examples are inspired by an earlier report by Clark et al.[27]):

“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity stops the piece of fruit briefly before it falls back down. Calculate the acceleration at a height of 2 m during the movement upwards. Is the acceleration greater than 2 m/s^2?”
Attachment preferences
One of the first steps in natural language processing (NLP) is to parse the input
sentences to determine the syntactic structure of the text. Resolving attachment
ambiguities poses a pervasive problem in parsing natural language. A syntactic
parser often encounters phrases that can be attached to two or more different nodes
in the parse tree, and must decide which one is correct. Depending on how the
nodes are attached, there can be significant differences in meaning. For example,
a standard NLP system may return the following two syntactic parses for the first sentence of the example question.
Attachment preference #1
“A banana vp[is thrown pp[into the air] pp[with an initial velocity of
20 m/s]]. Gravity stops the piece of fruit briefly before it falls back down. Calculate
the acceleration at a height of 2 m during the movement upwards. Is the acceleration
greater than 2 m/s^2?”
Attachment preference #2
“A banana is thrown into np[the air pp[with an initial velocity of 20
m/s]]. Gravity stops the piece of fruit briefly before it falls back down. Calculate the
acceleration at a height of 2 m during the movement upwards. Is the acceleration
greater than 2 m/s^2?”
The two parses have different interpretations, depending on whether “initial velocity”
is attached to “air”, or to “throw”. Picking the correct parse automatically in a
reliable and accurate manner is challenging and requires pragmatic and contextual
knowledge to infer that “velocity” is more commonly associated with “throw” in the
context of physics problems.
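
To make this decision concrete, the following sketch (in Python; the preference table, its scores, and the function names are invented for this illustration and are not part of ASKME) ranks candidate attachment sites with a hand-built co-occurrence table standing in for the pragmatic knowledge described above:

# Illustrative attachment-preference sketch. The table below is a stand-in
# for pragmatic knowledge: in physics problems, "with a velocity" modifies
# the throwing event rather than the air.
PREFERENCES = {
    ("throw", "with", "velocity"): 0.9,
    ("air", "with", "velocity"): 0.1,
}

def attachment_score(head, preposition, noun):
    """Return a preference score for attaching pp[preposition noun] to head."""
    return PREFERENCES.get((head, preposition, noun), 0.5)

def pick_attachment(candidates):
    """Choose the candidate (head, preposition, noun) with the highest score."""
    return max(candidates, key=lambda c: attachment_score(*c))

print(pick_attachment([("throw", "with", "velocity"),
                       ("air", "with", "velocity")]))
# ('throw', 'with', 'velocity'), i.e., attachment preference #1

A real system would derive such preferences from corpus statistics or from constraints in the knowledge base rather than from a fixed table.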
Co-reference resolution
An important subtask in natural language processing is to determine if two ex-
pressions in natural language refer to the same entity in the world. This task is
commonly known as co-reference resolution.
“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity
stops the piece of fruit briefly before it falls back down. Calculate the acceleration
at a height of 2 m during the movement upwards. Is the acceleration greater than 2
m/s^2?”
An example of co-reference resolution is the task of resolving what “it” refers to in
our example question. Does “it” refer to “gravity” or “the piece of fruit”? Performing
this task requires commonsense knowledge that “gravity” does not fall and that a
“piece of fruit” can fall. Similarly, commonsense knowledge is necessary to realize
the intended co-reference between “banana” and “piece of fruit”.
“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity
stops the piece of fruit briefly before it falls back down. Calculate the acceleration
at a height of 2 m during the movement upwards. Is the acceleration greater than 2 m/s^2?”
Another example of co-reference resolution is to resolve references for “the acceler-
ation” and “the movement” in the example question. One approach is to simply
associate both references with the Throw event; however, this is not ideal because a Throw
consists of two subevents (e.g., Propel and Let-Go-Of) happening sequentially:
first an object is propelled, then the object is released. The ideal resolution is to
associate “the acceleration” and “the movement” with the Propel event that is
a subevent of Throw. To perform this successfully, NLP systems must infer the
presence of a Propel event after the Throw. This type of co-reference resolution
is also sometimes referred to as “indirect anaphora resolution”. The challenge here
is not only having commonsense knowledge that a Propel follows a Throw, but
also the mechanisms to apply commonsense to language interpretation to resolve
such references.
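
The following sketch (in Python; the vocabulary sets and the single selectional restriction are invented for this illustration and are not ASKME's mechanism) shows how a commonsense constraint of this kind can filter candidate antecedents for a pronoun:

# Illustrative pronoun-resolution sketch: candidates incompatible with the
# verb's commonsense selectional restriction are filtered out.
CAN_FALL = {"banana", "piece of fruit"}   # physical objects can fall
CANNOT_FALL = {"gravity"}                 # forces do not fall

def resolve_pronoun(candidates, verb):
    """Return the candidate antecedents compatible with the verb."""
    if verb == "fall":
        return [c for c in candidates if c in CAN_FALL]
    return list(candidates)

# "Gravity stops the piece of fruit briefly before it falls back down."
print(resolve_pronoun(["gravity", "piece of fruit"], "fall"))
# ['piece of fruit']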
Ellipsis
Much of what is communicated in language is not explicitly stated, which makes
extracting content from natural language difficult. Such incomplete utterances range
from sentences that fail to include all requisite semantic information to syntactically
incomplete sentence fragments.
“A banana is thrown into the air [travelling upwards] with an initial
velocity of 20 m/s. Gravity stops the piece of fruit briefly before it falls back down.
Calculate the acceleration at a height of 2 m [above ground, whose default
height is 0 m] during the movement upwards. Is the acceleration greater than 2
m/s^2?”
Our example question contains two missing arguments that need to be identified.
The NLP system has to recognize “thrown into the air” to mean “travelling up-
wards” and that “heights” are by default above the ground (which also has a de-
fault height of 0 m). Humans perform well at this task because they can leverage a
vast amount of commonsense knowledge to help resolve ambiguities and fill in the
missing arguments. This is not the case for computers. Automatically identifying these missing arguments requires a mixture of linguistic knowledge and common-
sense. Precisely how this should be done is a difficult problem for natural language
systems.
Question types
ASKME’s design goal is to be capable of answering a wide variety of questions. This
is a very challenging goal, considering that the expressiveness of language admits a
wide variety of questions differing in form and function.
“A banana is thrown into the air with an initial velocity of 20 m/s. Gravity
stops the piece of fruit briefly before it falls back down. Calculate the acceleration
at a height of 2 m during the movement upwards. Is the acceleration
greater than 2 m/s^2?”
Although questions are typically posed using interrogative sentences containing cues
such as question marks or wh-words (e.g., who, what, when, where, why, and how),
it is not sufficient to simply use these criteria to identify and categorize questions.
Some expressions, such as “Would you pass the ball?”, have the grammatical form
of questions but actually function as requests for action, not for answers. Further,
questions may be stated without using wh-words. For example, questions may
be phrased as imperative sentences expressing commands such as “Calculate the
acceleration at a height of 2 m.” as opposed to “What is the acceleration at a
height of 2 m?”, or they may be posed as a comparative such as “Is the acceleration
greater than 2 m/s^2?”. Therefore, systems that depend heavily on the use of question
marks or wh-words for clues may have difficulty handling a wide variety of question
types.
Without doing a good job of analyzing the types of questions and the answers
expected, it is difficult to build a system to process questions and identify relevant
information in the knowledge base for answering them.
2.3 Resolving Representational Differences
Representational differences (e.g., the same meaning represented differently) be-
tween questions and the knowledge base are common in Project Halo because:
1. the knowledge bases are built by subject-matter experts and are intended to
be used by a different group of non-experts.
2. the knowledge bases are built by extending an expressive ontology, which per-
mits the same content to be expressed in multiple ways.
3. the knowledge bases are large and complex, covering broad domains.
Resolving representational differences is important for achieving good ques-
tion answering performance in Project Halo. I describe three types of representa-
tional differences that are commonly encountered in Project Halo:
1. granularity differences
2. viewpoint differences
3. modeling differences
Granularity differences
Consider the biology question “How many chromosomes are in a cell?”. The ques-
tion can be answered by looking up the Human-Cell concept and returning the
number of chromosomes. However, the assertion that the human cell has 46 chro-
mosomes might be encoded in the knowledge base as either:
Simpler Representation:
Human-Cell -has-part-> (46 Chromosome)

or

Richer Representation:
Human-Cell -has-part-> Nucleus -has-part-> DNA -has-part-> (46 Chromosome)
Question #1: An entity contains chloroplast and cell walls. What is this entity?
Question #2: What is the agent of an event consuming carbon dioxide and producing oxygen?

Figure 2.1: Two questions answered using different viewpoints of the Plant concept shown in Figure 2.2.
Without resolving granularity differences, the automated reasoner answers the question correctly using the simpler representation, but fails using the richer representation.
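
One way to tolerate this particular difference is to treat has-part as transitive when answering part-counting questions. The sketch below (in Python; the dictionary encoding and helper name are mine, not the actual reasoner) returns 46 for both encodings by following has-part chains:

# Illustrative granularity-tolerant lookup: count target parts reachable
# through chains of has-part links, multiplying cardinalities along the way.
SIMPLE_KB = {"Human-Cell": {"has-part": [("Chromosome", 46)]}}

RICH_KB = {
    "Human-Cell": {"has-part": [("Nucleus", 1)]},
    "Nucleus":    {"has-part": [("DNA", 1)]},
    "DNA":        {"has-part": [("Chromosome", 46)]},
}

def count_parts(kb, frame, target):
    """Count instances of target reachable from frame via has-part."""
    total = 0
    for part, n in kb.get(frame, {}).get("has-part", []):
        total += n if part == target else n * count_parts(kb, part, target)
    return total

print(count_parts(SIMPLE_KB, "Human-Cell", "Chromosome"))  # 46
print(count_parts(RICH_KB, "Human-Cell", "Chromosome"))    # 46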
Viewpoint differences
I describe two types of viewpoint differences.
First, a concept can have different viewpoints. For example, the Plant
concept (shown in Figure 2.2(a)) can be viewed as a container for multiple or-
ganelles such as the Chloroplast, Vacuole, and Ribosomes, enclosed inside
a Cell-Wall. This same concept can also be viewed as an organism that con-
verts light and carbon dioxide into chemical energy, carbohydrates, and oxygen.
Although both views describe the same concept, their representations include dif-
ferent features. Recognizing different viewpoints is necessary to identify portions of
a concept relevant to answering questions.
Figure 2.1 lists two questions answered using different views of the Plant
concept. To answer the first question “An entity contains chloroplast and cell walls.
What is this entity?”, ASKME has to identify the container viewpoint of Plant
shown in Figure 2.2(b). To answer the second question “What is the agent of an
event consuming carbon dioxide and producing oxygen?”, ASKME has to identify
the consumer/producer viewpoint of Plant shown in Figure 2.2(c).
(a) Representation for the Plant concept
(b) The parts of the graph highlighted in bold show the viewpoint describing Plant as containing different organelles
(c) The parts of the graph highlighted in bold show the viewpoint describing Plant as a consumer-producer. The plant consumes light and CO2 to produce O2, carbohydrates, and chemical energy

Figure 2.2: The Plant concept and its two viewpoints
A second type of viewpoint difference arises when information to answer ques-
tions is represented contextually. The frame-based representation used in Project
Halo provides a convenient representation for contextual information within the
<frame slot value> paradigm[2]. This representation minimizes the proliferation
of frames corresponding to concepts that are important in very limited contexts,
without requiring additional rules and reasoning mechanisms. To achieve good
performance using knowledge bases created in Project Halo, it is necessary to use
information represented contextually. Consider the question “What is the state of
water (H2O)?”. There can be multiple answers to the question depending on the different contexts in which H2O is referred to. For example, the state of water can be a gaseous substance when it is the result of the reaction C2H4 + 3O2 −→ 2CO2 + 2H2O, or a liquid substance when it is the result of the reaction CaOH2 + 2HI −→ CaI2 + 2H2O. Figure 2.3 lists other examples in physics and biology where the use of contextual information yields different answers.
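
The sketch below (in Python; the frame and slot names are invented for this illustration, not the Halo knowledge bases) conveys the idea of context-sensitive lookup: the state of H2O is read from the enclosing reaction frame when one is given, and falls back to the context-free value otherwise.

# Illustrative context-sensitive slot lookup in a <frame slot value> style.
KB = {
    "H2O": {"state": None},  # no context-free state value
    "Ethylene-Combustion":  {"result": {"H2O": {"state": "gas"}}},
    "CaOH2-HI-Reaction":    {"result": {"H2O": {"state": "liquid"}}},
}

def state_of(substance, context=None):
    """Prefer the value asserted within the given context, else the global one."""
    if context is not None:
        contextual = KB[context]["result"].get(substance, {})
        if contextual.get("state") is not None:
            return contextual["state"]
    return KB[substance]["state"]

print(state_of("H2O"))                        # None
print(state_of("H2O", "Ethylene-Combustion")) # gas
print(state_of("H2O", "CaOH2-HI-Reaction"))   # liquid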
Modeling differences
The users in Project Halo make many modeling decisions during knowledge forma-
tion and question formulation. There are many opportunities for modeling differ-
ences because the representations are built using an expressive ontology to cover
broad domains. Resolving modeling differences is difficult, as it requires the users of
the knowledge base to be aware of the modeling decisions made during knowledge
acquisition. I give two examples of modeling differences.
Figure 2.4 shows the first example, a physics question whose representation differs from the expected representation in the knowledge base.
The question representation contains two modeling differences with the knowl-
edge base. First, the question was described as a single Move event, while the
knowledge base contains a richer representation, where the Move is caused by an
Exert-Force. Second, the query in the question is on the force slot, while the necessary equations to answer the question are stored on the net-force slot in the knowledge base.
Biology example
Question: Is it true that interphase occurs before mitosis?
Answer: No. But it is true in the context of Mitotic-Cell-Cycle. In a mitotic cell cycle, interphase and mitosis are subevents, and interphase occurs before mitosis.

Chemistry example
Question: What is the state of water (H2O)?
Answer: Typically H2O does not have a state value.
H2O is gaseous in the context of the reaction C2H4 + 3O2 −→ 2CO2 + 2H2O
H2O is a liquid in the context of the reaction CaOH2 + 2HI −→ CaI2 + 2H2O

Physics example
Question: What is the acceleration of a move?
Answer: Acceleration typically does not have a value.
In the context of a Free-Fall, acceleration is 9.8 m/s^2
In the context of an Up-Down-Free-Fall, acceleration is −9.8 m/s^2
In the context of a Motion-with-Force, acceleration = force/mass

Figure 2.3: Questions having multiple answers using different information contexts in the knowledge base.
The question goes unanswered when these modeling differences between the question and the knowledge base remain unresolved.
In a second example of modeling differences, experts creating knowledge
bases often introduce bias and assumptions in their representations. Questions go unanswered when assumptions expected by the knowledge base are not stated in
the question. In Figure 2.5, the sentences rendered in italics introduce commonly
assumed values for the initial x position, initial y speed, and final y position necessary
to infer an answer on a typical physics knowledge base.
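
A scenario-elaboration step can repair this second kind of difference by installing commonly assumed values into slots the user left unbound, turning a version-1 question into a version-2 question (see Figure 2.5 below). The sketch that follows (in Python; the slot names and default values are illustrative assumptions, not the actual Halo representation) leaves stated values untouched:

# Illustrative assumption installation: unbound slots receive commonly
# assumed defaults; slots the user stated are left alone.
PROJECTILE_DEFAULTS = {
    "initial-y-speed":    (0.0, "m/s"),
    "initial-x-position": (0.0, "m"),
    "final-y-position":   (0.0, "m"),
}

def elaborate(question_frame):
    """Fill unbound slots with domain defaults."""
    for slot, default in PROJECTILE_DEFAULTS.items():
        question_frame.setdefault(slot, default)
    return question_frame

throw = {"vertical-distance": (30.0, "m"), "horizontal-distance": (100.0, "m")}
print(elaborate(throw))
# the two stated distances are kept; the three defaults above are added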
2.4 Relevance Reasoning
I describe two scenarios in which relevance reasoning is useful. The first scenario is
for yielding good performance.
Knowledge base systems typically suffer from poor performance as the size
of the knowledge base grows. It is well known that traditional inference methods
such as deductive reasoning and heuristic classification do not scale well on large
knowledge bases. During the Halo pilot, the chemistry question-answering system
developed using the CYC knowledge base took approximately 27 hours to answer
the chemistry question set[44].
In Project Halo, the systems are intended for a wide variety of users in
a production environment. Performance is important, as these systems have to
be sufficiently responsive to captivate and engage the user. The AURA system
is designed to return an answer to a majority of questions within 60 seconds[33].
Therefore, relevance reasoning is important to ensure that the knowledge relevant
to answering questions is found quickly and correctly. This is especially important
for large knowledge bases covering broad and complex domains.
Figure 2.4: The question representation contains two modeling differences with the knowledge base. First, the question was described as a single Move event, while the knowledge base contains a richer representation, where the Move is caused by an Exert-Force. Second, the query in the question is on the force slot, while the necessary equations to answer the question are stored on the net-force slot in the knowledge base.
Question (Version 1)
A water balloon is thrown.
The vertical distance of the throw is 30 meters.
The horizontal distance of the throw is 100 meters.
What is the duration of the throw?

Question (Version 2)
A water balloon is thrown.
The vertical distance of the throw is 30 meters.
The horizontal distance of the throw is 100 meters.
The initial y speed of the throw is 0 m/s.
The initial x position of the throw is 0 meters.
The final y position of the throw is 0 meters.
What is the duration of the throw?

Figure 2.5: This question highlights the need to automatically install unstated assumptions commonly present in questions. The sentences in version 2 that are rendered in italics introduce commonly assumed values for the initial y speed, initial x position, and final y position.
The second scenario in which relevance reasoning is useful is for returning
relevant answers. A knowledge base may return multiple answers to a question
because different subsets of the knowledge base may be used to answer the question.
Some subsets may return incorrect answers, making the task of finding appropriate
subsets to infer the correct answer challenging. Consider the question:
“A parachutist’s friend pushes him down from an airplane with an initial
velocity of 2 m/s. How far did he fall after 10 seconds?”.
In Figure 2.6, different answers to the question were returned using the same
knowledge base. One of them is incorrect because a subset of the knowledge base
assumed acceleration to be 9.8 m/s^2 instead of -9.8 m/s^2.
Relevance reasoning may also be used to return relevant answers to open-
ended questions. Such questions are common in the biology and chemistry domains.
Question
A parachutist's friend pushes him down from an airplane with an initial velocity of 2 m/s. How far did he fall after 10 seconds?

Derived answer #1 (Incorrect)

Answer
h = 470 m

Explanation
free-fall : An event in which an object is
in free fall at 9.8 m/s^2 (i.e. the typically
assumed acceleration due to Earth's gravity).

Assumptions:
* We assume upward positive.

Given:
* a = 9.8 m/s^2
* theta2 = -90.0 deg
* v2 = -2 m/s
* t = 10 s
* u = -2 m/s
* g = 9.8 m/s^2
* h = u * t + ((1 / 2) * g) * t^2
* s = h

Solving for s ...
s = h
h = u * t + ((1 / 2) * g) * t^2
.:. h = 470 m
.:. s = 470 m

Therefore, the distance of the fall (h) = 470 m

Derived answer #2 (Correct)

Answer
h = -510 m

Explanation
up-down-free-fall : A move of an object in a
gravity field of 9.8 m/s^2. The object in
free fall may have an initial upward velocity
(by virtue of being thrown, fired, etc),
followed by an apex velocity of 0, followed
by a downward velocity.

Given:
* a = -9.8 m/s^2
* theta2 = -90.0 deg
* v2 = -2 m/s
* t = 10 s
* u = -2 m/s
* g = -9.8 m/s^2
* h = u * t + ((1 / 2) * g) * t^2
* s = h

Solving for s ...
s = h
h = u * t + ((1 / 2) * g) * t^2
.:. h = -510 m
.:. s = -510 m

Therefore, the distance of the fall (h) = -510 m

Figure 2.6: This example highlights the difficulty in finding relevant information in the knowledge base to return correct answers. Two answers to the question are shown. Both answers were returned by the problem solver using the same knowledge base. One of them is incorrect because it assumed acceleration to be 9.8 m/s^2 instead of -9.8 m/s^2.
Question
What is the function of lysosomes?

Answer
A lysosome plays the role of a container.

Desired Answer
A lysosome plays the role of a container, ..., and is an instrument enclosing Digestive-Enzymes.

Figure 2.7: Using a typical biology knowledge base, the problem solver describes Lysosomes as playing the role of a container. Ideally, the returned answer should describe the specific container role played by Lysosomes (rendered in italics).
There can be different answers depending on the amount of detail expected by the
user. These expectations are often assumed and seldom captured in the question
formulation. The example listed in Figure 2.7 illustrates the problem for the question
“What is the function of lysosomes?”. A naive answer to the question would be to
simply describe lysosomes as containers, which is an accurate but incomplete
answer in this context. Ideally, the answer should also describe the container role
played by lysosomes as an instrument enclosing digestive enzymes (rendered in italics
in Figure 2.7).
2.5 Summary
I study ASKME as part of Project Halo where the builders and the users of the
knowledge base are different. Project Halo’s vision is to develop a computer program
capable of answering novel questions in a broad range of scientific disciplines[9, 114].
The types of questions targeted in Project Halo are unlikely to be answered using search
or database lookup because the answers to these questions are not stored for simple
retrieval in the knowledge base. Answering these questions requires the user to
interact with the knowledge base for problem solving. To perform this interaction
successfully, a user has to learn the terms and relations in the knowledge base,
the valid constructions for those terms and relations, and must identify the most
appropriate representation in the knowledge base for problem solving. Therefore,
users face a steep learning curve to become proficient at using a knowledge base for
problem solving.
Project Halo’s success requires that ASKME help users overcome the diffi-
culty of using unfamiliar knowledge bases for problem solving. ASKME’s role is to
process questions as they are stated in English and identify relevant information in
the knowledge base for problem solving. This requires that ASKME overcome three
technical challenges.
The first technical challenge is to parse the original sentences and create
a correct meaning representation of the text. All the usual linguistic challenges
arise. These include tasks like prepositional phrase attachment, pronoun resolution,
reference resolution, compound noun interpretation, word sense disambiguation,
semantic role labeling, proper noun interpretation, quotations, parentheticals, units
of measure, comparatives, negation, temporal reference, and ellipses.
The second technical challenge is to resolve representational differences when
the same piece of knowledge is represented differently. There are a variety of rea-
sons why such representational differences are common in Project Halo. First, the
knowledge bases are built and used by different people. Second, the knowledge bases
are built by extending an expressive ontology, which permits the same content to
be expressed in multiple ways. Third, the knowledge bases are large and complex,
covering broad domains. Thus, resolving representational differences is necessary to
achieve good question answering performance in Project Halo.
The third technical challenge is to perform relevance reasoning. In Project
Halo, the systems are to be used by a wide variety of users to query large, complex
knowledge bases that cover broad domains. These knowledge bases perform poorly
with traditional problem solving methods and may even return different answers to a
question when different subsets of the knowledge base are used for reasoning. Some
of these answers may be incorrect or lack the details desired in a good answer. Thus,
relevance reasoning is necessary to help the problem solver focus on the relevant
portions of the knowledge base so as to yield good runtime performance and answers.
Chapter 3
Related Work
Figure 3.1: Different generations of knowledge-based systems and their contributions to question-answering
An important goal of research on knowledge-based systems is building sys-
tems that allow novice users to pose questions and receive answers, using knowledge
bases encoded by others. Work in knowledge based systems can be categorised into
four generations of systems, as shown in Figure 3.1.
Knowledge Based Systems | Challenges addressed
BASEBALL (1961), LUNAR (1972) | Answered questions posed by novice users by extracting information stored in databases.
MYCIN (1970s), XCON/R1 (1978) | Answered questions by carrying out reasoning with domain knowledge acquired from subject matter experts with the help of knowledge engineers.
Cyc (1984 - Present), ThoughtTreasure (1998), OpenMind (2002) | Large repositories of commonsense knowledge that can be used for question answering.
Systems built during HPKB (1998) | Demonstrated systems that can be quickly built and usefully applied to question-answering tasks by knowledge engineers.
Systems built during RKF (2002) | Demonstrated systems that SMEs can use to author knowledge bases that can be usefully applied to question-answering tasks with little help from KEs.
Project Halo (2003 - Present) | Aims to demonstrate systems that SMEs can use to author knowledge bases that can answer questions posed by a different set of users unfamiliar with knowledge representation or the knowledge base being queried.

Table 3.1: Summary of notable knowledge based systems and the challenges addressed by them
The earliest systems responded to questions using answers explicitly stored in databases. The next generation of
systems added inference rules to reason with domain expertise, and later genera-
tions involved efforts in scaling up to larger knowledge bases by adding commonsense
knowledge and facilitating domain experts in authoring the knowledge themselves.
More recently, efforts such as Project Halo have focused on mitigating the brittleness
in knowledge base interaction caused by the arm’s-length separation between the
experts who author the knowledge bases and the novice users posing questions to
them. Table 3.1 gives an overview of the different challenges addressed by notable
knowledge based systems of the past.
Natural language interfaces to databases. Early question-answering systems
focused on helping novice users access information stored in databases. The task of
extracting relevant information from a database could be difficult for novice users,
as it required familiarity with the contents of the database and other programming
tricks necessary for extracting the data. Typical natural language interfaces to
databases work by either pattern matching keywords in a user’s question with terms
in the database or by translating the question into a particular set of commands
to navigate the information in the database. Two effective and notable natural
language frontends to databases were the BASEBALL[51] and LUNAR[118] sys-
tems. These systems demonstrated good performance in handling questions posed
by novice users to access information stored in databases, without requiring users
to learn the structure of the database or a specialised language for querying it. The
LUNAR system answered questions about the geological analysis of rocks returned
by the Apollo moon missions. It demonstrated particularly good performance at a
lunar science convention in 1971, answering questions posed by geologists untrained
on the system. A good summary describing database accessor systems is found in
Androutsopoulos et al.[4].
Reasoning with domain expertise. The utility of early question answering
systems was limited to answering questions whose answers were explicitly stored in
the database. Thus, questions that required reasoning with domain expertise could not be answered by these database accessor systems. The next generation of question
answering systems, called expert systems, answered questions by reasoning with the
domain knowledge and heuristics used by real experts. Expert systems mimic the
problem solving behavior of experts and are typically used to solve problems that
do not have a single correct solution that can be encoded in a conventional algo-
rithm or stored in a database. The knowledge base in an expert system is narrowly
focused and engineered to perform well on a class of questions predetermined at
design time. Acquiring the domain expertise is difficult and domain experts work
alongside knowledge engineers to encode “rules of thumb” on how to evaluate and
solve pre-determined problems. Novice users interact with the system using a user
interface that restricts the questions that can be posed to the system. These sys-
tems predetermine the logical forms for the restricted set of questions at design time.
When users pose questions via the user interface, the required logical forms are then
generated and used to derive the answers to the queries. The MYCIN (for diagnos-
ing blood diseases and recommending antibiotics)[17] and XCON/R1 (for computer
equipment order processing)[71] systems are classic examples of successful expert
systems.
Knowledge Acquisition. Building an expert system is generally expensive and
tedious, as it requires knowledge engineers to work closely with domain experts. The
next generation of systems tried to address the problem by reducing the cost and
complexity in building knowledge bases. For example, the systems built during the
Defense Advanced Research Projects Agency (DARPA)’s High-Performance Knowl-
edge Base (HPKB)[32] project demonstrated tools to quickly build large, broadly
applicable knowledge based systems by reusing content authored by different knowl-
edge engineers¹. Further reducing the time and effort in building large knowledge
based applications will require subject matter experts (SME) to enter knowledge
directly. This was a goal of DARPA’s Rapid Knowledge Formation (RKF)[109]
project and it demonstrated systems SMEs can use to quickly develop knowledge
bases applicable to question answering tasks. In the HPKB and RKF projects, the
functional performance of the developed systems was evaluated by measuring how
well the knowledge bases answered a set of unseen questions. The questions were
posed by the same users authoring the knowledge base, using a set of pre-determined
domain specific question templates. The completed question templates were then
used to generate detailed logical forms that were in turn used by the reasoner to
answer the question. More details on the HPKB and RKF evaluations are described
in Schrag et al.[102] and Clark et al.[28], respectively.
Question-Answering. Arguably, the performance of the systems built during
the HPKB and RKF projects would have suffered if a different set of users, unfa-
miliar with knowledge representation or the knowledge base being queried, posed
the questions. The separation between the builders and the users of the knowledge
bases continues to be a source of brittleness in the interaction between knowledge
bases authored by experts and their novice users. Systems such as ISAAC²[86] and
MECHO[18, 19] have previously demonstrated that systems could correctly answer
certain types of high school level physics questions as originally stated in the exams.
Both systems, however, are likely to perform poorly at answering questions like
those found on an AP exam using knowledge bases authored by other experts. The MECHO and ISAAC systems are brittle because they were engineered to work well for narrow domains and cannot be easily adapted to query other knowledge bases.

¹ In a later effort similar to HPKB, the Project Halo pilot phase also demonstrated systems that were collaboratively built by different knowledge engineers. The competing systems attempted previously unseen Advanced Placement (AP) exam questions in the chemistry domain and attained AP scores similar to the mean human score. These systems demonstrated that knowledge based systems can answer unseen complex questions using complex knowledge bases that are quickly built in a short amount of time (4 months).

² The ISAAC system was later extended to include diagrams as an additional modality for stating questions[87].
Commonsense Knowledge. Newell and Ernst[82] observed that one reason that
knowledge based systems are brittle is their lack of general commonsense knowledge
about the world. That is, as knowledge engineers begin to debug why questions
or problem solving tasks fail, they find the need to add commonsense, rather than
domain-specific, knowledge. For instance, one source of brittleness faced by the ISAAC system is the difficulty of inferring unstated assumptions from a question. To answer questions about projectile motion, for example, it is often
necessary to assume that air resistance can be ignored, that the projectile is near
Earth and that the only gravitational force acting on the projectile is due to Earth.
Thus, commonsense knowledge about the world is necessary to achieve good perfor-
mance for a variety of knowledge based applications. Several projects are formalizing
large commonsense knowledge bases to improve the performance of knowledge based
systems. These efforts include CYC[69], ThoughtTreasure[78], and the OpenMind
project[105]. The challenge here is not only building a large commonsense knowl-
edge base, but also identifying the mechanisms necessary for applying commonsense
to various problem solving tasks to improve overall system performance.
ASKME. Question answering tasks have become increasingly challenging in re-
cent years. Annual competitions for information extraction and information retrieval
have “begun to require direct answers and their explanations instead of returning
factoids or extracting relevant passages from a text” [24, 70]. Answering and ex-
plaining the solutions to these questions will require the use of domain knowledge
and deeper reasoning capabilities commonly found in knowledge based applications.
Our work advances the state of the art in knowledge based question answering by
mitigating the brittleness due to the arm's-length separation between the builders
and the novice users of the knowledge base.
Chapter 4
Approach
ASKME helps users with different levels of domain expertise use unfamiliar knowl-
edge bases for problem solving. ASKME is intended to work well in a variety of
domains and knowledge bases, each having its own ontology. ASKME is designed
to answer questions as they are posed in natural language. To make the task eas-
ier for computers, ASKME requires users to formulate their questions in restricted
English. The four major pieces of ASKME are: (a) a version of restricted English (Computer Processable Language), (b) a domain-neutral ontology (the Component Library), (c) mechanisms to handle a closed set of well known question types, and (d) a program, called the question mediator, that identifies relevant information in the knowledge base for problem solving. Taken together, these four pieces function (metaphorically) as a funnel that distills the original question into a simpler form that can be more easily processed by the computer for problem solving.
4.0.1 Restricted English
Producing formal representations from natural language is very difficult because
natural language permits an enormous amount of expressive variation. This ex-
pressiveness allows different writers to use a variety of styles and grammatical con-
structions to express the same meaning. Although it would be an easy task for
human readers to reconcile different representations that have the same meaning,
it is not so easy for computers. This difficulty stems from the fact that much of
what is communicated in language is not explicitly stated and computers perform
poorly at extracting content from the language. Humans perform well at this task
because they can leverage a tremendous amount of commonsense knowledge to per-
form a variety of linguistic processing tasks (e.g., resolve ambiguities, determine
anaphoric reference, fill in ellipses). In contrast, machines continue to lack com-
monsense knowledge as well as the tools to usefully apply commonsense to guide
the language understanding process.
For question answering, the Project Halo team chose to support a simpler
version of English to avoid difficult problems in natural language processing. Con-
sequently, ASKME places special restrictions on grammar, style, and vocabulary
usage that are based on well established writing principles. These restrictions have
the effect of improving the consistency and readability of text by countering the
tendency of writers to use unusual or overly specialized language constructions that
are difficult for the computer to process. For convenience, Table 4.1 lists verbatim (as originally produced by Kittredge[61]) some of the known properties of restricted
English that are important to making natural language input easier to process by
computers.
4.0.2 Domain-Neutral Ontology
A key step in interacting with knowledge bases is to select the appropriate entries in a
knowledge base for problem solving. A history of economic importance has endowed
English with an enormous vocabulary (e.g., the Oxford English Dictionary contains
over 300,000 words). In contrast, knowledge bases typically contain a handful to a few hundred concepts relevant to a domain.
Property
1. restricted lexicon (and possibly including special words not used elsewhere in the language)
2. a relatively small number of distinct lexical classes (e.g., nouns or nominal phrases denoting <body part>) which occur frequently in the major sentence patterns
3. restricted sentence syntax (e.g., some sentence patterns found in literature seem to be rare in scientific or technical writing)
4. deviant sentence syntax that are unusual in the standard language
5. restricted word co-occurrence patterns which reflect domain semantics (e.g., verbs take limited classes of subjects and objects; nouns have sharp word-class restrictions on their modifiers)
6. restricted text grammar (e.g., stock market summaries typically begin with statements of the overall market trend, followed by statements about sectors of the market that support and go against the trend, followed by salient examples of stocks which support or counter the trend)
7. different frequency of occurrence of words and syntax patterns from the norm for the whole language – each controlled language has its own statistical profile, which can be used to help set up preferred interpretations for new texts.

Table 4.1: Some of the known properties of restricted English that are important to making natural language input easier to process by computers (originally produced by Kittredge[61]; reproduced here for convenience).
This selection task involves mapping the many ways of expressing something
in English into meaningful representations in the knowledge base. Automating this
selection task is complicated because what is written in English often does not fit
into the target ontology of the knowledge base. Some of the reasons include (1)
an incomplete mapping from English words to terms and relations in the knowl-
edge base, (2) a literal interpretation that violates constraints in the ontology, or
(3) the knowledge cannot be expressed in the underlying knowledge representation
language.
ASKME uses a domain-neutral ontology to overcome the technical difficulty
of choosing appropriate entries in the knowledge base to create relevant meaning
representations. Its design is influenced by experience in controlled languages and
lexicography in which small vocabularies have been found to be easy to learn and
are sufficient to express a wide variety of knowledge without using uncommon terms
unfamiliar to many users[1, 96, 112, 116].
For example, although the Longman Dictionary of Contemporary English
contains 45,000 entries and 65,000 word senses, all the definitions in the dictionary
are defined by a vocabulary of about 2000 words[95].
ASKME’s domain-neutral ontology must be satisfactory in three areas: cov-
erage, access, and semantics. First, the ontology has to have broad coverage because
ASKME is intended to be domain independent. To accomplish this, we need not
enumerate all the terms in the English vocabulary and install one-to-one mappings
between words and entries in the knowledge base. Indeed, doing so would just make
the knowledge base huge and difficult to use. Instead, we desire a small set of well
known terms and relations that are commonly employed by everyday users. From
this small set of terms and relations, a wide variety of meaning can be represented
via specialization and composition to achieve good coverage.
Second, the ontology has to be intuitive and accessible because its intended
users are accustomed to expressing their questions with English. Ideally, the knowl-
edge base should not contain so many entries that a distinct mapping cannot be
identified. The senses for a word should be easily identifiable and highly distin-
guishable from alternative choices, to avoid having multiple concepts that are closely
related but differ only in subtle ways. Such concepts will greatly burden the
mapping process and yield poor results.
Third, the terms of the ontology should be grounded with deep semantics for
reasoning with an automated reasoner. Each concept should be richly axiomatized
to encode the meaning of the component as well as how it interacts with other
components. These axioms should also be general enough to be reusable for encoding
detailed domain knowledge, and they should be written in such a way that they are
consistent with the axioms of other concepts during composition.
4.0.3 Question Taxonomy
My long-term research goal is to create systems capable of answering a wide vari-
ety of questions. This is a very challenging goal, considering that the expressiveness of language admits a wide variety of questions that differ in form and function.
Although questions are typically posed using interrogative sentences contain-
ing cues such as question marks or wh-words (e.g., who, what, when, where, why,
and how), it is not sufficient to simply use these criteria to identify and categorize
questions. First, some expressions, such as “Would you pass the ball?”, have the
grammatical form of questions but actually function as requests for action, not for
answers. Second, questions may be stated without using wh-words. For example,
questions may be phrased as imperative sentences expressing commands such as
“Calculate the acceleration at a height of 2 m.” as opposed to “What is the accel-
eration at a height of 2 m?”, or they may be posed as a comparative such as “Is
the acceleration greater than 2 m/s^2?”. Therefore, systems that depend heavily on
the use of question marks or wh-words for clues may have difficulty handling a wide
variety of question types.
A good taxonomy of types of questions should meet three criteria. First, it
has to have broad coverage because ASKME is intended for use in a variety of do-
mains. Ideally, ASKME achieves broad coverage using a small set of question types
that are commonly used in everyday communication. From this small set of ques-
tion types, users can effectively state their question intent (via multiple questions
or through composition) to retrieve answers from the knowledge base.
Second, the set of supported questions has to be intuitive and accessible to
everyday users who are accustomed to expressing their questions with English.
Third, the set of supported questions should be grounded with deep seman-
tics for reasoning with an automated reasoner. These semantics should be well
understood and easy to implement. Each question type should be richly axioma-
tized to encode the meaning of the question, as well as how it interacts with the
problem solver and the knowledge base. These axioms should also be general enough
to be reusable for a variety of domains and different knowledge bases.
4.0.4 Question Mediator
The question mediator’s goal is to synthesize a “minikb” containing information rele-
vant to inferring an answer. Given the initial minikb containing only the information
(triples) originally stated in the question, the question mediator incrementally ex-
tends the minikb with frames (domain concepts) drawn from the knowledge base.
The frames include both domain assertions and inference methods. The mediator
succeeds if it constructs a minikb that is sufficient to answer the question.
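
The following sketch (in Python) conveys the overall shape of this process; it is a simplification in which answerable and candidate_frames are trivial stand-ins for the reasoner and the relevance heuristics, not the actual implementation:

# Illustrative minikb construction: grow a set of triples from the question
# until the query becomes answerable, drawing in one frame at a time.
def answerable(minikb, query):
    """Stand-in: the query is answerable once its slot appears in the minikb."""
    return any(triple[1] == query for triple in minikb)

def candidate_frames(minikb, kb):
    """Stand-in: frames sharing at least one term with the minikb."""
    terms = {t for triple in minikb for t in triple}
    return [f for f, triples in kb.items()
            if any(term in terms for triple in triples for term in triple)]

def mediate(question_triples, kb, query, max_steps=50):
    """Extend a minikb with frames from the kb until it answers the query."""
    minikb = set(question_triples)
    for _ in range(max_steps):
        if answerable(minikb, query):
            return minikb
        frames = [f for f in candidate_frames(minikb, kb)
                  if not kb[f] <= minikb]   # skip frames already drawn in
        if not frames:
            break
        minikb |= kb[frames[0]]             # draw in one frame's triples
    return None                             # no sufficient minikb found

kb = {"Move":        {("Move", "caused-by", "Exert-Force")},
      "Exert-Force": {("Exert-Force", "net-force", "mass * acceleration")}}
question = {("Move", "object", "bicycle"), ("Move", "distance", "10 m")}
print(mediate(question, kb, query="net-force") is not None)  # True

A real mediator would rank the candidate frames by relevance rather than taking the first one, as discussed below.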
For ASKME to be useful for novice users interacting with unfamiliar knowl-
edge bases, it has to work well in a variety of domains and different knowledge
bases. The design of the question mediator must, therefore, satisfy two conditions. First, because there are many opportunities for the same meaning to be represented differently when knowledge bases are built and used by different groups of users, the question mediator must have mechanisms to resolve representational differences that may exist between question formulations and the knowledge base. Second, because different subsets of the knowledge
base may be used to answer a question, the question mediator may have to ex-
plore a large portion of the knowledge base before identifying the relevant piece of
knowledge for problem solving. For ASKME to perform well on knowledge bases
of varying sizes, the question mediator should give priority to information in the
knowledge base that is highly relevant to answering the question.
4.1 The ASKME Prototype
I have built a prototype to study whether the ASKME approach helps users interact
with unfamiliar knowledge bases. The ASKME prototype consists of a variety of
systems that have been deployed in both production and academic settings. The
prototype uses a version of restricted English called Computer Processable Lan-
guage(CPL) and an upper ontology called the Component Library(CLib). The
question types supported by ASKME are from a well studied question taxonomy.
I designed and implemented a program that performs the question mediation task.
The knowledge bases used by the system are conventional, consisting of a set of
frames in an inheritance hierarchy[3, 8, 15, 52, 69]. Each frame represents a domain
concept, such as Eukaryotic-Cell in biology, Metathesis-Reaction in chem-
istry, and Fall-from-Rest in physics. Frames encode declarative assertions with
associated inference methods, such as rules for automatic classification [15] and com-
putation methods such as, in physics, an equation to compute velocity from distance
and time. Domain experts build the knowledge bases by extending the concepts in
the Component Library[8].
The intended users of the knowledge base are novices in the subject matter,
who will use the knowledge base to answer questions. ASKME users formulate their
questions using Computer Processable Language (CPL) [30]. These questions are
then interpreted and represented in a logical form (described with general terms and relations from the knowledge base) for processing by the problem solving system.
ASKME consists of several components including the CPL interpreter from Boeing
Phantom Works[30], additional word sense disambiguation and semantic role label-
ing technologies developed in our research group [38, 39, 40, 121], and a prototype
Scenario Elaborator that uses heuristics to identify the applicable assumptions in
a question [88]. Using the question formulation generated by CPL, the question
mediator identifies relevant information from the knowledge base for problem solv-
ing. The derived answer is then explained with domain specific terminology using
explanation generation support provided by KM[29] and the CLib[8]. Figure 4.1
shows the prototype system's end-to-end processing of a sample question.
4.1.1 Computer Processable Language
ASKME creates formal representations from restricted English input using a con-
trolled language interpreter, called Computer Processable Language (CPL), that
was originally developed by Boeing Phantom Works[26, 30, 31]. The CPL inter-
preter consists of a syntactic parser called SAPIR[55], a logical form (LF) genera-
tor, an initial logic generator, and subsequent processing modules. CPL currently
“deals with a subset of linguistic phenomena, including nominalizations, passives,
plurals, prepositional phrases, relative clauses, direct anaphora, and a limited form
of conjunction”[30, 31]. For our purposes, Boeing further “extended CPL to handle
chemical equations, interrogatives, indirect anaphora, comparatives, variables, and
physical quantities”[31].
Figure 4.1: A graphical table of contents for the detailed example given in Figures 4.2 and 4.3. ASKME answers the user's question in three steps: interpreting the question using the CPL processor, selecting domain knowledge to answer it using the question mediator, and generating an explanation.
Figure 4.2: In panel 1, a physics question is posed to the system in simplified English. The system interprets the question as shown in panel 2. The scenario and query of the question is interpreted as a Move event on an Object having mass 80 kg. The initial and final velocity of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same object of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question-mark. ASKME's processing continues in Figure 4.3.
Figure 4.3: The continuation of the example from Figure 4.2. In panel 3, the question mediator draws in information from the knowledge base. The final answer and explanation are shown in panel 4.
Example CPL Sentences
A man picks up a large box from a table.
The man carries the box across the room.
The man is sweeping the powder with a broom.
Two vehicles drive past the factory's hangar doors.
The narrator is walking past racks of equipment.
The narrator is standing beside a railing beside a stormwater outfall.

Table 4.2: Example CPL sentences, taken verbatim from Clark et al.[30] and reproduced here for convenience.
the form:
subject + verb + complements + adjuncts
“where complements are obligatory elements required to complete the sentence, and
adjuncts are optional modifiers”[30]. The CPL grammar does not allow pronouns
and requires users to use definite references[30]. Table 4.2 lists some examples of CPL sentences.
Users follow a set of guidelines while writing CPL sentences. Some of the
guidelines are stylistic recommendations to reduce ambiguity, while others are firm
constraints on vocabulary and grammar. Table 4.3 lists some example guidelines
taken verbatim from Clark et al.[30]. The full list of guidelines along with examples
is given in the CPL user’s guide[31, 110].
CPL sentences are converted into logic in three main steps, namely pars-
ing, generation of an intermediate logical form (LF), and conversion of the LF to
statements in the KM knowledge representation language[30].
CPL performs parsing using SAPIR[55], a mature, bottom-up, broad cover-
age chart parser[30, 31]. SAPIR generates an intermediate logical form (LF) that
is a simplified and normalized tree structure with logic-type elements[30]. The LF
Guideline
1. Keep sentences as short and simple as possible.
2. Use just one clause per sentence.
3. Assume the computer has no common sense. State the obvious in the question.
4. Identify and describe the objects, events, and their properties involved in the question.
5. Use "a" to introduce an item, and "the" to refer back to it.
6. Use "first" and "second" to distinguish two of the same kind of items.
7. Use "there is a ..." if needed to introduce an object.
8. Do not mix groups and group members in a scenario.
9. Avoid using pronouns; instead refer using a name ("the block", "the table").
10. Begin a question with the words "what is", "what are", "how many", "how much", or "is it true".
11. Restate a multiple choice question as a set of simple questions.
12. Ask for just one value in a single question.
13. Set up a question by talking about one specific object.
14. Use a question or statement in place of a command.
15. Always include a unit of measure after a numerical value.

Table 4.3: Users follow a set of guidelines while writing CPL sentences. Some of the guidelines are stylistic recommendations to reduce ambiguity, while others are firm constraints on vocabulary and grammar. These guidelines are taken verbatim from Clark et al.[30].
is generated by rules parallel to the grammar rules, and contains variables for noun
phrases and additional expressions for other sentence constituents[30]. The resulting
LF is used to generate ground KM assertions.
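To make this step concrete, the following Python sketch (illustrative only; the actual SAPIR logical forms and KM syntax are considerably richer, and all names here are hypothetical) shows how a simplified LF for a sentence such as "A man carries the box" might be flattened into ground assertions:

    # Toy illustration, not the actual SAPIR/KM machinery: flatten a
    # simplified logical form into ground triples for a KM-style KB.
    lf = ("carry", {"agent": "man", "object": "box"})   # hypothetical LF

    def lf_to_triples(lf):
        verb, roles = lf
        counter = [0]
        def new_instance(word):
            counter[0] += 1
            return "_%s%d" % (word.capitalize(), counter[0])
        event = new_instance(verb)
        triples = [(event, "instance-of", verb.capitalize())]
        for role, noun in roles.items():
            filler = new_instance(noun)   # a fresh instance per noun phrase
            triples.append((filler, "instance-of", noun.capitalize()))
            triples.append((event, role, filler))
        return triples

    print(lf_to_triples(lf))
    # [('_Carry1', 'instance-of', 'Carry'), ('_Man2', 'instance-of', 'Man'),
    #  ('_Carry1', 'agent', '_Man2'), ('_Box3', 'instance-of', 'Box'),
    #  ('_Carry1', 'object', '_Box3')]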
Because we are working in restricted domains, word sense disambiguation and semantic role labeling are considerably easier than in a broad-coverage application, as the choice of senses is constrained to the concepts in the knowledge
base[31]. The logic generator first performs a straightforward transformation of the
LF to first-order logic syntax. Subsequent processing modules then perform word
sense disambiguation, semantic role labeling, co-reference resolution, and some lim-
ited metonymic and other transformations[26, 30, 31]. WordNet is used to help expand the vocabulary of words recognized by CPL. To disambiguate a word, first its WordNet senses are found, and then CPL looks to see if any are mapped directly to a concept in the knowledge base (each concept in the knowledge base has a list of associated WordNet synsets). If one or more are found, the most likely sense is
selected based on corpus frequency statistics. If not, CPL searches up WordNet’s
hypernym tree until a synset that is mapped to a concept in the knowledge base is
found. In this way, specific words like “bicycle” and “cliff” can be used by the user,
although they are not explicitly stated in the knowledge base, as they are mapped
through this process to more general concepts in the knowledge base.
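A minimal sketch of this hypernym-climbing lookup, under assumed toy data (the HYPERNYMS and KB_CONCEPT tables below are hypothetical stand-ins for WordNet and the synset-to-concept mappings; the real CPL additionally ranks senses by corpus frequency):

    # Hypothetical WordNet-style interface: each synset has hypernym (more
    # general) synsets; some synsets are mapped to knowledge-base concepts.
    HYPERNYMS = {"bicycle.n.01": ["wheeled_vehicle.n.01"],
                 "wheeled_vehicle.n.01": ["vehicle.n.01"],
                 "vehicle.n.01": ["conveyance.n.03"]}
    KB_CONCEPT = {"vehicle.n.01": "Vehicle"}  # synset -> KB concept mapping

    def disambiguate(synsets):
        """Climb the hypernym tree until a synset mapped to a KB concept is
        found. `synsets` lists the WordNet senses of the input word, assumed
        ordered by corpus frequency (most likely sense first)."""
        frontier = list(synsets)
        while frontier:
            # Prefer a direct mapping at the current level of generality.
            for s in frontier:
                if s in KB_CONCEPT:
                    return KB_CONCEPT[s]
            # Otherwise move one level up the hypernym tree.
            frontier = [h for s in frontier for h in HYPERNYMS.get(s, [])]
        return None

    print(disambiguate(["bicycle.n.01"]))  # -> "Vehicle"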
While having a controlled language makes interpretation more reliable, it also
introduces the challenge of having the users learn to use it. To facilitate this, Boeing
integrated two additional components into CPL, namely, an advice system that
detects CPL errors and provides reformulation advice, and an interpretation display
system that presents the system’s understanding of the question (both as English
paraphrases and graphically) so the user can check that the system understood
correctly[31].
Original question: A ball is thrown from a cliff. The horizontal velocity of the object is 20 m/s. The height of the cliff is 125 m.

CPL (controlled English) → Logic → Question-Answering

Paraphrase of system's understanding: A ball is the object of a throwing. A cliff is the origin of the throwing. The velocity is equal to 20 meter-per-second(s). The velocity is horizontal. The cliff has a height of 125 m.

Rewriting advice
Figure 4.4: The user poses questions in restricted English. CPL tries to understand the question. The CPL interpreter provides reformulation advice if it detects CPL errors. Otherwise, it presents the system's understanding of the question both as an English paraphrase and graphically for the user to check if CPL understood the question correctly. This figure is courtesy of Peter Clark at Boeing Phantom Works.
CPL detects grammatical violations when grammar rules outside the scope
of CPL, but within the scope of full English, are used in a formulation. If there
are CPL errors, the system highlights them and offers rewriting advice (the advice messages are canned text, not an automatic rewrite of the user's actual input) to help the user fix the mistakes. If the user's reformulation is valid CPL, the system creates
a logical interpretation, and shows its understanding back to the user in two ways.
One, as a set of English paraphrases of the logic, i.e., the question is “round-tripped”
from the user’s CPL, to logic, to a machine generated English paraphrase. Two,
as a graph, whose nodes and edges represent terms and relations in the knowledge
base. The user can then inspect the logical form and edit the graph or reformulate
the original sentences to repair the CPL interpretation. The CPL interpretation
display system allows the user to validate if the system has understood the question
correctly. Both the graph and paraphrases are intuitive visualizations allowing users
to identify interpretation errors without becoming familiar with the content of the
knowledge base being queried.
A more complete description of the CPL interpreter can be found in [26, 30,
31].
4.1.2 The Component Library
The Component Library (CLib)[8] is used as the upper ontology in ASKME. It has
been under development for about ten years by the Knowledge System Group at
the University of Texas at Austin. I am a member of the group and one of the main
contributors to the Component Library.
The Component Library consists of a set of generic Event, Entity,
and Role concepts (so called “components”) and a language for combining them.
The design of the component library emphasizes coverage (allowing users to encode
knowledge in a variety of domains), access (components satisfying user expectations
can be found easily), and semantics (components are general enough to be used in
a variety of contexts, but specific enough to express non-trivial knowledge)[8].
Coverage: The components in the CLib are diverse enough that they can be used
to compose a wide variety of domain knowledge. The CLib achieves coverage with a
small set of concepts (a few hundred) and a closed set of relations (about 80). The
concepts are inspired by English lexical resources (such as dictionaries, thesauri,
and English word lists) and are organized into an inheritance hierarchy in which the
top level concepts are: Entities, Events, and Roles. Unlike traditional approaches
that achieve coverage through the enumeration of a large number of concepts (e.g.,
Cyc[69]), coverage in the CLib is primarily achieved through composition. Compo-
sition consists of specifying relationships between instantiated components so that
additional implications can be computed.
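As a hypothetical illustration (in the spirit of the physics example used later in this chapter, not taken from the CLib itself), a composition is just a set of instantiated components plus the relations asserted between them:

    # Illustrative only: composing generic components into domain knowledge.
    instances = {
        "Exert-Force1": {"instance-of": "Exert-Force", "object": "Bicycle1"},
        "Move1": {"instance-of": "Move", "object": "Bicycle1",
                  "distance": "10 m"},
    }
    relations = [("Exert-Force1", "causes", "Move1")]
    # The axioms attached to Exert-Force and Move then license additional
    # implications, e.g. that the net-force of Exert-Force1 determines the
    # acceleration of Move1.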
Access: To make the library intuitive and accessible to users accustomed to ex-
pressing knowledge with natural language, the set of general terms and relations in
the CLib is heavily influenced by English language usage. The set of concepts is in-
spired by English lexical resources (such as dictionaries, thesauri, and English word
lists), and the concepts are carefully selected to be highly distinct so as to minimize confusion due
to subtle differences.
Semantics: Each concept in the CLib is also richly axiomatized to encode the
meaning of the component as well as how it interacts with other components. These
axioms are general enough to be reusable for encoding detailed domain knowledge,
and are written in such a way that they are consistent with the axioms of other
components during composition.
4.1.3 Supported questions
ASKME supports a total of eight question types. To determine the question types
to support, I worked with subject matter experts to identify the most common
question types and their required processing for a corpus of AP-like exams. The
question types supported by ASKME are also heavily influenced by the Graesser-
Person question taxonomy[49].
The Graesser-Person taxonomy classifies questions by the complexity in-
volved in producing a good answer to a question. The taxonomy was developed
by studying questions posed by individuals in a variety of settings[20]. There are
16 categories in the taxonomy (see Table 4.4) and they are scaled by depth, which
is defined as the amount of information and complexity involved in producing a
good answer to the question. In their analysis, Graesser and Person differentiated among simple shallow questions (categories 1-4), intermediate questions (categories 5-8), and complex deep questions (categories 9-16). The Graesser-Person scale of
depth has been validated to correlate significantly with both Mosenthal’s[77] scale
of question depth and the original Bloom’s[14] taxonomy of cognitive difficulty[50].
Ideally, ASKME would support all 16 question categories in the Graesser-
Person taxonomy. I did not do so for two reasons. First, in my analysis, as well as
that discussed in the literature, deep questions are not common. In a study con-
ducted on “a corpus of multiple choice questions on psychology in college textbooks
and a Graduate Record Examination practice book, it was found that only 23%
of the questions were classified as deep questions according to the Graesser-Person
taxonomy, and 21% were classified as deep questions according to the Mosenthal
scale”[50]. Second, although it is relatively easy to implement the required mech-
anisms to generate good answers for simple and intermediate questions, this is not
the case for deep questions. Answering deep questions requires sophisticated prob-
lem solvers which entail serious engineering work. Therefore, for practical reasons,
I chose to only implement the set of simple and intermediate question types in the
Graesser-Person taxonomy.
4.2 The Design of the Question Mediator
I describe the design of the question mediator in three parts. First, I describe a
basic search controller to systematically search the knowledge base for information
to extend the question formulation for automated reasoning. This search controller
is described using the five components of state-space search: states, goal test, goal
state, operators, and the control strategy. I then describe extensions to the search
controller to resolve representational differences that may exist between question
formulations and the knowledge base. Finally, I describe heuristics used to perform
relevance reasoning.
Depth | Question category | Abstract specification | Example
1 | Verification | Is a fact true? Did an event occur? | Is the answer 5?
2 | Disjunctive | Is X or Y the case? Is X, Y, or Z the case? | Is gender or female the variable?
3 | Concept completion | Who? What? What is the referent of a noun argument slot? | Who ran this experiment?
4 | Feature specification | What qualitative attributes does entity X have? | What are the properties of a bar graph?
5 | Quantification | What is the value of a quantitative variable? How many? | How many degrees of freedom are on this variable?
6 | Definition | What does X mean? | What is a t test?
7 | Example | What is an example label or instance of the category? | What is an example of a factorial design?
8 | Comparison | How is X similar to Y? How is X different from Y? | What is the difference between a t test and an F test?
9 | Interpretation | What concept or claim can be inferred from a static or active pattern of data? | What is happening in this graph?
10 | Causal antecedent | What state or event causally led to an event or state? | How did this experiment fail?
11 | Causal consequence | What are the consequences of an event or state? | What happens when this level decreases?
12 | Goal orientation | What are the motives or goals behind an agent's action? | Why did you put decision latency on the y-axis?
13 | Instrumental/procedural | What instrument or plan allows an agent to accomplish a goal? | How do you present the stimulus on each trial?
14 | Enablement | What object or resource allows an agent to perform an action? | What device allows you to measure stress?
15 | Expectational | Why did some expected event not occur? | Why isn't there an interaction?
16 | Judgmental | What value does the answerer place on an idea or advice? | What do you think of this operational definition?

Table 4.4: For convenience, we reproduce verbatim the question categories, abstract specification, and associated examples from the Graesser-Person question classification scheme[49].
4.2.1 A search controller to find information to answer questions
To ground the description of the question mediator’s search controller, consider the
following physics question:
“A cyclist must stop her bike in 10 m. She is traveling at a velocity of 17
m/s. The combined mass of the cyclist and the bicycle is 80 kg. What is the force
required to stop the bike in this distance?”
For this sample question, relevant information must be combined from the
Motion-under-force and Motion-with-constant-acceleration concepts
for the reasoner to infer an answer. Figure 4.5 shows the question and the knowledge
necessary to answer it. The search graph explored by the question mediator to
answer the question is shown in Figure 4.6.
States
Each state in the state-space contains a representation of knowledge drawn from
the user’s question formulation and from the knowledge base that is intended to
answer it. I call each state a “minikb”. The initial state contains only the question
formulation. The other states are elaborations of the initial state, constructed by
operators in the state space that elaborate the minikbs.
Figure 4.7 shows the minikbs for states 1, 2, and 5 in the search graph of
Figure 4.6. State 1 is the initial state in the search graph and it contains the
original minikb for a user’s question. State 2 results from elaborating State 1 using
the Motion-with-constant-acceleration concept in the knowledge base. This
elaboration introduced an equation to calculate the acceleration of the Move event.
Using this equation, the reasoner calculated the acceleration of the Move event
to be -14.45 m/s². Further elaborating state 2 using the Motion-under-force
concept results in state 5, which introduces other equations. In this case, state 5
!"#$%&'
()*$'
+,--'
./.0,12*$1)%.&3'
4/,12*$1)%.&3'
56$7&28)7%$' %,9-$-'
)"#$%&'
/$&2:)7%$'
;<'=>'
?' <'+@-'
AB'+@-'
A<'+'
)"#$%&'
C.-&,/%$'
!"#$%&'
()0)/2D.&E2%)/-&,/&2,%%$1$7,0)/'()0)/29/C$72:)7%$'
+,--'
./.0,12*$1)%.&3'
4/,12*$1)%.&3'
56$7&28)7%$' %,9-$-'
)"#$%&'
,%%$1$7,0)/'/$&2:)7%$'
;<'=>'
2AAFGH' AIJIF'+@-K' <'+@-'
AB'+@-'
A<'+'
)"#$%&'
C.-&,/%$'
!"#$%&'(!L '4/,12*$1)%.&3K'M'./.0,12*$1)%.&3K'N'OK'P',%%$1$7,0)/'P'C.-&,/%$Q'L '/$&2:)7%$'M'+,--'P',%%$1$7,0)/'
)#*(%&'+ ,-'-./+
)#*(%&'+
R'%3%1.-&'+9-&'-&)S'E$7'".=$'./'A<'+J''
TE$'.-'&7,*$1./>',&','*$1)%.&3'):'AB'+@-J''UE$'%)+"./$C'+,--'):'&E$'%3%1.-&',/C'".%3%1$'.-';<'=>J''
VE,&'.-'&E$':)7%$'7$W9.7$C'&)'-&)S'&E$'".=$'./''&E.-'C.-&,/%$?'
!"#$%&'(!
Figure 4.5: On the left-hand side, the scenario and query of the question is represented as a Move event on an Object having mass 80 kg. The initial and final velocities of the Move are 17 m/s and 0 m/s respectively. The distance of the Move is 10 m. There is also an Exert-Force event whose object is the same as that of the Move event. The Exert-Force event causes the Move event. The query is on the net-force of the Exert-Force and is the node with a question-mark. To answer the question, the question mediator has to draw in information from the knowledge base. As shown on the right-hand side, information from the Motion-under-force and Motion-with-constant-acceleration concepts is used to elaborate the question for the reasoner to compute the answer.
Figure 4.6: Search graph created by the question mediator to answer the question introduced in Figure 4.5. Each state in our state-space tree contains a minikb. The initial state in the tree represents the original minikb for the question. The minikbs in other states are elaborations of the original minikb. Each operator in the tree describes how a concept in the knowledge base can be applied to elaborate a minikb to produce another.
contains the necessary equations to derive the net-force causing the Move event
to be -1156 Newtons.
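Working through the numbers of the example for concreteness (this is ordinary kinematics plus Newton's second law; the knowledge base encodes these same equations in the two concepts named above):

    # Worked computation for the cyclist example.
    mass = 80.0         # kg
    v_initial = 17.0    # m/s
    v_final = 0.0       # m/s
    distance = 10.0     # m

    # From Motion-with-constant-acceleration:
    #   v_final**2 = v_initial**2 + 2 * acceleration * distance
    acceleration = (v_final**2 - v_initial**2) / (2 * distance)  # -14.45 m/s^2

    # From Motion-under-force: net_force = mass * acceleration
    net_force = mass * acceleration                              # -1156.0 N
    print(acceleration, net_force)

The negative sign reflects that the force opposes the direction of motion.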
Goal Test and the Goal State
The eight question types supported by ASKME are represented as query triples posed to the minikb. Complex questions can be expressed by chaining multiple query triples.
Table 4.5 summarizes how each question type can be represented as query triples.
The goal test determines if a state answers the question. The goal test is
implemented as a projection operator, which is a common solution for matching
queries with documents in semantic information retrieval systems[46, 53, 57, 106].
The projection operator maps a query triple to a state’s minikb if it is isomorphic to
a subgraph in the minikb, and when every concept and relation in the query triple
subsumes its corresponding concept and relation in the minikb.
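The following Python sketch shows the shape of this projection test over triples, with a toy SUBSUMES table standing in for the taxonomy (the real matcher operates over full graphs and checks subsumption on relations as well as concepts):

    # Toy subsumption table: concept -> set of concepts it subsumes.
    SUBSUMES = {"Move": {"Move", "Motion-with-constant-acceleration"},
                "Event": {"Event", "Move", "Motion-with-constant-acceleration"}}

    def subsumes(general, specific):
        return specific in SUBSUMES.get(general, {general})

    def project(query_triple, minikb):
        """Map a (head, relation, value) query onto a minikb of ground
        triples. The query matches a triple when its head subsumes the
        triple's head and the relations agree; a '?' in the value position
        retrieves the filler."""
        head, rel, val = query_triple
        for (h, r, v) in minikb:
            if subsumes(head, h) and rel == r and (val == "?" or val == v):
                return v
        return None

    minikb = [("Motion-with-constant-acceleration", "acceleration",
               "-14.45 m/s^2")]
    print(project(("Move", "acceleration", "?"), minikb))  # -> "-14.45 m/s^2"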
!"#$%&'
()*$'
+,--'
./.0,12*$1)%.&3'
4/,12*$1)%.&3'
56$7&28)7%$'
%,9-$-'
)"#$%&'
/$&2:)7%$'
;<'=>'
?'
<'+
@-'
AB'+
@-'
A<'+
'
)"#$%&'
C.-&,/%$'
!"#$%&'
()0)/2D
.&E2%)/-&,/&2,%%$1$7,0)/'
+,--'
./.0,12*$1)%.&3' 4/,12*$1)%.&3'
56$7&28)7%$'
%,9-$-'
)"#$%&'
,%%$1$7,0)/'
/$&2:)7%$'
;<'=>'
?'
2AFGFH'+
@-I'
<'+
@-'
AB'+
@-'
A<'+
'
)"#$%&'
C.-&,/%$'
!"#$%&'
()0)/2D
.&E2%)/-&,/&2,%%$1$7,0)/'
()0)/29/C$72:)7%$'
+,--'
./.0,12*$1)%.&3' 4/,12*$1)%.&3'
56$7&28)7%$'
%,9-$-'
)"#$%&'
,%%$1$7,0)/'
/$&2:)7%$'
;<'=>'
2AAHJK'
2AFGFH'+
@-I'
<'+
@-'
AB'+
@-'
A<'+
'
)"#$%&'
C.-&,/%$'
A'
I'
H'
!""#$%&'#()(*+&,-.)(/"0(11$-.1)
$.20-3+%(3)'4)5
-,-.6+.3(067-0%()
L!'/$&2:)7%$'M'+
,--'N',%%$1$7,0)/'
$.20-3+%(3)'4)5
-,-.68
$296%-.12&.26&%%(#(0&,-.'
L!'4/,12*$1)%.&3
I'M'./.0,12*$1)%.&3
I'O'PI'N',%%$1$7,0)/'N'C.-&,/%$Q'
!""#$%&'#()(*+&,-.)(/"0(11$-.1)
!""#$%&'#()(*+&,-.)(/"0(11$-.1)
$.20-3+%(3)'4)5
-,-.68
$296%-.12&.26&%%(#(0&,-.'
L!'4/,12*$1)%.&3
I'M'./.0,12*$1)%.&3
I'O'PI'N',%%$1$7,0)/'N'C.-&,/%$Q'
Fig
ure
4.7:
The
min
ikbs
for
stat
es1,
2,an
d5
inth
ese
arch
grap
hin
Fig
ure
4.6.
Stat
e1,
the
init
ials
tate
inth
ese
arch
grap
h,co
ntai
nsth
eor
igin
alm
inik
bfo
rth
equ
esti
on.
Stat
e2
cont
ains
the
min
ikb
crea
ted
byel
abor
atin
gth
em
inik
bin
Stat
e1
usin
gth
eM
otio
n-w
ith-c
onst
ant-a
cceleratio
nco
ncep
t.T
his
elab
orat
ion
intr
oduc
edan
equa
tion
toca
lcul
ate
the
acce
lera
tion
ofth
eM
ove
even
t.U
sing
this
equa
tion
,the
acce
lera
tion
ofth
eM
ove
was
com
pute
dto
be14
.45
m/s
2.
Furt
her
elab
orat
ing
the
min
ikb
inSt
ate
2us
ing
theM
otio
n-u
nder-f
orce
conc
ept
resu
lts
inth
em
inik
bin
stat
e5.
Thi
sm
inik
bco
ntai
nsth
eeq
uati
ons
toco
mpu
teth
ene
t-fo
rce
caus
ing
the
Move
tobe
-115
6N
ewto
ns.
64
Question type | Query triple
What is the r of X? | (X r ?)
Is it true that the r of X is Y? | (X r Y)
Is it true that X is greater than Y? | (X <operator> Y)
What is the relationship between X and Y? | (X ? Y)
How many r of X? | (X rhowmany ?)
How many r of X is Y? | (X rhowmany Y)
What is an example of X? | (? instance-of Class)
What is the X? | (X instance-of ?)
Describe X? | (X instance-of Class)
What is the similarity between X and Y? | (X similar-to Y)
What is the difference between X and Y? | (X different-to Y)
Table 4.5: Summary of how each question type is represented as query triples.
I implement the projection operator using a semantic matcher [80, 94, 119].
The matcher determines if a query can be projected onto a state’s minikb by finding
a mapping between the query triple and triples in the state’s minikb.
In our example the goal test is satisfied when the projection operator maps
the query onto the state’s minikb to compute the net-force causing the Move event.
A state in the search graph containing such a minikb is known as a goal state.
Consider the minikbs for states 2 and 5 in Figure 4.7. State 2 fails the goal
test because it does not return an answer to the query – its minikb lacks the necessary
axioms and inference methods to return a value for the net-force of the Exert-
Force event. In contrast, the minikb in state 5 contains the necessary equations
to derive an answer. Thus, state 5 satisfies the goal test and is a goal state, as it
returns the value, i.e., -1156 Newtons, for the net-force causing the Move.
Operators
Operators elaborate a state to produce another. I next describe how operators are
created and applied.
Creating Operators Operators are created from the set of concepts in the knowl-
edge base. Each concept contains declarative assertions with associated inference
methods (e.g., rules for automatic classification) and computation methods (e.g.,
mathematical equations). Concepts in the knowledge base are represented as con-
cept graphs and the search controller uses a semantic matcher [80, 94, 119] to de-
termine how information in a concept relates to a state. A semantic matcher takes
two representations (encoded in a form similar to conceptual graphs [106]) and uses
taxonomic knowledge to find the largest connected subgraph in one representation
that is isomorphic to a subgraph in the other. The output of semantic matching
forms an operator relating two states.
Figure 4.8 shows in bold the common features between state 1 and the
Motion-with-constant-acceleration concept found by the semantic matcher.
In this case, Move matches Motion-with-constant-acceleration because
the former subsumes the latter, and the distance of the Move in the minikb
matches the distance of the Motion-with-constant-acceleration concept.
These common features become Operator A relating state 1 to state 2 in the search
graph shown in Figure 4.6.
Applying Operators Applying an operator to a state creates a successor state.
The successor state includes new information introduced by a concept from which
the operator was created. The new information is merged with the parent state
by joining [106] the representations on their overlapping features found by seman-
tic matching. The successor state contains additional information and inference
methods introduced by the operator’s concept.
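The following sketch shows the shape of this match-then-join step, using sets of triples as a stand-in for concept graphs (the real semantic matcher finds the largest connected isomorphic subgraph under subsumption and renames the concept's instances according to the mapping; here the "match" is simply the shared triples):

    def match(state, concept):
        """Toy stand-in for semantic matching: the overlapping triples."""
        return state & concept

    def apply_operator(state, concept, mapping):
        """Join the concept onto the state over the matched features.
        Because the toy triples are already aligned, the join is a set
        union; the real system first renames instances per the mapping."""
        assert mapping, "an operator exists only if some features overlap"
        return state | concept

    state1 = {("Move1", "distance", "10 m"), ("Move1", "object", "Object1")}
    concept = {("Move1", "distance", "10 m"),
               ("Move1", "acceleration", "?a"),
               ("Move1", "equation", "vf^2 = vi^2 + 2*?a*distance")}

    operator_a = match(state1, concept)     # the common features
    state2 = apply_operator(state1, concept, operator_a)
    print(len(state1), len(operator_a), len(state2))  # 2 1 4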
Figure 4.9 shows the successor state that results from applying Operator A
to state 1 (Figure 4.6). It was created by the match shown in Figure 4.8. This
successor state becomes state 2 in the search graph (Figure 4.6).
Figure 4.8: The semantic matcher identified the matching features (highlighted in bold) between state 1 on the left-hand side and the Motion-with-constant-acceleration concept on the right-hand side. In this case, the result of semantic matching becomes Operator A relating state 1 to state 2 in the search graph shown in Figure 4.6.
Figure 4.9: The elaborated minikb after applying Operator A in the search graph (Figure 4.6). Panel 1 shows the minikb of state 1 and panel 2 shows the Motion-with-constant-acceleration concept. The overlapping features found by semantic matching between the minikb in state 1 and Motion-with-constant-acceleration are highlighted in bold in panels 1 and 2. These overlapping features are joined to form the minikb in panel 3. The new piece of knowledge introduced by the concept Motion-with-constant-acceleration is highlighted in bold in panel 3. This minikb forms State 2 of the search graph (Figure 4.6).
Control Strategy
The question mediator expands the search graph in a breadth-first manner. Breadth-first expansion ensures that the minikb returned by the question mediator contains only concepts necessary to answer the question, avoiding an overly complex minikb that may be inefficient to execute or may yield incoherent explanations.
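A sketch of the resulting control loop (the helper names are hypothetical, mirroring the pseudocode given later in Figures 4.17-4.21; the time bound anticipates the discussion under "Completeness and Termination"):

    from collections import deque
    import time

    def answer_question(question_minikb, kb, goal_test, expand,
                        time_bound=300):
        """Breadth-first search over elaborated minikbs.
        `expand(state, kb)` yields successor minikbs (one per relevant
        concept); `goal_test(state)` returns an answer or None. A sketch of
        the control strategy only; relevance ordering and operator
        filtering are elided."""
        start = time.time()
        frontier = deque([question_minikb])
        while frontier and time.time() - start < time_bound:
            state = frontier.popleft()      # FIFO queue => breadth-first
            answer = goal_test(state)
            if answer is not None:
                return answer, state        # the answering minikb
            frontier.extend(expand(state, kb))
        return None, None                   # no answer within the bound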
4.2.2 Resolving Representational Differences
When knowledge bases and question formulations are built using an expressive on-
tology, the same meaning can be represented differently. Such representational dif-
ferences are especially common when the knowledge bases are built and used by
different groups of users, and for large knowledge bases covering complex domains.
For the question mediator to work well, it has to resolve representational differences
that may exist between question formulations and the knowledge base. Towards this
end, I extend the question mediator to resolve representational differences during
operator creation and in the goal test.
Resolving Differences when Creating Operators
Consider the question:
Question:
A cell has 46 chromosomes. What is the cell?
Question formulation:
Cell? --has-part--> (46 Chromosome)
The assertion that the human cell has 46 chromosomes might be encoded in
the knowledge base as either:
Simpler Representation:
Human-Cell --has-part--> (46 Chromosome)
or
Richer Representation:
Human-Cell --has-part--> Nucleus --has-part--> DNA --has-part--> (46 Chromosome)
With the simpler representation, there are no representational differences
between the question formulation and the knowledge base. Therefore, answering the
question is trivial because the semantic matcher can easily find the correspondence
between the question formulation and the information in the Human-Cell concept.
In contrast, it is necessary to resolve representational differences when the richer
representation is used. This is because semantic matching fails to match:
Question formulation:
Cell? --has-part--> (46 Chromosome)
with
Richer Representation:
Human-Cell --has-part--> Nucleus --has-part--> DNA --has-part--> (46 Chromosome)
To resolve such representational differences, I use a semantic matcher that is
enhanced with a set of transformation rules. These transformation rules resolve rep-
resentational differences when two knowledge structures having the same meaning
are represented differently [120].
Each transformation rule used in the flexible semantic matcher is an instance
of the pattern “transfers through” [69] which has the following form:
C1 --r1--> C2 --r2--> C3   ⇒   C1 --r1--> C3
where Ci is a concept and rj is a relation. A transformation rule is applicable to
a representation if its antecedent subsumes a subgraph of the representation. The
flexible semantic matcher applies a rule by joining [106] the rule’s consequent with
the subgraph that matched the rule’s antecedent. Applying a transformation rule
will replace one relation in an encoding with a sequence of semantically similar
relations. The transformation rules are iteratively applied to transform the input
graphs to include additional triples. The flexible semantic matcher stops and returns
a result when no additional correspondence is found between the graphs or when a
depth bound is reached. This forms the transitive closure of a transitive relation such as has-part.
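A sketch of iterated rule application over triples, using has-part transitivity as the running example (the real matcher matches rule antecedents with subsumption rather than string equality, and handles general r1/r2 pairs):

    def transfer_through(triples, rel1, rel2, depth_bound=5):
        """Iteratively apply  C1 -rel1-> C2 -rel2-> C3  =>  C1 -rel1-> C3.
        With rel1 == rel2 == "has-part" this computes the transitive
        closure of has-part, up to the depth bound."""
        triples = set(triples)
        for _ in range(depth_bound):
            new = {(c1, r1, c3)
                   for (c1, r1, c2) in triples if r1 == rel1
                   for (c2b, r2, c3) in triples if r2 == rel2 and c2b == c2}
            if new <= triples:      # fixpoint: nothing left to add
                break
            triples |= new
        return triples

    rich = {("Human-Cell", "has-part", "Nucleus"),
            ("Nucleus", "has-part", "DNA"),
            ("DNA", "has-part", "46 Chromosome")}
    closed = transfer_through(rich, "has-part", "has-part")
    print(("Human-Cell", "has-part", "46 Chromosome") in closed)  # True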
Figure 4.10 shows the application of transformation rules to include addi-
tional relations between the Human-Cell and Chromosome entities in the
richer representation. The transformed version of the richer representations allows
the semantic matcher to find the correspondence with the question by resolving
the granularity mismatch between the question “How many chromosomes are in
a person’s cell?” and the richer representation. This allows the question to be
answered because the necessary information is found and applied by the question
mediator. Additional details on the list of transformation rules and the flexible
semantic matcher are described in [121].
Resolving Differences in the Goal Test
For every query triple in a question formulation, the basic goal test retrieves the
values for a single slot of a given frame. Consider the question:
Question:
What is the function of lysosomes?
Question formulation:
Lysosome --plays--> Role?
Figure 4.10: The application of a transformation rule to include an additional relation between Human-Cell and Chromosome in the richer representation. This particular rule encodes the transitivity of the has-part relation.
The basic goal test returns a shallow answer describing lysosomes simply
as containers. Ideally, the returned answer would also describe the specific role played by lysosomes as an instrument enclosing digestive enzymes (rendered in italics below).
Basic goal test’s answer
Lysosomes play the role of a container.
Ideal answer
Lysosomes play the role of a container, ..., and are an instrument enclosing digestive enzymes.
The basic goal test may return uninteresting or incorrect answers for two
reasons. First, it is challenging for users unfamiliar with the knowledge base to
determine which frame-slots are relevant to the answer because there can be a variety
of related frame-slots answering a question. Second, additional details expected in
the question may not be returned by the basic goal test if granularity differences
exist between the query triple and the knowledge base being queried. I use two
techniques to improve the basic goal test to return additional details expected in
the question. Both techniques use knowledge represented in the knowledge base to
transform the original query. This has the advantage of being domain and ontology
independent.
The first technique, a type of query expansion[35], is to include related queries
to return additional details expected in the question. Semantically related relations
in the knowledge base are cataloged into a variety of dimensions[2, 8, 66] that
convey similar kinds of information. For example, the structural properties of an
Entity can be retrieved by querying the relations in the Structural dimension.
The question mediator includes additional queries from the same dimension if the
original query is a member of the dimension. Using this technique, the goal test to
answer “What is the function of lysosomes?” is expanded to include related queries
on the Agentive roles, e.g., the instrument and agent relations, of the Lysosome
(see Figure 4.11). This in turn retrieves additional details from the minikb to answer
the question.
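A sketch of this dimension-based expansion, with a small hypothetical catalog of dimensions (the actual CLib dimensions and their member relations are richer):

    # Hypothetical catalog: each dimension groups semantically related
    # relations.
    DIMENSIONS = {
        "Agentive": ["plays", "instrument-of", "agent-of"],
        "Structural": ["has-part", "has-basic-structural-unit", "material"],
    }

    def expand_query(query_triple):
        """If the queried relation belongs to a dimension, also query the
        sibling relations in that dimension."""
        head, rel, val = query_triple
        for relations in DIMENSIONS.values():
            if rel in relations:
                return [(head, r, val) for r in relations]
        return [query_triple]

    print(expand_query(("Lysosome", "plays", "?")))
    # [('Lysosome', 'plays', '?'), ('Lysosome', 'instrument-of', '?'),
    #  ('Lysosome', 'agent-of', '?')]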
The second technique to improve the goal test resolves the granularity differ-
ences that may exist between the question’s query triples and the knowledge base.
Consider the question:
Question:
What are the parts of a cell?
Question formulation:
Cell --has-part--> Entity?
The basic goal test will return the <frame slot value> triples highlighted in bold in Figure 4.12(a). Ideally, I would like to return the additional details shown in
Figure 4.12(b) to describe the partonomy of a Cell.
Figure 4.11: The original set of query triples to answer "What is the function of lysosomes?" is expanded to include related queries on the Agentive roles. This, in turn, retrieves additional details from the knowledge base to answer the question.
ASKME achieves the ideal answer by transforming the original query into
additional finer-grained queries on the knowledge base. It creates the additional
queries by applying the transformation rules of the transfer-thru pattern[69, 121] on
the original query triples. For example,
Question:
What are the parts of a cell?
Question formulation: (with finer-grained query triples)
Cell --has-part--> Entity?
Cell --has-part--> Entity? --has-part--> Entity?
Cell --has-part--> Entity? --has-basic-structural-unit--> Entity?
...
(a) Answer returned when additional queries to retrieve finer-grained information are not used
(b) Answer returned when additional queries to retrieve finer-grained information are used

Figure 4.12: Answer differences when additional queries are used/not used to retrieve finer-grained information.
These additional queries retrieve other finer-grained information to answer
the question.
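A sketch of how such finer-grained query chains might be generated mechanically (a hypothetical helper mirroring the transfer-thru pattern; variable names such as ?x0 stand for the intermediate entities):

    def refine_query(query_triple, chain_relations, depth=2):
        """Unfold (Cell, has-part, ?) into query chains such as
        (Cell, has-part, ?x0), (?x0, has-part, ?x1) up to the given depth,
        mirroring the transfer-thru rules run in reverse."""
        head, rel, val = query_triple
        chains = [[query_triple]]
        for d in range(1, depth + 1):
            for r in chain_relations:
                chain = [(head, rel, "?x0")]
                chain += [("?x%d" % i, r, "?x%d" % (i + 1)) for i in range(d)]
                chains.append(chain)
        return chains

    for chain in refine_query(("Cell", "has-part", "?"),
                              ["has-part", "has-basic-structural-unit"],
                              depth=1):
        print(chain)
    # [('Cell', 'has-part', '?')]
    # [('Cell', 'has-part', '?x0'), ('?x0', 'has-part', '?x1')]
    # [('Cell', 'has-part', '?x0'),
    #  ('?x0', 'has-basic-structural-unit', '?x1')]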
4.2.3 Relevance Reasoning
Because states can be elaborated by potentially many concepts in the knowledge
base, the question mediator may explore a large number of states to find a minikb
answering the question. To keep the question mediator scalable on large knowledge bases, I want it to select only relevant concepts. Toward this end, the system performs relevance reasoning with
a set of heuristics that control operator creation and ordering.
Creating Operators.
When creating operators, the question mediator rejects concepts when:
1. No new knowledge is added
2. Knowledge added is unrelated
3. Inconsistent knowledge is added
Figure 4.13 shows a concept that is rejected because it would add only unrelated knowledge. The left-hand side of Figure 4.13 shows state 1, the original minikb for the sample question in Figure 4.5. The right-hand side shows the Circle concept. In this case, there are no common features between the minikb and
the Circle concept, as the two do not have any features (i.e., terms and relations)
in common.
Figure 4.14 shows an example where a concept is rejected because it adds
inconsistent knowledge to a state. The left-hand side of Figure 4.14 shows state 1, the original minikb for the sample question in Figure 4.5. The right-hand side
Figure 4.13: The semantic matcher did not identify any matching features between State 1 (Figure 4.6) on the left-hand side and the Circle concept on the right-hand side. In this case, no operator is created.
shows the Fall-from-rest concept. The semantic matcher finds the overlap
between these two graphs, as highlighted in bold. In this case, the Move event
in state 1 and its properties, object, initial velocity, final velocity, and distance,
align with Fall-from-rest having similar features. Fall-from-rest is rejected
by the question mediator because it introduces a contradiction in which the initial
velocity of the Move has multiple values, 17 m/s and 0 m/s.
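A sketch of this consistency check for single-valued slots (the real system examines the deductive closure of the elaborated minikb; here only duplicate fillers of assumed cardinality-one slots are detected):

    SINGLE_VALUED = {"initial-velocity", "final-velocity", "mass", "distance"}

    def consistent(minikb):
        """Reject an elaboration if any single-valued slot of an instance
        ends up with two distinct values (e.g. initial-velocity of 17 m/s
        from the question and 0 m/s from Fall-from-rest)."""
        seen = {}
        for (head, rel, val) in minikb:
            if rel in SINGLE_VALUED:
                if seen.setdefault((head, rel), val) != val:
                    return False
        return True

    bad = [("Move1", "initial-velocity", "17 m/s"),
           ("Move1", "initial-velocity", "0 m/s")]   # from Fall-from-rest
    print(consistent(bad))  # False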
Control Strategy
The question mediator orders the application of operators to give priority to concepts
in the knowledge base having:
1. knowledge structures directly affecting the query
2. a high degree of similarity
3. the least number of assumptions added
!"##$%&'($&)*+,-'./)0+,
123)/+,
!"##$%&'($&)*+,
4.45"#$6)#'/4+7,
8."#$6)#'/4+7,
!""#$#%!&'()
*+,)-./0)
9,(:*,
'23)/+,
;4*+"./),
<00#4/"2#),)=>"5'.,)?0&)**4'.*,
1!2(!$34#$'"5670)8)5(5&!$34#$'"5670)9)0:!""#$#%!&'();)<5/6!("#=)
123)/+,
@'6),
-!//)
4.45"#$6)#'/4+7,
8."#$6)#'/4+7,
>?#%63@'%"#) "!A/#/)
'BC#"6)
,D)EF)
G)
9,(:*,
AB,(:*,
A9,(,
'23)/+,
;4*+"./),(#63H'%"#)C,
(4.4$D2,
E+"+),A,
Figure 4.14: Information from Fall-from-rest is added to State 1 of Figure 4.6, resulting in an inconsistency in which the initial velocity of the Move has multiple values, 17 m/s and 0 m/s.
Figure 4.15 shows how operators B and C in Figure 4.6 are ordered based
on their concept’s similarity to State 1. Figures 4.15(a) and 4.15(b) show how state
1 matches Motion-under-force and Two-Dimensional-Move respectively.
The match with Motion-under-force is preferred because a larger portion of
the Motion-under-force concept graph matches state 1.
4.2.4 Block diagram and pseudocode
The block diagram in Figure 4.16 shows the various components of the question
mediator.
1. Search controller - The controller incrementally extends a question with infor-
mation in the knowledge base to compute an answer. It controls the process of
finding and trying different subsets of the knowledge base to answer the ques-
tion. Each subset of the knowledge base is a node in the search graph explored
by the search controller. I call a node in the search graph a “minikb”.
!"#"$%&$'()%*")+(,-"$+(./,
012(+/,
!"#"$%&$'()%*")+(,
3455,
67()/%8")+(, +4&5(5,
"12(+/,
$(/%*")+(,
"12(+/,
!""#$#%!&'()
9..:;+41:(,(<&4#"$,(7.)(55;"$5,
*!)(#+,-'%"#).)/!00)1)!""#$#%!&'()
012(+/,
!"=(,
3455,
2(2&!$,3#$'"2+4)
5(!$,3#$'"2+4)
67()/%8")+(, +4&5(5,
"12(+/,
$(/%*")+(,
>?,@A,
B,
6)/70)
89)/70)
86)/)
"12(+/,
:20+!("#)C,
3;$;%@1,
D/4/(,E,
(a) semantic match output for State 1 and Motion-under-force
!"#$%&'()*&#)+,$-#.(/0#)1(23/
456(13/
!"#$%&'()*&#)+,$-#.(/
!"!#$%&'&()%*+!,-.
/"$%&'&()%*+!,-.
!"!#$%&-&()%*+!,-.
#56(13/
/"$%&-&()%*+!,-.
-&$++)%)0$#*".
'&$++)%)0$#*".
722,&1+5,(/(89+:#)/(;2<(**&#)*/
1!/=/
456(13/
-#.(/
2$33.
!"!#$%&()%*+!,-.
/"$%&()%*+!,-.
4')0,&5*0+). +$63)3.
*78)+,.
"),&9*0+).
:;.<=.
>.
;.2?3.
@A.2?3.
@;.2.
#56(13/
B!3,$"+).>/
'&)&$?5/
@3+3(/A/
(b) semantic match output for State 1 and Two-Dimensional-Move
Figure 4.15: Different degrees of match between state 1 in Figure 4.6 and theMotion-under-force and Two-Dimensional-Move concepts in the knowledgebase. The match with Motion-under-force has a higher degree of match be-cause a larger portion of Motion-under-force matches state 1. Thus, OperatorB is preferred over Operator C in Figure 4.6.
79
!"#$%&'
%()*$(++"$'
,-.'(/
*0/*'
12/"
$34'
5(#+'*"6*'
7/"
$3'
890#):
"$'
2/"$3'
#)6;
"$164''
2/"$3+<6*'
=+"9<>+"'
6"?#)@%'?
#*%&"$'
A(:
"'890#):
"$')(
:"+<6*'
)(:"
'
)(:"
B'%#):
<:#*"'
%()%"0
*'?#00<)C'
)(:"
B'2/
"$3'
?#00<)C'
+<6*'(
D'#)6;
"$#>+"'
2/"$<"6'
)(:"
B''2/
"$3'
EF'
2/"$<"6'
EF'
$"6/+*'
!"#
$%&'
()#*
+,-&.(
G"+"H#)%"'
G"#6()
"$'
)(:"
+<6*'
6($*":
')(
:"+<6*'
EF'I)D"$")%"'8)
C<)"
'
E)(;
+":C"'>#6"'
Fig
ure
4.16
:B
lock
diag
ram
ofth
equ
esti
onm
edia
tor.
80
2. Relevance reasoner - Given a list of nodes, the relevance reasoner reorders the
list to give priority to: (a) shallower nodes (to preserve completeness property
of breadth first search), (b) goal nodes, and (c) nodes that contain the fewest
number of assumptions to minimize the number of abductive inferences (which are unsound).
3. NodeExpander - Given a node, NodeExpander creates its successors. Each
successor node is constructed by joining the node’s minikb with a concept from
the knowledge base. NodeExpander rejects concepts that introduce redundant,
inconsistent, or unrelated information.
4. Goal test - The goal test succeeds when query triples can be projected (via
semantic matching) onto a node’s minikb to compute an answer.
5. ExpandQuery - Given the original query, ExpandQuery consults the knowl-
edge base to find a set of related queries.
6. Flexible semantic matcher - The matcher takes two representations (encoded
in a form similar to conceptual graphs [106]) and uses taxonomic knowledge to
find the largest connected subgraph in one representation that is isomorphic
to a subgraph in the other. The matcher is flexible because it is enhanced
with a set of transformation rules that might apply to both representations to
reduce the number of non-matching features between them[121].
The pseudocode for the question mediator is given in Figures 4.17-4.21.
Completeness and Termination
We know that best first search is complete and terminates if the heuristics are
only used to reorder the queue but not prune it, and when the set of states to be
searched is finite. The basic formulation of the question mediator is complete and
The pseudocode for the question mediator is given in Figures 4.17-4.21. The high-level structure of the algorithm is best-first search. This figure gives a standard definition of best-first search by Nilsson[84]. Figures 4.18, 4.19-4.20, and 4.21 show the instantiation of steps 6, 7, and 9 (respectively) to describe the question mediator.
1. The variables OPEN, CLOSED, QUERY, and KB are global variables.
2. Create a search graph, G, consisting solely of the start node, s. Put s on a list called
OPEN.
3. Create a list called CLOSED that is initially empty.
4. LOOP: if OPEN is empty, exit with failure.
5. Select the first node on OPEN, remove it from OPEN, and put it on CLOSED. Call
this node n.
6. If n is a goal node, exit successfully with the solution. ( GoalState-p(n, QUERY ),
see Algorithm 1)
7. Expand node n, generating the set, M, of its successors that are not ancestors of n.
Install these members of M as successors of n in G. ( ExpandNode(n), see Algorithm
4)
8. Establish a pointer to n from those members of M. Add these members of M to OPEN.
9. Reorder the list OPEN, either according to some arbitrary scheme or according to heuristic merit. (ReorderNodeList(OPEN), see Algorithm 11)

10. Go LOOP
Figure 4.17: The pseudocode for the question mediator is given in Figures 4.17-4.21. The high-level structure of the algorithm is best-first search. This figure gives a standard definition of best-first search by Nilsson[84]. Figures 4.18, 4.19-4.20, and 4.21 show the instantiation of steps 6, 7, and 9 (respectively) to describe the question mediator.
The following pseudocode is for step 6.

Algorithm 1 GoalState-p(node, query) (see Section 4.2.1, "Goal Test and the Goal State", and Section 4.2.2, "Resolving Differences in the Goal Test")
Given: node and query
Return: answerquerylist
  (expand the user's query with ones related to it)
  querylist ← ExpandQuery(query)
  (find the subset of those queries that can be answered using the node's minikb)
  answerquerylist ← GetAnswerableQuery(node, querylist)
  return answerquerylist

Algorithm 2 ExpandQuery(query)
Given: query
Return: querylist
  (this function queries the knowledge base for related queries; its workings are described in Section 4.2.2, "Resolving Differences in the Goal Test")

Algorithm 3 GetAnswerableQuery(node, querylist) (see Section 4.2.1, "Goal Test and the Goal State", and Section 4.2.2, "Resolving Differences in the Goal Test")
Given: node and querylist
Return: answerquerylist, the sublist of querylist that can be answered using node's minikb
  answerquerylist ← empty
  for all query in querylist do
    mapping ← FlexibleSemanticMatch(node, query)
    (query is answered if it can be projected onto node, i.e., non-empty mapping)
    if ¬ empty(mapping) then
      answerquerylist ← add query to answerquerylist
    end if
  end for
  return answerquerylist
Figure 4.18: Pseudocode for step 6 of Figure 4.17
The following pseudocode is for step 7.

Algorithm 4 ExpandNode(node) (see Section 4.2.1, "Control Strategy")
Given: node
Return: nodelist
  operatorlist ← CreateOperators(node)
  for all operator in operatorlist do
    newnode ← ApplyOperator(node, operator)
    nodelist ← add newnode to nodelist
  end for
  return nodelist

Algorithm 5 CreateOperators(node) (see Section 4.2.1, "Creating Operators", and Section 4.2.2, "Resolving Differences when Creating Operators")
Given: node
Return: operatorlist, a list of mappings, each describing how a concept in the KB can be related to node
  operatorlist ← empty
  for all concept in KB (global variable) do
    mapping ← FlexibleSemanticMatch(node, concept)
    if ¬ (IrrelevantMapping-p(mapping) ∨ RedundantMapping-p(mapping, node, concept) ∨ InconsistentMapping-p(mapping, node, concept)) then
      operatorlist ← add mapping to operatorlist
    end if
  end for
  return operatorlist

Algorithm 6 ApplyOperator(node, operator) (see Section 4.2.1, "Applying Operators")
Given: node and operator
Return: newnode
  (operator contains concept and the mapping of how it can be joined to node)
  newnode ← the conceptual-graph join of node and concept according to the mapping given by operator
  return newnode
Figure 4.19: Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (Part 1/2)
Algorithm 7 FlexibleSemanticMatch(Graph1, Graph2)
Given: Graph1 and Graph2
Return: mapping
  (finds the largest connected subgraph match between Graph1 and Graph2, minimizing the non-matching features via transformation rules, as described in Yeh et al.[120])

Algorithm 8 IrrelevantMapping-p(mapping) (see Section 4.2.3, "Creating Operators")
Given: mapping
Return: boolean
  (a mapping is irrelevant if no feature of the concept matches the node's minikb)
  return empty(mapping)

Algorithm 9 RedundantMapping-p(mapping, node, concept) (see Section 4.2.3, "Creating Operators")
Given: mapping and node and concept
Return: boolean
  if mapping describes an isomorphic graph match between node and concept then
    return true
  else
    return false
  end if

Algorithm 10 InconsistentMapping-p(mapping, node, concept) (see Section 4.2.3, "Creating Operators")
Given: mapping and node and concept
Return: boolean
  newnode ← ApplyOperator(node, mapping)
  DC ← the deductive closure of newnode
  if DC is inconsistent then
    return true
  else
    return false
  end if
Figure 4.20: Figures 4.19 and 4.20 are the pseudocode for step 7 of Figure 4.17 (Part 2/2)
The following pseudocode is for step 9.

Algorithm 11 ReorderNodeList(nodelist) (see Section 4.2.3)
Given: nodelist
Return: sortednodelist
  return Sort(nodelist :test 'Node-GT)

Algorithm 12 Node-GT(node1, node2) (see Section 4.2.3, "Control Strategy")
Given: node1 and node2
Return: boolean
  node1parent ← parent(node1)
  node2parent ← parent(node2)
  (prefer node1 if it is shallower than node2)
  if ply(node1) < ply(node2) then
    return true
  (else prefer node1 if it satisfies the goal test)
  else if GoalState-p(node1, QUERY) then
    return true
  (else prefer node1 if it adds fewer triples to its parent than node2 does, because each new triple is an abductive inference, which is unsound)
  else if (size(node1) − size(node1parent)) < (size(node2) − size(node2parent)) then
    return true
  end if
  return false
Figure 4.21: Pseudocode for step 9 of Figure 4.17
terminates because it employs breadth first search to systematically explore a finite
set of concepts in the knowledge base.
The question mediator is intended for use in an interactive environment.
Thus, it has to be responsive and exhibit good performance. Ideally, questions
should be answered within 60 seconds. In this setting, the question mediator cannot
be exhaustive in its search. Therefore, my implementation of the question mediator uses a time bound of 5 minutes.
Also, the question mediator employs heuristics to reorder operators on the
same ply of the search graph to give priority to promising ones. This reordering does
not affect the completeness or termination properties of best first search. The ques-
tion mediator also employs heuristics to reject operators that introduce redundant,
unrelated, or contradictory information into the minikb of a node. These operators
cannot lie on the shortest path to a goal (i.e., a node that answers the user’s ques-
tion). Thus, their removal from the search graph does not affect the completeness
property.
4.3 Related work
This section surveys other work related to my design of ASKME. I begin by dis-
cussing various controlled languages developed for different applications. I then
discuss related work on the domain-neutral ontology. Next, I discuss related work
on question taxonomies. Finally, I discuss related work that is similar to my question
mediator.
4.3.1 Restricted English
Controlled languages have been successfully deployed in industry and have been
shown to be useful in many technical settings (e.g., Caterpillar’s Fundamental
English[112], White’s International Language for Servicing and Maintenance[116],
and Perkins Approved Clear English[96]). The aerospace industry has also adopted
a common controlled language (ASD simplified technical English) in writing air-
craft maintenance documentation to facilitate their use by non-native speakers of
English[1]. The commercial success of controlled languages suggests that people
can indeed learn to work with restricted English. Although the majority of prior
work on controlled languages has been devoted to making text easier for people to
understand, there have also been several ongoing projects with computer process-
able languages to improve the computational processing of a text (e.g., Attempto
Controlled English(ACE)[45], Processable ENGlish(PENG)[103], and GINO[13]).
Recently, several controlled natural languages have been proposed for the
Semantic Web to represent formal statements with natural language, facilitating
their use by people with no background in formal methods[104].
Attempto Controlled English(ACE) is an ontology authoring language,
and is a subset of English designed “to provide domain specialists with an expressive
knowledge representation language that is easy to learn, read, and write”[45]. ACE
is characterized by a small number of construction rules that define its syntax, and
a small number of interpretation rules that disambiguate constructs that in full
English might be less clear. While looking like natural English, it can be translated
automatically and unambiguously into logic and every ACE text has a single and
well-defined formal meaning.
Processable ENGlish(PENG) is a controlled language designed for writ-
ing unambiguous and precise specification texts for knowledge representation[103].
For example, it can be used to annotate web pages with machine-processable in-
formation. While easily understood by speakers of the base language, it has the
same formal properties as an underlying formal logic language and so is machine-
processable. PENG covers a strict subset of standard English, and is precisely
defined by a controlled grammar and lexicon.
With some engineering work, ASKME could replace CPL with either ACE
or PENG. This is possible because both ACE and PENG translate natural language
into OWL representations which can be mechanically converted into representations
in the KM knowledge representation language.
4.3.2 Available Ontologies
There are a variety of resources to provide the domain-neutral ontology of ASKME.
Linguistically motivated ontologies such as WordNet[41], SENSUS[64], and the Gen-
eralized Upper model[11] have been developed to support sophisticated natural lan-
guage processing tasks. These ontologies have good coverage and access by including
the variety of concepts, relations, and modifiers common in everyday text. For each
English word, these ontologies give its senses along with their definitions, parts of
speech, subclasses, superclasses, and sibling classes. Although existing linguistically
influenced ontologies provide good coverage and access, they lack deep semantics,
beyond linguistic information and definitions in free text, which limits their use for
computer programs.
Other available ontologies provide deeper semantics, but they lack the cover-
age and access of linguistically influenced versions. Ontolingua[43] has rich seman-
tics, but only for restricted domains. Verbnet[60] is a cross-lingual, broad coverage
lexicon that incorporates both semantic and syntactic information about its con-
tents. However, its coverage is limited to verbs. Although the Cyc knowledge base
contains a large number of concepts representing commonsense knowledge[69], it is
difficult to use and finding the right concept often takes a long time[90].
4.3.3 Question categories
An important first step in building a question-answering system is to identify the
supported question types and the approaches necessary to answer them. A variety of
taxonomies have been proposed by researchers who have studied question-answering
in the fields of artificial intelligence[68, 101], computational linguistics[54, 113], li-
brary science[34, 92], and education[12, 48, 49, 77].
WH-questions
Who, Which, What, When, Where, Why, and How? is a simple classification of questions, commonly learnt in grade school as the standard
way to construct questions in English. Robinson and Rackstraw[98, 99] provide a thorough investigation of wh- words, the forms of questions based on these words,
and the forms of answers to these questions. Many information retrieval systems
use wh- words as the primary criterion for the analysis and logical representation of
questions[56, 65, 76]. These systems typically classify questions lexically into various
wh-word templates to retrieve answers from a set of documents. The simplicity of
classifying questions by their wh- headword is elegant, but considering only a question's form may lead to inadequate answers. To return a good answer, it is important to anticipate what the asker expects from the question.
QUALM
Other works on question taxonomies classify questions by considering both
the question and their expected answers. This line of work recognizes that answering
a question well requires identifying the question asker’s expectations so as to guide
the retrieval process to return relevant information. The QUALM system developed
a conceptual theory attempting to replicate the process by which humans understand
and answer questions[68]. The question taxonomy proposed in QUALM is based on
a theory of memory representation called Conceptual Dependency[100], in which the
essential meanings of actions are extracted and encoded as a smaller set of semantic
primitives. The questions answered by the QUALM system concerned short stories
of a few simple sentences, made up mostly of facts. QUALM is able to answer a
question about stories by consulting scripted knowledge that is constructed during
Type | QUALM | Graesser-Person | ASKME
verification | Did John do anything to keep Mary from leaving? | Is a fact true? Did an event occur? | Is it true that the nucleus is inside the cell?
disjunction | Was John or Mary here? | Is X, Y, or Z the case? |
concept completion | What did John eat? | Who? What? When? Where? What is the referent of a noun argument slot? | What is the duration of the move?
example | | What is an example label or instance of the category? | What is an example of a prokaryotic cell?
feature specification | What color are John's eyes? | What are the properties of X? | What are the parts of a human cell?
quantification | How many people are here? | How much? How many? | A human cell contains chromosomes. How many chromosomes?
definition | | What is X? | What is meiosis? An entity is inside a cell. What is the entity?
comparison | | How is X similar to Y? | What is the similarity between a plant and an animal?
interpretation | | What does X mean? |
causal antecedent | Why did John go to New York? | Why/how did X occur? |
causal consequence | What happened when John left? | What next? What if? |
goal orientation | For what purpose did John take the book? | Why did an agent do X? |
instrument/procedural | How did John go to New York? | How did an agent do X? |
enablement | What did John need to have in order to leave? | What enabled X to occur? |
expectational | Why didn't John go to New York? | Why didn't X occur? |
judgmental | What should John do to keep Mary from leaving? | What do you think of X? |
request | Would you please pass the salt? | |

Table 4.6: Available question categories in QUALM, Graesser-Person, and ASKME. For convenience, we reproduce verbatim the sample questions that were previously described in QUALM[68] and the Graesser-Person question taxonomy[49].
the story understanding. The taxonomy developed in QUALM comprises thirteen
types of questions (causal antecedent and consequent, goal orientation, enablement,
verification, disjunction, instrument/procedure, concept completion, expectation,
judgment, quantification, feature specification, and request)[68].
Graesser-Person question categories
The question taxonomy developed in QUALM forms the basis of a more spe-
cific taxonomy developed by Graesser-Person[49] (see Table 4.6). Graesser-Person’s
question taxonomy was developed by studying questions posed by individuals in a
variety of real world settings[20]. The taxonomy classifies questions according to
the nature of the information being sought that would constitute a good answer to
the question. For all the categories, Graesser conducted human studies to score the
reliability of the taxonomy, and found it to accommodate virtually all inquiries that
occur in a discourse[49].
Our eight question categories, which were derived empirically by analysing
sample AP exams and working with subject matter experts, map well onto categories
1 through 8 in Graesser-Person’s question taxonomy[49].
4.3.4 Question mediator
The mechanism I have described for the question mediator is similar to model
composition[37, 97] and analogical reasoning[58, 62, 67, 108, 111].
Model composition automatically selects information from a knowledge
base to form a minimal model adequate to perform a particular task [37, 97]. The
goal of model composition is to identify the domain knowledge that is pertinent
to a particular task. In model composition, the knowledge-base consists of model
fragments that are hand-crafted by experts. Each model fragment is conditioned on
the set of assumptions that prescribe which domain objects to include in the model,
what viewpoints to impose on them, and other simplifying information. These model
92
fragments are organized in a coarse- to fine-grained manner whereby coarse-grained fragments are domain independent and finer-grained fragments contain details specific to the domain. Each model fragment is carefully created by the
knowledge engineer and is tagged with pre- and post-conditions to specify how dif-
ferent model fragments can be composed together. By attending to the assumptions
accompanying each model fragment, the system ensures that the model it constructs
to solve a problem is consistent. Model composition facilitates reuse by showing that
different problems can be solved by composing different models; whereas, previous
systems would have required adding new axioms to the knowledge base to handle
different problems. A disadvantage to model composition is that it is very expen-
sive to employ knowledge engineers to craft the domain theory and to train users in
using the models.
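To make the mechanism concrete, the following sketch shows one way assumption-conditioned fragments could be selected and composed. The fragment structure, the greedy forward-chaining loop, and the example library are illustrative assumptions, not the actual machinery of the systems cited in [37, 97], which also enforce minimality and consistency of the composed model.

```python
# Illustrative sketch of assumption-driven model composition (hypothetical
# fragment format; not the actual representation used in [37, 97]).
from dataclasses import dataclass, field

@dataclass
class ModelFragment:
    name: str
    assumptions: set = field(default_factory=set)  # must hold before inclusion
    provides: set = field(default_factory=set)     # facts the fragment contributes

def compose_model(task_facts, library):
    """Forward-chain over the library, adding any fragment whose assumptions
    are satisfied by the task facts plus previously added fragments."""
    model, known = [], set(task_facts)
    changed = True
    while changed:
        changed = False
        for frag in library:
            if frag not in model and frag.assumptions <= known:
                model.append(frag)
                known |= frag.provides
                changed = True
    return model

library = [
    ModelFragment("point-mass", {"object"}, {"mass"}),
    ModelFragment("free-fall", {"mass", "negligible-air-resistance"}, {"acceleration=g"}),
]
print([f.name for f in compose_model({"object", "negligible-air-resistance"}, library)])
# -> ['point-mass', 'free-fall']
```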
Analogical reasoning uses a corpus of previous problem formulations and
their solutions to solve new, unseen problems[58, 62, 67, 108, 111]. A problem is
solved by finding the best example of a previous solution, which is then parameter-
ized, and instantiated with information in the new question for a reasoner to infer an
answer. Unlike model composition, analogical reasoning does not require a domain
theory (i.e., a knowledge base containing first principles). An advantage of analog-
ical reasoning approaches is their ability to solve unseen problems by looking up,
generalizing, and instantiating previous solutions. This allows analogical reasoning
approaches to avoid the expensive and complex task of carefully creating domain
theories (as required in traditional model composition approaches) to answer ques-
tions. However, a disadvantage to analogical reasoning approaches is their inability
to explain answers because they lack knowledge on a domain’s first principles.
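As a rough illustration of the retrieve-and-instantiate step, the sketch below finds the most similar stored problem by token overlap. Real analogical reasoners [58, 62, 67, 108, 111] use structural matching rather than this toy Jaccard similarity, and the case format here is hypothetical.

```python
# Toy case-based retrieval: pick the stored problem most similar to the new
# question, then reuse its parameterized solution (hypothetical case format).
def solve_by_analogy(question_tokens, case_base):
    def jaccard(case):
        stored = case["question_tokens"]
        return len(stored & question_tokens) / len(stored | question_tokens)
    best = max(case_base, key=jaccard)
    return best["solution_template"]  # caller instantiates with new bindings

case_base = [
    {"question_tokens": {"duration", "move", "distance", "speed"},
     "solution_template": "t = d / v"},
    {"question_tokens": {"force", "mass", "acceleration"},
     "solution_template": "F = m * a"},
]
print(solve_by_analogy({"duration", "fall", "height", "speed"}, case_base))  # -> t = d / v
```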
Model composition and analogical reasoning approaches to problem solving are at opposite ends of a spectrum with respect to the knowledge they use. In model composition, the knowledge is a domain theory, carefully created (at high cost) by experts. In analogical reasoning, by contrast, there is no domain theory (i.e., no first principles); the knowledge used for problem solving consists of examples of previously answered questions.
My work on the question mediator aims to realize the advantages of both model composition and analogical reasoning. The question mediator uses domain theories, created independently by SMEs, to answer unseen questions posed by novice users. Given a question formulation, the question mediator finds relevant problem-solving information in the SME-authored domain theory to create an adequate model for the reasoner to infer an answer. My work builds upon model composition in that it uses domain theories created by SMEs (which is highly cost-effective), and it also answers unseen questions using those domain theories (i.e., first principles).
4.4 Summary
This chapter discussed the ASKME approach to helping users interact with unfa-
miliar knowledge bases. The ASKME approach functions like a funnel that distills
a user’s intent, stated in natural language, into a suitable logical form that can
be used for problem solving with the underlying knowledge base. The four major
components of ASKME are: (a) a version of restricted English, (b) a general upper
ontology, (c) a set of well known question types, and (d) a software component called
the question mediator that identifies relevant information in the knowledge base for
problem solving.
The chapter also described an instantiation of the ASKME approach. This prototype is built from several existing technologies. The version of restricted English is called CPL and was developed by Boeing Phantom Works[30]. CPL has been deployed in many industrial and academic systems. ASKME also uses The University of Texas at Austin's Component Library (CLib)[8] as the general upper ontology.
Like CPL, the CLib has been successfully used in a variety of AI applications. The
set of well known question types supported by ASKME is a portion of a question
taxonomy that has been well studied and shown to be comprehensive and useful in
a variety of settings.
Finally, I described the design of the question mediator, the software compo-
nent which finds and applies relevant information in a knowledge base for problem
solving. The question mediator was described in three parts. The first is a search controller that systematically searches the knowledge base for information to answer questions. The second is a set of extensions to the search controller that resolve representational differences that may exist between a user's question and the knowledge base; such differences are common when the knowledge base is built by one group of users and used by another. The third is a set of heuristics for relevance reasoning, which is necessary to achieve good performance and scalability on large knowledge bases.
Chapter 5
Evaluation
5.1 Introduction
Knowledge base systems are brittle: they require users to pose questions in terms
of the ontology of the knowledge base. This requirement places a heavy burden on
the users to become deeply familiar with the contents of the knowledge base. As a
result, traditional knowledge base systems have historically been built and used by
the same knowledge engineers.
Ideally, I would like to develop a knowledge base system where the users and the builders of the knowledge base are two different groups. Achieving this ideal requires mitigating the difficulties faced by novice users in using unfamiliar knowledge bases for problem solving.
I have developed ASKME to progress toward this ideal. With ASKME, users can query a variety of unfamiliar knowledge bases by posing questions in simplified English. ASKME then interprets each question, identifies relevant information in the knowledge base, and produces an answer and explanation.
I evaluated ASKME on the task of answering AP-like exam questions on
the domains of college level biology, chemistry, and physics. The Project Halo team
chose the AP test as an evaluation criterion because it is a widely accepted standard
for testing whether a person has understood the content of a given subject. The
team chose the domains of college-level biology, chemistry, and physics because they
are fundamental, hard sciences and they stress different kinds of representations.
The evaluation consisted of a series of experiments to test whether ASKME achieves this ideal. The first measures ASKME's level of performance under ideal condi-
tions where the knowledge base is built and used by the same knowledge engineers.
Subsequent experiments measure ASKME’s level of performance under increasingly
realistic conditions, where, ultimately, in the final experiment, I measure ASKME’s
level of performance under conditions where the knowledge base is independently
built by subject matter experts and the users of the knowledge base are a different
group of novice users who are unfamiliar with the knowledge base.
The first experiment establishes the level of performance for ASKME operat-
ing under ideal conditions. The conditions are ideal in that: (a) the knowledge bases being queried were built by knowledge engineers, and (b) the questions are formu-
lated by the same people who built the knowledge base. These ideal conditions are
similar to the way knowledge based question-answering systems have traditionally
been used; the people who built the knowledge base were the ones who used it. This
experiment provides a baseline for judging the contribution of ASKME under less
ideal, but typical conditions.
The second experiment presents a brittleness study to provide a fair measure
of ASKME’s ability to answer AP-like questions. As in the first experiment, the
builders and the users of the knowledge are the same group of users. However, this
experiment used a set of unseen AP-like questions. ("AP (Advanced Placement) exams" are nationally administered college entry-level tests in the USA.) The purpose of the brittleness
study is to identify the major failures preventing questions from answering correctly
using ASKME.
The third experiment tests whether ASKME can continue to answer ques-
tions when the original knowledge base is exchanged for different ones that differ
in content and organization. This experiment provides a datapoint to judge the
generality of ASKME at answering questions using different knowledge bases.
The fourth experiment evaluates my conjecture that ASKME is able to an-
swer questions that are formulated by users who do not know the ontology of the
knowledge base to which they are posed. I describe an experiment that was con-
ducted by an independent evaluation team[72] that did not participate in the design
and development of ASKME. In the experiment, users with no training in using the
knowledge base were tasked to pose questions using ASKME to answer AP-like exam
questions. The answers and explanations returned by ASKME were then graded by
independent graders with experience at grading AP exams. This experiment pro-
vides a datapoint to judge whether novice users can successfully pose questions using
ASKME to answer questions with unfamiliar knowledge bases.
The fifth experiment evaluated ASKME’s level of performance when oper-
ating under realistic conditions, where the knowledge bases are independently built
by subject matter experts and are used by a different group of novice users who are
unfamiliar with them. The experiment was conducted by an independent evalua-
tion team[72] that did not participate in the design and development of ASKME. In
this experiment, a group of subject matter experts with little training in knowledge representation and reasoning was tasked to create a set of knowledge bases for the three science domains: biology, chemistry, and physics. After the knowledge bases were built, a different group of novice users was tasked to query the knowledge bases using ASKME to answer a set of AP-like questions. As a control, knowledge
engineers were also tasked to use ASKME to attempt question-answering with the
same knowledge bases. The answers and explanations returned by ASKME were
then graded by independent graders with experience at grading AP exams. This
experiment provides a datapoint to judge ASKME’s performance under realistic
conditions where the builders of the knowledge base are subject matter experts and
the users of the knowledge base are novice users.
5.2 Experiment #1: Establishing ASKME’s performance
under ideal conditions
I established ASKME’s level of performance operating under ideal conditions. The
conditions are ideal in that: (a) the knowledge bases being queried were built by
knowledge engineers; and (b) the questions are formulated by the same people who
built the knowledge base, thereby avoiding the difficulty of using unfamiliar knowl-
edge bases. The evaluation seeks to answer the following questions:
1. Can knowledge engineers pose questions using ASKME to answer a variety of
AP-like questions?
2. How often do questions formulated using ASKME contain relevant information
in the knowledge base for problem solving?
3. Does ASKME’s relevance reasoning improve performance, and if so, does it
work without sacrificing coverage?
The set of knowledge bases used in the evaluation (henceforth referred to as "reference knowledge bases") was created by knowledge engineers working in collaboration with subject matter experts. These knowledge bases are intended to cover a syllabus that consists of approximately 50 pages of a college-level science textbook[16,
21, 47] for each of the science domains: biology, chemistry, and physics[72]. Besides the set of domain knowledge bases, the evaluation also uses a significantly larger, multidomain knowledge base created by concatenating the reference knowledge bases for the three domains.
Reference Knowledge-base | Number of Concepts | Number of Tuples
Biology | 139 | 4789
Chemistry | 475 | 22956
Physics | 65 | 5796
Multidomain | 679 | 33541

Table 5.1: Four knowledge bases were used in the evaluation. The knowledge bases for each science domain were created by knowledge engineers working alongside subject matter experts. The significantly larger multidomain knowledge base was created by concatenating the contents of the individual domain knowledge bases.
The sizes of the reference knowledge bases (measured by the number of concepts and tuples) used in the evaluation are listed in Table 5.1.
The question set used in the evaluation covered a portion of an AP exam
and matched the syllabus of the reference knowledge base. These questions were
authored by teachers with experience teaching the AP curriculum in each domain.
The question set consists of 146 biology questions, 86 chemistry questions, and 131
physics questions[23].
The users participating in the question-answering exercise were the same knowledge engineers who built the knowledge base. They posed questions using ASKME to answer the set of AP-like questions. Altogether, the knowledge engineers
created 279 question formulations in biology, 238 question formulations in chemistry,
and 130 question formulations in physics. The number of question formulations
is larger than the number of questions because certain multiple choice questions
required several ASKME formulations to identify the correct option.
Each question formulation is tagged with appropriate answer snippets to
provide a quick, mechanical approach to test if an answer to a question is correct.
My test harness determines whether an answer is correct by comparing the generated
answer with answer snippets authored by subject matter experts. This enables the
test harness to automatically retry a question, causing ASKME to return different
answers, until the question is correctly answered or a time-bound is reached. The
testing harness automatically grades the answers in the question set by assigning 1
point for a correct answer and 0 points for an incorrect one.
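The following sketch shows how such a test harness might work. The `ask` callable standing in for ASKME, the snippet-matching rule, and the retry protocol are assumptions for illustration, not the harness's actual implementation.

```python
import time

def grade_question(ask, formulation, answer_snippets, time_bound_s=300):
    """Retry a formulation until a returned answer contains one of the
    SME-authored answer snippets or the time bound expires.
    Returns (points, retries): 1 point if correct, 0 otherwise."""
    deadline = time.monotonic() + time_bound_s
    attempt = 0
    while time.monotonic() < deadline:
        answer = ask(formulation, retry=attempt)  # each retry may yield a new answer
        if any(s.lower() in answer.lower() for s in answer_snippets):
            return 1, attempt
        attempt += 1
    return 0, attempt

# Usage with a stand-in answering function:
points, retries = grade_question(
    lambda q, retry: "A human cell contains 46 chromosomes",
    "How many chromosomes does a human cell contain?", ["46"])
print(points, retries)  # -> 1 0
```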
I have also built ablated versions of ASKME (called ASKME-W/O-QM and
ASKME-BFS-QM) to answer two questions: (a) How often do questions formulated
using ASKME address relevant information in the knowledge base for problem solv-
ing? and (b) Does the relevance reasoning performed by ASKME improve perfor-
mance, and, if so, does it work without sacrificing coverage? ASKME-W/O-QM is
a version of ASKME that does not extend a question with relevant information in
the knowledge base for problem solving. It does not contain the question media-
tor component and a question will not answer if formulated without reference to
required facts and axioms in the knowledge base. ASKME-BFS-QM is a version of
ASKME that searches the knowledge base exhaustively for information and reason-
ing methods to extend a question for problem solving. It does not contain a relevance
reasoner to prioritize promising solutions from the knowledge base. Table 5.2 sum-
marizes the differences between the regular version of ASKME, ASKME-W/O-QM,
and ASKME-BFS-QM.
The standard time-bound for the question mediator is 5 minutes. To maximize the number of correctly answered questions for experiment #1, I increased the time-bound to 20 minutes.
I use ASKME-W/O-QM to answer the question, “How often do questions
formulated using ASKME address relevant information in the knowledge base for
problem solving?”. The experiment involved using ASKME-W/O-QM to attempt
the set of questions formulated by the knowledge engineers. I then compared the set
of correctly answered questions between ASKME and ASKME-W/O-QM.
System version | Extends questions for problem solving? | Performs relevance reasoning?
ASKME | Yes | Yes
ASKME-W/O-QM | No | Not applicable
ASKME-BFS-QM | Yes | No

Table 5.2: Summary of the differences between the regular version of ASKME, ASKME-W/O-QM, and ASKME-BFS-QM. Relevance reasoning is not applicable to ASKME-W/O-QM because it does not search the knowledge base for information to extend a question for problem solving.
The comparison showed that the set of questions that answered correctly on both ASKME
and ASKME-W/O-QM were formulated with relevant information for problem solv-
ing, while the set of questions that answered with only ASKME were formulated
without relevant information for problem solving.
I conducted two tests using ASKME-BFS-QM to answer the question on
whether relevance reasoning improved ASKME’s performance. Both tests employ
the set of questions formulated by the knowledge engineers. First, I compared the
number of states explored by ASKME and ASKME-BFS-QM while answering ques-
tions using the individual domain knowledge bases. Second, I compared the number
of states explored by both versions of ASKME using the significantly larger multido-
main knowledge base. I can claim that relevance reasoning improves performance if the number of states explored by ASKME is lower than the number explored by ASKME-BFS-QM.
I answer the question of whether relevance reasoning worked without sacrific-
ing coverage by comparing the set of correctly answered questions between ASKME
and ASKME-BFS-QM. I can claim that relevance reasoning does not sacrifice cov-
erage if the set of correctly answered questions in ASKME subsumes the set of
correctly answered questions using ASKME-BFS-QM.
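Concretely, the coverage criterion reduces to a subset check over question identifiers, along the lines of the following sketch (the question IDs are hypothetical):

```python
# Relevance reasoning sacrifices no coverage iff every question answered by
# exhaustive search (ASKME-BFS-QM) is also answered by regular ASKME.
def coverage_preserved(answered_askme, answered_bfs_qm):
    return answered_bfs_qm <= answered_askme  # set subsumption

print(coverage_preserved({"bio-01", "bio-02", "chem-07"}, {"bio-01", "chem-07"}))  # -> True
```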
5.2.1 Results
The evaluation provides evidence concerning the following questions:
Question: What is ASKME’s level of performance when operating under ideal
conditions?
Domain | Total Questions | Correct | Incorrect | Percentage Score
Biology | 279 | 209 | 70 | 75.0%
Chemistry | 238 | 167 | 71 | 70.0%
Physics | 130 | 121 | 9 | 93.0%

Table 5.3: Correctness scores achieved by knowledge engineers posing questions using ASKME on the reference knowledge base
The correctness scores achieved by knowledge engineers on the question set
are listed in Table 5.3. I am gratified to find that knowledge engineers were successful
at using ASKME to answer a variety of AP-like questions. The results show the
ASKME approach (i.e., formulating questions using a restricted English and a set of
well-known question types) to be sufficient to answer a variety of AP-like questions
in various science domains.
Question: How often do questions formulated using ASKME address relevant in-
formation in the knowledge base for problem solving?
Domain | ASKME-W/O-QM | ASKME | Improvement | Answered with xforms applied
Biology | 145/279 | 204/279 | 41% | 47/279
Chemistry | 124/208 | 174/208 | 40% | 0/208
Physics | 44/130 | 119/130 | 170% | 0/130

Table 5.4: Effect of the question mediator and xform rules on answering questions with the reference knowledge-base
Next, I evaluated the contribution of the question mediator. I compared the correctness scores achieved by ASKME and ASKME-W/O-QM, which does not
extend questions with information from the knowledge base for problem solving. If
a question is formulated using ASKME-W/O-QM, without addressing information
in the knowledge base necessary for problem solving, it will fail to answer as the
problem solver will lack the required axioms, facts, and reasoning methods to infer
the correct answer.
The correctness scores for both ASKME and ASKME-W/O-QM are listed
in Table 5.4. The correctness scores using ASKME-W/O-QM are significantly lower
than the unablated version of ASKME. In biology and chemistry, only about half the questions can be answered using ASKME-W/O-QM. In physics, an even larger portion of the questions fail to answer with ASKME-W/O-QM. I believe this is
because many biology and chemistry questions contain domain specific terminology,
which implicitly addresses specific information in the knowledge base. In contrast,
many physics questions are stated canonically with very general terms and relations
that do not state all the facts and reasoning methods necessary to solve the problem.
The differences in correctness scores between ASKME and ASKME-W/O-
QM show that a significant number of questions that were correctly answered by
ASKME were formulated without regard for the knowledge base ontology. As a re-
sult, these questions do not reference relevant information in the knowledge base for
problem solving. Therefore, to achieve a high level of performance at question an-
swering, it is necessary for ASKME to automatically extend questions with relevant
information in the knowledge base for problem solving.
I also measured the contribution of the transformation rules used by ASKME
to resolve representational differences between questions and the knowledge base.
Column 5 in Table 5.4 reports the number of questions for which the answer im-
proved with the application of transformation rules. I found transformation rules to
be especially useful in answering biology questions. This is because many biology questions query the value of specific slots, and many of these slots are related. Thus, to mitigate the brittleness in question answering caused by users querying the wrong slots, transformation rules can be used to include additional related slots that are likely to answer the user's question.
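As an illustration of the idea (the rule table and KB encoding below are hypothetical, not ASKME's actual transformation-rule language), a slot query can be widened to related slots when the queried slot is empty:

```python
# Hypothetical slot-transformation rules: if a query on one slot finds nothing,
# retry with related slots that often hold the value the user is after.
RELATED_SLOTS = {
    "location": ["is-inside", "site", "destination"],
    "duration": ["elapsed-time"],
}

def query_with_xforms(kb, instance, slot):
    """Return (slot, value) for the first candidate slot with a value, else None."""
    for candidate in [slot] + RELATED_SLOTS.get(slot, []):
        value = kb.get((instance, candidate))
        if value is not None:
            return candidate, value
    return None

kb = {("Nucleus01", "is-inside"): "Cell01"}
print(query_with_xforms(kb, "Nucleus01", "location"))  # -> ('is-inside', 'Cell01')
```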
Question: Does ASKME’s relevance reasoning improve performance, and if so,
does it work without sacrificing coverage?
In order to ascertain ASKME’s performance on a variety of knowledge bases
that differ in content and organization, I attempted the questions using ASKME-
BFS-QM, which extends a question by exhaustively trying all possible subsets of
the knowledge base until the question is answered or a time-bound is reached.
Table 5.5 lists the number of states explored by ASKME and ASKME-BFS-
QM when answering the set of questions posed by knowledge engineers. In gen-
eral, ASKME explored fewer states when compared to ASKME-BFS-QM. Thus,
the heuristics employed by ASKME to reject irrelevant information and give pri-
ority to information that is highly related to the question appear to be useful and
necessary to achieving good performance. Table 5.6 lists ASKME’s runtime per-
formance at answering questions in the three domains. I found a majority of questions to exhibit good runtime performance using the regular version of ASKME.
The correctness scores for both versions of ASKME (with and without rel-
evance reasoning) are listed in Table 5.7. The gold-standard in coverage is the set
of questions answered by ASKME-BFS-QM, which exhaustively searched the entire
knowledge base for facts and reasoning methods relevant to the question for problem
solving. We found that on all knowledge bases the questions answered by ASKME-
BFS-QM were also answered by the regular version of ASKME (i.e., with relevance
reasoning). Additionally, the regular version of ASKME also required fewer retries
(a) Domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 5 | 1 | 2 | 17 | 80
Biology | ASKME-BFS-QM | 12 | 1 | 10 | 17 | 195
Chemistry | ASKME | 8.01 | 1 | 1 | 1 | 104
Chemistry | ASKME-BFS-QM | 28.46 | 1 | 1 | 1 | 495
Physics | ASKME | 4 | 2 | 4 | 10 | 58
Physics | ASKME-BFS-QM | 10 | 6 | 13 | 22 | 69

(b) Multi-domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 5 | 1 | 2 | 16 | 185
Biology | ASKME-BFS-QM | 36 | 1 | 10 | 92 | 704
Chemistry | ASKME | 7.56 | 1 | 1 | 1 | 104
Chemistry | ASKME-BFS-QM | 33.72 | 1 | 1 | 1 | 733
Physics | ASKME | 3 | 2 | 4 | 6 | 15
Physics | ASKME-BFS-QM | 97 | 1 | 232 | 235 | 243

Table 5.5: The average, median, 75th percentile, 90th percentile, and maximum number of states explored by both versions of the system (with and without relevance reasoning) for both the domain knowledge bases and the larger multi-domain knowledge base.
(a) Domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 24.13 | 2.37 | 9.44 | 33.83 | 1200.0
Biology | ASKME-BFS-QM | 113.17 | 1.91 | 20.93 | 173.44 | 1200.0
Chemistry | ASKME | 70.43 | 11.21 | 52.41 | 116.75 | 1200.0
Chemistry | ASKME-BFS-QM | 222.73 | 12.71 | 73.38 | 1200.0 | 1200.0
Physics | ASKME | 30.69 | 8.7 | 19.95 | 57.34 | 1200.0
Physics | ASKME-BFS-QM | 50.06 | 18.4 | 42.12 | 72.88 | 1200.0

(b) Multi-domain knowledge-base

Question Set | System version | avg | median | 75th | 90th | max
Biology | ASKME | 28.59 | 6.12 | 28.54 | 62.78 | 1200.0
Biology | ASKME-BFS-QM | 235.89 | 5.07 | 72.86 | 1200.0 | 1200.0
Chemistry | ASKME | 55.37 | 11.49 | 44.66 | 117.64 | 1200.0
Chemistry | ASKME-BFS-QM | 195.07 | 13.16 | 93.07 | 1200.0 | 1200.0
Physics | ASKME | 53.85 | 26.16 | 54.68 | 101.7 | 1200.0
Physics | ASKME-BFS-QM | 636.27 | 1200.0 | 1200.0 | 1200.0 | 1200.0

Table 5.6: The average, median, 75th percentile, 90th percentile, and maximum runtime (in seconds) of both versions of the system (with and without relevance reasoning) for both the domain knowledge bases and the larger multi-domain knowledge base. In some cases, the heuristics used by ASKME contributed to worse runtime performance, but they were necessary to maximize correctness scores.
(a) Domain knowledge-base

Question set | System version | % correct | Average retries required
Biology | ASKME | 68.51 | 0.23
Biology | ASKME-BFS-QM | 68.18 | 0.27
Chemistry | ASKME | 73.11 | 0.36
Chemistry | ASKME-BFS-QM | 64.71 | 0.23
Physics | ASKME | 73.33 | 0.19
Physics | ASKME-BFS-QM | 70.67 | 0.57

(b) Multi-domain knowledge-base

Question set | System version | % correct | Average retries required
Biology | ASKME | 65.91 | 0.19
Biology | ASKME-BFS-QM | 64.29 | 0.26
Chemistry | ASKME | 71.01 | 0.29
Chemistry | ASKME-BFS-QM | 57.56 | 0.18
Physics | ASKME | 70.67 | 0.16
Physics | ASKME-BFS-QM | 56.00 | 0.54

Table 5.7: The correctness scores for versions of the system with and without relevance reasoning. Both versions achieved similar correctness scores on the domain knowledge bases, indicating that relevance reasoning did not sacrifice correctness. The version without relevance reasoning recorded lower correctness scores when answering physics questions with the significantly larger multi-domain knowledge base. This is due to the large number of states explored during blind search and the evaluation setup aborting an attempt after a time-bound is reached. This result highlights the need for the system to select only the most relevant portions of the knowledge base to reason with.
This indicates that the heuristics used in ASKME do not
sacrifice coverage and, in fact, enable ASKME to find the correct answer in fewer
tries.
I was pleasantly surprised to find that ASKME answers additional questions outside the gold-standard. This is due to the significantly larger number of states explored by ASKME-BFS-QM and my experimental setup aborting an attempt after a time-bound is reached.
The lower correctness scores and the higher number of retries required by ASKME-BFS-QM underscore the need for relevance reasoning to focus the search on the most relevant portions of the knowledge base.
5.3 Experiment #2: Brittleness analysis
To provide a fair measure of ASKME’s ability to answer AP-like questions, I con-
ducted a brittleness study with a set of unseen AP-like questions. The purpose of
the brittleness study was as follows:
1. Establish ASKME’s level of performance at answering unseen AP-like ques-
tions using the reference knowledge base.
2. Identify the major failures preventing unseen questions from correctly answer-
ing using ASKME and the reference knowledge base.
The unseen question set was authored by teachers with experience teaching
the AP curriculum for each domain[72]. The question set consisted of 128 questions
in biology, 131 questions in chemistry, and 100 questions in physics. The ques-
tions covered a portion of an AP exam and matched the syllabus of the reference
knowledge base.
The participants in this exercise were the same knowledge engineers who
built the reference knowledge base. Their task was to use ASKME to answer the
set of AP-like questions with the reference knowledge base.
The answers and explanations returned by ASKME were graded by school
teachers or college professors qualified to teach and evaluate AP courses. The answer
and explanation to each question was graded in an AP style and assigned full (2
points), partial (1 point), or no credit (0 points). For full credit, ASKME produced
the correct answer, or sufficient information to allow a naive reader to unambiguously
select the correct answer from the options provided in the question[72].
5.3.1 Results
Domain | Number of Questions | Answer credit: 0 pt | 1 pt | 2 pt | Correctness score: Sum | Percentage
Biology | 128 | 55 | 18 | 58 | 134 | 52.34%
Chemistry | 131 | 93 | 19 | 19 | 57 | 21.76%
Physics | 100 | 63 | 1 | 36 | 73 | 36.5%

Table 5.8: Correctness scores achieved by knowledge engineers on the unseen question set
The correctness scores achieved by the knowledge engineers on the unseen question set are listed in Table 5.8. The correctness scores achieved in this evaluation provide a gold-standard approximation of ASKME's ability to answer a variety of unseen AP-like questions using the reference knowledge bases.
5.3.2 Failure Categories
I conducted a failure analysis to identify the major sources of failure causing questions to receive no points or only partial credit. The rest of this section considers in detail the sources of failure for each domain.
Source | Affected Questions (70) | Notes
KB gap | 35 | Knowledge required is not in the KB
Unsupported question types | 15 | A general QA failure or the question is beyond the reasoning capabilities of QA, e.g., qualitative reasoning or simulation
Question formulation | 8 | Difficulty in formulating the question (includes missing vocabulary problems and unavoidable fidelity violations)
Explanation | 8 | Weakness of answer presentation
Syllabus | 4 | Question requires knowledge from outside the syllabus

Table 5.9: Failure analysis on why biology questions in the unseen question set fail to answer
Biology
Of the 128 questions in biology, 58 questions received full points and another 18
questions received partial credit. I examined 70 questions for sources of failure
causing questions to receive either no points or only partial credit. Table 5.9 suggests
that the majority of questions that failed to receive full credit did so due to gaps in
the knowledge base. However, there are also a significant number of questions that
failed because of unsupported question types that required reasoning mechanisms
such as qualitative reasoning and simulation. I also found a variety of questions that
are difficult to formulate using restricted English.
Chemistry
Of the 131 questions in chemistry, 19 questions received full credit and another 19
questions received partial credit. I examined 112 questions for sources of failure
causing questions to receive either no points or partial credit. Table 5.10 suggests
that the main problem in chemistry is an incomplete knowledge base.
Source | Affected Questions (112) | Notes
KB gap | 88 | Knowledge required is not in the KB
Unsupported question types | 18 | A general QA failure or the question is beyond the reasoning capabilities of QA, e.g., qualitative reasoning and simulation
Question formulation | 2 | Difficulty in formulating the question (includes missing vocabulary problems and unavoidable fidelity violations)
Explanation | 2 | Weakness of answer presentation
System bug | 1 | Problem with the question-asking interface (the question answers correctly if the Answer button is pressed without first hitting the Enter key)
Syllabus | 1 | Question requires knowledge from outside the syllabus

Table 5.10: Failure analysis on why chemistry questions in the unseen question set fail to answer
Although the reference KB is known to be incomplete, it is possible that some affected questions
might still not answer due to bugs and other shortcomings in ASKME.
Physics
Of the 100 questions in physics, ASKME answered 36 questions correctly and one
question received partial credit. I examined 64 questions for sources of failure caus-
ing questions to receive either no points or partial credit. Table 5.11 suggests that
the main problem is missing problem solvers; ASKME lacked mechanisms to qual-
itatively reason about equations. One question was found to be malformed, with an incoherent description. The other 23 incorrect questions were due to gaps in the knowledge base and bugs in ASKME.
Source | Affected Questions (64) | Notes
KB gap | 16 | The required knowledge does not appear (or is not correct) in the reference KB (mostly questions involving strings under tension)
Lacked problem solver | 40 | A general QA failure or the question is beyond the reasoning capabilities of QA, e.g., qualitative reasoning and simulation
System bug | 7 | A bug in QA (in the CPL post-interpreter) prevents QA from recognizing queries involving coefficients of friction; QA is unable to solve equations with multiple unknowns chained from different concepts; a QA problem with collating equation sets prevented use of appropriate KB content
Bad question | 1 | Incoherent question that does not make sense

Table 5.11: Failure analysis on why physics questions in the unseen question set fail to answer
5.3.3 Discussion
The Halo pilot study is the first phase of a projected multi-phase effort by Vulcan
Inc., whose ultimate goal is the creation of a “Digital Aristotle”, an expert tutor
in a wide variety of subjects[9, 44, 114]. The Halo pilot was a six-month effort
intended to assess the state-of-the-art in question answering with an emphasis on
deep reasoning. The Halo pilot phase demonstrated systems that were collabora-
tively built by different knowledge engineers. The effort was structured around the
challenge of answering a variety of AP Chemistry questions that focused on a por-
tion of the AP syllabus. The answers generated by the systems participating in
the Halo pilot were evaluated by graders with advanced expertise according to the
directives of the Educational Testing Service. Three contending systems were built
by Cycorp Inc., SRI International, and Ontoprise. The competing systems attained scores ranging from 38% to 52%, which is comparable to the mean human scores
in AP Chemistry[9, 44].
There are several differences between my question-answering evaluation on
the unseen question set and the Halo pilot. First, the questions in my evaluation are
posed using natural language and a significant number of the questions are posed
without references to the knowledge base for the reasoner to easily infer an answer.
Whereas in the Halo pilot, questions were formulated as logical expressions that
directly addressed information in the knowledge base for problem solving. Second,
my evaluation covered a non-trivial syllabus comprising 50 pages of a science text
book in three different science domains: biology, chemistry, and physics. In contrast,
the Halo pilot focused only on the chemistry domain. Third, the question set used in
my evaluation (consisting of 128 biology questions, 131 chemistry questions, and 100 physics questions) is significantly larger than the 50 questions used in the Halo pilot.
All things considered, ASKME’s performance on the unseen question set compares
favorably to the results achieved by the systems participating in the Halo pilot.
The failure analysis indicates that a majority of questions that failed to answer in the unseen question set did so because of gaps in the knowledge base or unsupported question types that required qualitative reasoning and simulation. Ad-
dressing these limitations is beyond the scope of this dissertation. The results show
ASKME to be effective at answering a variety of AP-like questions. First, I found
the available question types supported by ASKME to cover the set of AP-like ques-
tions used in the study. Second, the observed correctness scores show the available
question types to produce effective answers and explanations to answer the variety
of AP-like questions.
5.4 Experiment #3: Can questions continue to answer
with different knowledge bases?
Authors of knowledge bases make many modeling decisions during knowledge engi-
neering. Users of knowledge bases make assumptions during question formulation.
The result is a mismatch between questions and the representations required to
answer them. One of my claims is that ASKME can bridge the gap between the
formal representations of knowledge bases captured from domain users and the rea-
soners attempting to use those representations to answer questions. Specifically, I
claim that ASKME finds information to answer questions on different knowledge
bases whose content and organization are not known in advance (and may vary
significantly).
I assess this claim by measuring ASKME’s performance at answering ques-
tions using knowledge bases authored by different users. The set of knowledge bases used in this exercise was independently built by subject matter experts (SME-built) to cover the same syllabus as the reference knowledge base in each of the science domains. These subject matter experts were graduates or post-graduate students in biology, chemistry, and physics. They underwent one week of training and were given five weeks to build and test their knowledge bases. The subject matter experts worked independently, with little interaction with other users and with only occasional help when faced with technical difficulties using the knowledge formulation tool. The sizes of the different SME-built knowledge bases are listed in Table 5.12.
The experiment uses the set of question formulations authored by knowledge
engineers in experiment #1. The set of question formulations used in the experi-
ment consists of 308 question formulations in biology, 238 question formulations in
chemistry, and 150 question formulations in physics. Each question formulation is
tagged with appropriate answer snippets which enables my test harness to determine
whether an answer is correct by comparing the generated answer with the answer snippets.
Domain | Knowledge-base | Created by | # Concepts | # Tuples
Biology | Reference | KE | 139 | 4789
Biology | SME #1 | SME | 126 | 4505
Biology | SME #2 | SME | 226 | 8240
Biology | SME #3 | SME | 167 | 6016
Chemistry | Reference | KE | 475 | 22956
Chemistry | SME #1 | SME | 251 | 11830
Chemistry | SME #2 | SME | 282 | 7589
Chemistry | SME #3 | SME | 420 | 12818
Physics | Reference | KE | 65 | 5796
Physics | SME #1 | SME | 20 | 3734
Physics | SME #2 | SME | 8 | 1058
Physics | SME #3 | SME | 15 | 1461

Table 5.12: The knowledge bases used in the study of ASKME's ability to answer questions using a variety of knowledge bases that differ in content and organization. Aside from the reference knowledge bases, which were created by knowledge engineers working closely with subject matter experts, the other knowledge bases were independently authored by subject matter experts.
In the experiment, the test harness will automatically retry a question,
causing ASKME to return different answers, until the question is correctly answered
or a time-bound is reached.
For each knowledge base, in addition to measuring the number of questions
correctly answered when posed using ASKME, I also use ASKME-W/O-QM to iden-
tify the number of questions that directly referenced information in the knowledge
base for the problem solver to infer the correct answer. The differences in correct-
ness scores between ASKME and ASKME-W/O-QM indicates ASKME’s ability to
bridge the gap between the formal representations of knowledge bases captured from
domain users and the reasoners attempting to use those representations to answer
questions.
Domain | Knowledge base | ASKME-W/O-QM (%) | ASKME (%) | Improvement
Biology | SME #1 | 21.42 | 35.71 | 68%
Biology | SME #2 | 21.1 | 36.69 | 74%
Biology | SME #3 | 23.7 | 44.48 | 88%
Chemistry | SME #1 | 34.02 | 52.52 | 54%
Chemistry | SME #2 | 37.82 | 59.66 | 58%
Chemistry | SME #3 | 13.03 | 29.41 | 126%
Physics | SME #1 | 10.67 | 62.67 | 487%
Physics | SME #2 | 10.0 | 25.33 | 153%
Physics | SME #3 | 4.67 | 29.33 | 528%

Table 5.13: Correctness scores when reference question formulations are attempted on different KBs independently built by subject matter experts.
5.4.1 Results
Table 5.13 lists the correctness scores observed in the experiment. I found that the correctness scores achieved with the different SME-built knowledge bases varied significantly and were lower than the correctness scores achieved with the reference knowledge bases reported in Table 5.4. This was expected because the SME-built knowledge bases varied significantly in size and had less coverage than the reference knowledge bases authored by knowledge engineers.
I found ASKME to achieve significantly better scores than ASKME-W/O-
QM in all domains and knowledge bases. This suggests that ASKME is capable of
locating relevant information in a variety of knowledge bases that differ in content
and organization, to extend a question for problem solving. The techniques to
include related and finer-grained queries also improved answers to biology questions.
Thus, the results from the experiment support the claim that ASKME is useful in
finding information to answer questions on different knowledge bases.
5.5 Experiment #4: Can users pose questions using
ASKME to successfully query knowledge bases with
which they are unfamiliar?
I next test whether novice users can use ASKME to successfully interact with un-
familiar knowledge bases. This evaluation was conducted by an independent evalu-
ation team[72] that did not participate in the design and development of ASKME.
In this evaluation I am interested in answering the following questions:
1. How do novice users perform in the evaluation?
2. How does the performance of novice users compare with that of knowledge engineers?
3. Are there significant differences in their levels of performance across domains?
4. Is there any evidence of difficulties faced by novice users in using ASKME to
interact with knowledge bases with which they are unfamiliar?
The evaluation uses the same set of reference knowledge bases that were
created in experiment #1. The question set used in the evaluation is a subset of
the unseen question set that was first introduced in experiment #2. The choice of
questions by the external evaluator “was not random, nor were questions selected
to completely and evenly test the contents of the reference knowledge bases”[72].
Rather, the selection favored questions that were at least partially answerable by
ASKME using the reference knowledge base. The selection was also “calibrated to
avoid having too many questions that were easily answerable (ceiling effect) or too
many questions that were unanswerable (floor effect)”[72]. Altogether, 50 questions
were selected to form the question set used in this evaluation. Additional details on the question selection process are described in [72].
Nine undergraduates were recruited to participate in this evaluation: three
users in biology, three users in chemistry (one of whom did not finish and eventually dropped out of the evaluation), and three users in physics. These under-
graduates had little experience in knowledge representation and had no familiarity
with the knowledge base being queried. They underwent a four-hour training ses-
sion on the basics of using ASKME to pose and receive answers to AP-like questions
using simplified English[72]. After completing the training, the participants were
given 35 hours (over a three week period) to formulate questions in the evaluation
question set[72]. These novice users worked independently and did not discuss their
efforts with other users. They were also not allowed to solicit help except when
faced with technical problems related to the software.
The answers and explanations returned by ASKME were graded by two
independent graders on a scale of 0 to 2 points[72]. Each grader was a high school
AP teacher with at least three years of experience. Graders were asked to assign
full (2 points), partial (1 point), or no credit (0 points). The criteria for full credit
was that ASKME “produced the correct answer, or sufficient information to allow a
naive reader to unambiguously select the correct answer from the options provided
in the question”[72]. In cases where graders differed significantly, “a third grader
examined the question and graded the answer, otherwise the scores assigned by both
graders were averaged”[72].
5.5.1 Results
Question: How do novice users perform in the evaluation?
Table 5.14 lists the correctness scores observed in the evaluation, and Figure 5.1 shows the distribution of answer credit scores for both the novice users and the knowledge engineers.
!"#$%&'(&%)*+'
,''-.%#/0"#'
1'
2'
31'
32'
41'
42'
51'
52'
1'6+' 172'6+' 3'6+' 372'6+' 4'6+'
89'
:*0;0<=>?#%&>3'
:*0;0<=>?#%&>4'
:*0;0<=>?#%&>5'
(a) biology
!"#$%&'(&%)*+'
,'
-'
.,'
.-'
/,'
/-'
,'0+' ,1-'0+' .'0+' .1-'0+' /'0+'
23'
45%6*#+&789#%&8.'
45%6*#+&789#%&8/'
:'';<%#=>"#'
(b) chemistry
!"#$%&'(&%)*+'
,'
-'
.,'
.-'
/,'
/-'
0,'
0-'
,'1+' ,2-'1+' .'1+' .2-'1+' /'1+'
34'
567#*(#89#%&8.'
567#*(#89#%&8/'
567#*(#89#%&80'
:'';<%#=>"#'
(c) physics
Figure 5.1: Answer credit distributions among novice users and the knowledge en-gineer in experiment #4
Biology domain: KE 76.0%, Biology-User-1 74.5%, Biology-User-2 79.5%, Biology-User-3 71.5%
Chemistry domain: KE 42.0%, Chemistry-User-1 50.5%, Chemistry-User-2 48.5%, Chemistry-User-3 N.A.
Physics domain: KE 69.5%, Physics-User-1 33.0%, Physics-User-2 34.5%, Physics-User-3 52.5%

Table 5.14: Answer credit scores observed in experiment #4's question-answering evaluation
The results show that novice users can achieve scores that are comparable to those achieved by the knowledge engineers in the biology and chemistry domains. This result suggests that novice users with four hours of training in using ASKME can effectively use the system to interact with unfamiliar knowledge bases.
I was pleasantly surprised to find instances in biology and chemistry where
novice users achieved results that were better than those achieved by the knowledge
engineers who built the knowledge base. We believe that “instances where biology
and chemistry novice users achieve better scores than knowledge engineers was be-
cause they had more time and were instructed to try as many question formulations
as they needed to, in order to elicit from the system answers and explanations that
addressed the original English question”[72]. Novice users were also told that if
ASKME could not answer a question correctly, they should ask simpler questions
and try to probe the extent of ASKME’s knowledge about the topic. In contrast,
the knowledge engineers’ attempts at question formulation focused on achieving a
high fidelity with the original questions (i.e., generic and faithful), and “did not nec-
essarily spend much effort “backing off” to ask increasingly simple questions about
the topic”[72].
The phenomenon of novice users achieving better results than knowledge engineers was not detected in physics because question answering there depends more on rules and reasoning than on facts. There were few questions in the evaluation question set that could be answered with partial answers, so novice users were not able
to elicit significant amounts of useful information using simple queries. Questions
in physics also required deep domain knowledge to fill in necessary assumptions
expected by the domain models in the knowledge base.
The results show ASKME to be usable by novice users to interact with
unfamiliar knowledge bases for the task of answering AP-like exam questions.
Question: Are there significant differences in performance among novice users and
the knowledge engineer?
I answer this question by testing if there are significant differences in the
answer credit scores among novice users and the knowledge engineers.
Biology: An ANOVA (analysis of variance) found no significant differences in the distribution of answer credit scores among the three novice users and the knowledge engineer (F(3, 196) = 0.53, p > 0.10). By analysing the CI (confidence interval) for the difference between the mean of the novice users and the mean of the knowledge engineer, I found with 95% certainty
that the knowledge engineer has no more than 12.3% advantage at employing the
knowledge base for answering biology questions. This result shows that deep famil-
iarity with the knowledge base is not a critical advantage in achieving good results
for the biology domain, and that novice users are capable of achieving comparable
results to knowledge engineers.
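For readers who want to reproduce this style of analysis, the following sketch shows one way the ANOVA and the confidence-interval bound could be computed with scipy. The per-question credit scores are randomly generated placeholders (the real data are not reproduced here), and the normal-approximation CI is my assumption, since the exact CI construction is not specified above.

```python
# Sketch of the biology significance test: a one-way ANOVA over per-question
# answer credits (50 questions x 4 users -> F(3, 196)), plus a 95% upper bound
# on the KE-vs-novice mean advantage. Placeholder data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ke, u1, u2, u3 = (rng.choice([0, 1, 2], size=50) for _ in range(4))

f_stat, p_val = stats.f_oneway(ke, u1, u2, u3)

novices = np.concatenate([u1, u2, u3])
diff = ke.mean() - novices.mean()
se = np.sqrt(ke.var(ddof=1) / ke.size + novices.var(ddof=1) / novices.size)
upper = diff + 1.96 * se  # normal-approximation 95% upper bound on the advantage
print(f"F={f_stat:.2f}, p={p_val:.3f}, KE advantage <= {100 * upper / 2:.1f}%")
# /2 converts credit points (max 2 per question) into a percentage.
```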
Chemistry: An ANOVA found no significant differences in the distribution of answer credit scores between the two novice users and the knowledge engineer (F(2, 147) = 0.29, p > 0.10). By analysing the CI for the difference between the mean of the novice users and the mean of the knowledge engineer, I found with 95% certainty that the novice users have no more than 19.7% advantage at employing the knowledge base
for answering chemistry questions. This result shows that deep familiarity with the
knowledge base is not a critical advantage in achieving good results for the chemistry
domain, and that novice users are capable of achieving results that surpass those
achieved by the knowledge engineers.
Physics: An ANOVA found significant differences in the distribution of answer credit scores among the three novice users and the knowledge engineer (F(3, 196) = 8.02, p < 0.05). A priori contrast tests (Bonferroni-corrected T-tests) show the
distributions to be significantly different between the following pairs of users:
KE > Physics-User-1, (t(98) = 4.31, p < 0.05)
KE > Physics-User-2, (t(98) = 4.26, p < 0.05)
There is no significant difference between the knowledge engineer and Physics-User-
3, or among Physics-User-1, Physics-User-2, and Physics-User-3. The significant differences in the correctness scores between the knowledge engineer and Physics-User-1 and Physics-User-2 suggest that deep familiarity with the domain or the physics knowledge base is necessary to achieve good results. This is because many
questions in the physics domain required the user to specify key facts about the
problem (e.g., making explicit the acceleration of a move when there is negligible
air resistance) that might not be obvious to someone not trained in physics or
someone who is unfamiliar with the knowledge base being queried.
Question: Do novice users use ASKME equally well across domains?
An ANOVA found significant differences in the distribution of answer credit scores across the three domains (F(2, 147) = 14.94, p < 0.05). Contrast tests (Bonferroni-corrected T-tests) show that the distributions are significantly different be-
tween the biology and chemistry domains (t(98) = 3.89, p < 0.05), and between
the biology and physics domains (t(98) = 6.10, p < 0.05). There is no significant
difference between the chemistry and physics domains (t(98) = 1.28, p > 0.1).
I believe the differences between biology and chemistry and between biology and physics are due to the nature of biology questions, which rely heavily on querying facts and less on performing complex reasoning. These questions involve simple
inference and often return sufficient information to satisfy partial or full credit. Ad-
ditionally, I have also observed users posing follow-up questions to elicit additional
details to further improve their answers. This is in contrast to physics questions,
where a majority of questions involved equation solving. Such questions are brittle
in the sense that ASKME requires all facts expected by the applicable equation to
be explicitly stated to compute an answer; otherwise, no solution is returned. This explanation accounts for why few physics questions received partial credit and why the differences in correctness scores between the novice users and the knowledge engineer are so large for the physics domain.
Question: Were there noticeable difficulties faced by novice users in using ASKME?
How can the individual novice users achieve better results?
I first examined the distribution of answer credit scores among the different
users. The questions answered by the novice users varied in distribution. There
were many instances where a question was answered by a particular user but not
others. The results suggest that users could potentially achieve higher levels of
performance if they cooperated on the question-answering task (e.g., by sharing
information on best practices). For example, if I aggregated the highest scores for
each question achieved among the three biology users, the resulting score is 93.5%,
which is significantly higher than the scores achieved by individual novice users, the
average achieved by the novice users, or the knowledge engineer (see Table 5.15).
Calculation method | Biology | Chemistry | Physics
aggregated (union) | 93.5 | 57.5 | 64.5
averaged | 75.17 | 49.5 | 40.0

Table 5.15: Aggregated and average correctness scores achieved by novice users in experiment #4
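The aggregated score in Table 5.15 can be computed by taking, for each question, the best credit any user earned. The sketch below shows the calculation on placeholder credits (0, 1, or 2 points per question), not the actual evaluation data.

```python
# Aggregated (union) vs. averaged correctness over per-user credit lists
# (placeholder data; each question is worth a maximum of 2 points).
def aggregated_and_averaged(per_user_credits):
    n = len(per_user_credits[0])
    best = [max(user[q] for user in per_user_credits) for q in range(n)]
    aggregated = 100 * sum(best) / (2 * n)
    averaged = sum(100 * sum(u) / (2 * n) for u in per_user_credits) / len(per_user_credits)
    return aggregated, averaged

print(aggregated_and_averaged([[2, 0, 1], [0, 2, 1], [1, 1, 2]]))  # -> (100.0, 55.55...)
```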
User (ranked by score) | Score | #Formulations (Sum / Avg / Median) | #Minutes (Sum / Avg / Median)
Bio-User-2 | 79.5% | 870 / 17.4 / 15 | 678 / 13.56 / 8
Bio-User-1 | 74.5% | 453 / 9.06 / 7 | 253 / 5.06 / 4
Bio-User-3 | 71.5% | 311 / 6.22 / 4 | 1013 / 20.3 / 7.5
Chem-User-1 | 50.5% | 576 / 11.52 / 11 | 950 / 19.0 / 13.5
Chem-User-2 | 48.5% | 504 / 10.08 / 6.5 | 754 / 15.08 / 4.5
Phys-User-3 | 52.5% | 884 / 17.68 / 13.5 | 901 / 18.02 / 14
Phys-User-2 | 34.5% | 1121 / 22.42 / 19 | 1610 / 32.2 / 24
Phys-User-1 | 33.0% | 787 / 15.74 / 15 | 1044 / 20.08 / 19.5

Table 5.16: Data on the number of formulations and time spent per question by novice users in experiment #4
Although users were able to create a correct formulation of the question in
many cases, they were rarely able to do so in their first attempt. I next examine
the amount of effort spent by users at getting questions to answer using ASKME.
Besides correctness scores, I have found four other metrics useful in quantifying the
difficulties faced by the user: the mean number of question attempts; the median
number of question attempts; the time spent by the user formulating a question
before getting a satisfactory answer; and the time spent by the user formulating a
question before giving up. Table 5.16 shows these statistics for the different novice
users participating in the evaluation.
Among the three domains, it appears that physics users spent significantly more effort formulating questions that answered when compared to biology and chemistry users. One explanation is that a majority of physics questions were complex "story problems" which required more effort to rephrase using restricted English. Another explanation is that physics questions often contain implicit informa-
tion and because ASKME cannot automatically identify and install assumptions,
users have to invest additional effort to manually identify and state implicit facts
during question formulation.
I also investigated whether there were significant differences in the number
of attempts and time spent per question among users in each domain.
Biology: An ANOVA found significant differences in the number of formulations attempted (F(2, 147) = 22.05, p < 0.05). A priori contrast tests (Bonferroni-corrected T-tests) show the distributions for formulations attempted are significantly
different for the following pairs of users:
Biology-User-2 > Biology-User-1, (t(98) = 4.31, p < 0.05)
Biology-User-2 > Biology-User-3, (t(98) = 5.87, p < 0.05)
An ANOVA found significant differences in time spent per question among the three novice users in biology (F(2, 147) = 6.61, p < 0.05). A priori contrast tests
(Bonferroni corrected T-tests) show the distributions for time spent per question
are significantly different for the following pairs of users:
Biology-User-2 > Biology-User-1, (t(98) = 4.29, p < 0.05)
Biology-User-3 > Biology-User-1, (t(98) = 3.18, p < 0.05)
At face value, I can claim that additional efforts by biology users in terms of the
number of formulations and time spent per question translate into higher correctness
scores. However, an earlier ANOVA did not find the differences in the correctness
scores among the novice users to be significant. Thus, I am gratified to report that, even though there were significant differences in the amount of effort put forth by the biology users, the resulting correctness scores were similar to one another and comparable to the knowledge engineer's. This result suggests that ASKME can be used effectively
by novice users to query unfamiliar biology knowledge bases after only four hours
of training in using the system.
Chemistry: A T-Test found no significant differences in the number of formulations
attempted between the two novice users in chemistry (t(98) = 0.78, p > 0.1). A T-
Test also found no significant differences in time spent per question between the
two novice users (t(98) = 1.01, p > 0.1). An earlier ANOVA in experiment #4
found no significant differences in the distribution of answer credit scores between
the novice users and the knowledge engineer in the chemistry domain. Since there are no significant differences in the number of formulations attempted, the time spent per question, or the achieved answer credit scores, I conclude that ASKME can be used effectively by novice users to query unfamiliar chemistry knowledge bases after four hours of training in using the system.
Physics: An ANOVA found significant differences in the number of formulations
attempted among the three novice users in physics (F (2, 147) = 3.66, p < 0.05).
An ANOVA also found significant differences in time spent per question among the
three novice users (F (2, 147) = 8.01, p < 0.05). A priori contrast tests (Bonferroni
corrected T-tests) show the average number of formulations attempted to be
significantly different between:
Physics-User-2 > Physics-User-1, (t(98) = 2.65, p < 0.05)
A priori contrast tests (Bonferroni corrected T-tests) show that the average time
spent per question is significantly different between the following pairs of users:
Physics-User-2 > Physics-User-1, (t(98) = 2.91, p < 0.05)
Physics-User-2 > Physics-User-3, (t(98) = 3.53, p < 0.05)
In biology and chemistry, I found that users who spent more effort trying different
question formulations achieved better scores. Oddly enough, this was not the case
in physics. The top physics user achieved significantly better scores with fewer
formulations. In my efforts to understand why, I found that the top performing
physics user had a deep curiosity about the internals of the reference knowledge
base and ASKME. Through his experimentation, he quickly became familiar with
the reference knowledge base and gained a good understanding of how ASKME
worked. I believe his advanced expertise at using ASKME and the reference
knowledge base for problem solving accounts for his level of performance, which far
exceeded that of his peers.
5.6 Experiment #5: Establishing ASKME’s performance
under production conditions
My final evaluation tests ASKME in a setup that mimics the conditions of the
ideal knowledge base system described in Chapter 1. The conditions mimic a
production system because: (a) the knowledge bases were built independently by
subject matter experts without any help from knowledge engineers; and (b) the
questions are formulated by a different group of novice users who possess limited
domain expertise and are unfamiliar with the contents of the knowledge base being
queried. These conditions create a substantial challenge for automated question
answering systems: to be able to answer questions that are formulated without
regard for (or even familiarity with) the knowledge base that is expected to answer
them. This evaluation was conducted by the same independent evaluation team[72]
who were involved in experiment #4. In this evaluation I am interested in answering
the following questions:
1. What is the performance of novice users and knowledge engineers at using
ASKME to query unfamiliar knowledge bases?
2. Are there significant differences between novice users and knowledge engineers?
3. Are there any significant differences in their levels of performance across do-
mains?
As I discuss the results of this evaluation, I will also relate the results to
conclusions from the previous evaluation of whether novice ASKME users can suc-
cessfully query reference knowledge bases with which they are unfamiliar.
The knowledge bases used in this evaluation were created independently
by subject matter experts without any help from knowledge engineers[72]. They are
intended to cover the same syllabus as the reference knowledge bases in each of the
science domains. These subject matter experts were graduates or post-graduate stu-
dents in biology, chemistry, and physics. They underwent one week of training and
were given five weeks to build and test their knowledge bases. The subject matter
experts worked independently and had little interaction with other users. They
received “only occasional help when faced with technical difficulties using the
knowledge formulation tool”[72].
The evaluation uses the same set of questions from experiment #4. Each
domain was assigned two novice users to attempt question-answering. The novice
users underwent a four-hour training session on the basics of using ASKME to
pose questions and receive answers to AP-like questions using simplified English[72].
After completing the training, the participants were given 35 hours (over a three
week period) to formulate questions in the evaluation question set[72]. These novice
users worked independently and did not discuss their efforts with other users. They
were only allowed to solicit help when faced with technical problems related to the
software[72].
As a control, knowledge engineers were also tasked to perform question-
answering on the independently built knowledge bases. These knowledge engineers
were unfamiliar with the knowledge base being queried; however, they enjoyed a
number of advantages over the novice users. First, the knowledge engineers are trained in
knowledge representation and reasoning. Second, the knowledge engineers, having
designed and implemented ASKME, are deeply familiar with using it for question-
answering. Third, the knowledge engineers cooperated in the question-answering
task.
5.6.1 Results
Question: What is the performance of novice users and knowledge engineers at
using ASKME to query unfamiliar knowledge bases?
Biology domain           Chemistry domain           Physics domain
KE               34.5%   KE                 46.0%   KE               54.0%
Biology-User-A   54.0%   Chemistry-User-A   34.5%   Physics-User-A   13.5%
Biology-User-B   47.5%   Chemistry-User-B   55.0%   Physics-User-B   18.0%

Table 5.17: Answer credit scores achieved by knowledge engineers and novice users
on knowledge bases that are independently authored by subject matter experts.
Table 5.17 lists the correctness scores observed in the experiment and Figure
5.2 shows the distribution of answer credit scores among novice users and the knowl-
edge engineers. The results continue to show ASKME’s ability to help novice users
to query unfamiliar knowledge bases. The biology result is surprising, as it appears
to show that novice users can use ASKME to achieve better results than knowledge
engineers. One explanation for this surprising result is that the knowledge engineers
worked quickly to finish the question-answering task, while the novice users were
asked to spend more time exhaustively trying a variety of question formulations to
maximize their scores.

[Figure 5.2: Score distribution among novice users and the knowledge engineer in
experiment #5. Panels (a) biology, (b) chemistry, and (c) physics plot the number
of questions at each answer credit level (0 pt to 2.0 pt) for the knowledge engineer
and the two novice users in that domain.]
Question: Are there significant differences in performance among novice users and
the knowledge engineer?
I answer this question by testing if there are significant differences in the
answer credit scores among novice users and the knowledge engineers.
Biology: An ANOVA found marginally significant differences in performance among
novice users and the knowledge engineers for the biology domain (F (2, 147) =
2.95, 0.05 < p < 0.10). By analyzing the confidence interval for the difference
between the mean of the novice users and the mean of the knowledge engineer, I
found with 95% certainty that the novice users have no more than a 29.1% advantage
at employing the knowledge base for answering biology questions.
Chemistry: An ANOVA also found marginally significant differences in perfor-
mance among novice users and the knowledge engineer for the chemistry domain
(F (2, 147) = 2.48, 0.05 < p < 0.10). By analyzing the confidence interval for the
difference between the mean of the novice users and the mean of the knowledge
engineer, I found with 95% certainty that the knowledge engineer has no more than
a 16.39% advantage at employing the knowledge base for answering chemistry questions.
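A minimal sketch of the confidence interval behind these advantage bounds, assuming
per-question credit scores are available for each group (the scores below are synthetic,
and the evaluation's exact CI procedure may have differed):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    novice = rng.uniform(0.0, 2.0, 100)  # pooled credits of the two novice users
    ke = rng.uniform(0.0, 2.0, 50)       # knowledge engineer's credits

    diff = novice.mean() - ke.mean()
    se = np.sqrt(novice.var(ddof=1) / len(novice) + ke.var(ddof=1) / len(ke))
    df = len(novice) + len(ke) - 2       # simple pooled-df approximation
    half = stats.t.ppf(0.975, df) * se
    print(f"mean difference = {diff:.3f}, "
          f"95% CI = [{diff - half:.3f}, {diff + half:.3f}]")

The upper endpoint of such an interval yields the “no more than X% advantage”
statements above.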
Physics: An ANOVA found significant differences between the knowledge engineer
and the novice users for the physics domain (F (2, 147) = 18.15, p < 0.05). This
result is consistent with my earlier finding that familiarity with the knowledge base
is necessary to achieve good results in physics. A priori contrast tests (Bonferroni
corrected T-tests) show the distributions to be significantly different for the following
pairs of users:
KE > Physics-User-A, (t(98) = 5.34, p < 0.05)
KE > Physics-User-B, (t(98) = 4.50, p < 0.05)
Question: Are there any significant differences in their levels of performance across
domains?
In the experiment, an ANOVA found significant differences in the distribution
of answer credit scores across the three domains (F (2, 147) = 16.08, p < 0.05). Con-
trast tests (Bonferroni corrected T-tests) show that the distributions are significantly
different between biology and physics domains (t(98) = 4.53, p < 0.05), and between
the chemistry and physics domains (t(98) = 5.96, p < 0.05). There is no significant
difference between the biology and chemistry domains (t(98) = 0.81, p > 0.1). This
result is consistent with my earlier result in experiment #4 and supports my obser-
vation that biology is an easier domain, with many opportunities for partial credits,
and that novice users can successfully use ASKME to query biology knowledge bases.
Question: Were there noticeable difficulties faced by novice users in using ASKME?
How can the individual novice users achieve better results?
Calculation method    Biology   Chemistry   Physics
aggregated (union)    67.5      60.5        28.5
averaged              50.75     44.75       15.75

Table 5.18: Aggregated and average correctness scores achieved by novice users in
experiment #5.
Results from the evaluation provide further anecdotal evidence that novice
users could have achieved significantly better results if they had cooperated, or
were given more time for the question-answering task. The questions answered
by the novice users varied in distribution and there were many instances where a
question was answered by a particular user but not others (see Table 5.18). I found
the aggregated scores, calculated by picking the highest achieved score for each
question, to be significantly higher than the scores achieved by individual users
and their average. Thus, I believe if the users were to share information on best
practices, they would be likely to achieve even better results.
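The two calculation methods in Table 5.18 are easy to state precisely. In this
hypothetical two-user sketch, aggregation takes the best credit either user earned
on each question, so it can never fall below the average of the users' totals:

    # Hypothetical per-question credits for two novice users on four questions.
    user_a = [2.0, 0.0, 1.5, 0.0]
    user_b = [0.0, 2.0, 1.0, 0.5]

    aggregated = sum(max(a, b) for a, b in zip(user_a, user_b))  # union score
    averaged = (sum(user_a) + sum(user_b)) / 2
    print("aggregated:", aggregated, "averaged:", averaged)  # 6.0 vs 3.5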
5.7 Summary
I assessed the performance of ASKME on the task of answering questions like those
found in the AP exam. The question-answering task involved users posing questions
to the system to retrieve solutions from the knowledge base to answer a set of AP-like
questions. The evaluation consists of successive experiments to test if ASKME can
help novice users to use unfamiliar knowledge bases for problem solving. The first
experiment measured ASKME’s level of performance under ideal conditions, where
the knowledge base is built and used by the same knowledge engineers. Subsequent
experiments measured ASKME’s level of performance under increasingly realistic
conditions. In the final experiment, I measured ASKME’s level of performance
under conditions where the knowledge base is independently built by subject matter
experts and its users are a different group of novice users unfamiliar with the
knowledge base. Results from the evaluation show that ASKME works well on
different knowledge bases and answers a broad range of questions that were posed
by novice users in a variety of domains.
Chapter 6
Contributions and Future Work
This dissertation has presented an approach to address the difficulty faced by novice
users in using unfamiliar knowledge bases for problem solving. This approach was
implemented in a system, called ASKME, for novice users to answer AP-like exam
questions using unfamiliar knowledge bases originally built by subject matter ex-
perts. This chapter summarizes the ASKME approach’s research contributions as
well as various avenues for future research suggested by the experience of developing
ASKME.
6.1 Contributions
Authors of knowledge bases make many modeling decisions during knowledge en-
gineering. Knowledge base users make assumptions during question formulation.
The result is a mismatch between questions and the representations required to an-
swer them. Systems that reason over formal representations of domain knowledge
often impose strict requirements on the form and content of the representations.
Deviations from the expected form cause many automated reasoners to fail. This
brittleness makes knowledge base systems difficult to build and imposes a steep
learning curve on the user.
6.1.1 Prior art
Early knowledge base systems responded to questions with answers explicitly stored
in databases. These systems work either by pattern matching keywords in a user’s
question with terms in the database or by translating the question into a particular
set of commands to navigate the information in the database. The utility of early
knowledge base systems was limited to answering questions whose answers were
explicitly stored in the database. Thus, questions that required reasoning with
domain expertise could not be answered by these database accessor systems.
The next generation of systems - aptly called expert systems - added in-
ference rules to reason with domain expertise. Expert systems mimic the problem
solving behavior of experts and are typically used to solve problems that do not have
a single correct solution that can be encoded in a conventional algorithm or stored
in a database. The knowledge base in an expert system is narrowly focused and en-
gineered to perform well on specific questions predetermined at design time. Users
interact with expert systems using an interface that restricts the questions that can
be posed to the system. When users pose questions via the user interface, the re-
quired logical forms are then generated and used to answer the queries. Arguably,
the performance of expert systems would have suffered if a different set of users, un-
familiar with knowledge representation or the knowledge base being queried, posed
the questions. The usability of early expert systems has not been evaluated, but
generally these systems were used only by the engineers who built them.
6.1.2 A General Framework for querying unfamiliar knowledge bases
My work advances the state of the art in knowledge based question answering by
addressing the brittleness due to the arm's-length separation between the builders
and the users of the knowledge base. This is especially challenging when the system
is meant to answer questions posed by users who are unfamiliar with knowledge
representation or the content and organization (i.e., the ontology) of the knowledge
bases.
I studied ASKME for the task of answering questions posed by users who
are unfamiliar with the knowledge base. To make the task easier for computers,
ASKME requires users to formulate their questions with a version of restricted
English and a limited number of question types. The ASKME approach works
for the following reasons. First, it has been shown that users can quickly learn
and effectively use simplified English. Second, studies have also found that users
can easily use a small vocabulary to represent a variety of meanings. Third, it has
been shown that a small set of well-known question types is sufficient to answer a
wide variety of questions. Fourth, I have designed a computer program (called the
question mediator) to answer users' questions by identifying relevant information in
the knowledge base for problem solving.
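To illustrate the shape of this design (all names below are hypothetical and do
not reflect ASKME's actual interfaces), the question mediator can be pictured as
a dispatcher from a small, fixed set of question types to knowledge base queries:

    # Toy sketch of a question mediator: a fixed set of question types, each
    # mapped to a different way of consulting the knowledge base.
    class ToyKB:
        facts = {"mitochondrion": "an organelle that produces ATP"}

        def describe(self, term: str) -> str:
            return self.facts.get(term, "no relevant information found")

    def mediate(qtype: str, arg: str, kb: ToyKB) -> str:
        handlers = {"what-is": kb.describe}  # a real system handles more types
        handler = handlers.get(qtype)
        return handler(arg) if handler else "unsupported question type"

    print(mediate("what-is", "mitochondrion", ToyKB()))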
ASKME has two advantages when compared to existing systems. First, an
ASKME user can query a variety of unfamiliar knowledge bases by posing questions
using simplified English. Second, because ASKME does not require users to formu-
late their questions in terms of the underlying knowledge base, the same questions
can continue to be answered on a variety of knowledge bases having different ontologies.
6.1.3 An Application of ASKME for a system answering AP-like
questions
ASKME was studied in the context of Project Halo. A goal of Project Halo is to
develop a knowledge based question answering system capable of answering ques-
tions posed by untrained non-experts, using knowledge bases built by subject matter
experts in a variety of domains [114]. This creates a substantial challenge for auto-
mated question answering systems: answering questions that are formulated without
regard for (or even familiarity with) the knowledge base that is expected to answer
them. A question answering system has successfully addressed this challenge if it
can be coupled with any of a variety of knowledge bases, each with its own inde-
pendently built ontology, and it can answer questions without requiring users to
reformulate the questions for each knowledge base that is used.
I studied ASKME as part of the larger AURA system developed by a team of
researchers to achieve the goals of Project Halo. The AURA system enables SMEs
to create knowledge bases using concept maps, equations and tables, all of which are
converted automatically to computational logic [25]. In addition, the AURA system
enables a different set of users, who have limited domain expertise or familiarity with
the knowledge base, to pose questions, like those found in Advanced Placement (AP)
exams, and to receive coherent answers and explanations.
I assessed ASKME’s performance on the task of answering questions like
those found in the AP exam. The question-answering task involved users posing
questions to the system to retrieve solutions from the knowledge base to answer a
set of AP-like questions. The set of knowledge bases used in the evaluation covered
portions of a college-level science textbook in biology, chemistry, and physics. The
question set used in the evaluation covered a portion of an AP exam and matched the
syllabus of the knowledge base. The Project Halo team chose the AP test as an
evaluation criterion because it is a widely-accepted standard for testing whether a
person has understood the content of a given subject. The team also chose the do-
mains of college level biology, chemistry, and physics because they are fundamental,
hard sciences and they stress different kinds of representations.
The evaluation consists of a series of experiments to test if ASKME can help
novice users to use unfamiliar knowledge bases for problem solving. The initial
experiment measures ASKME’s level of performance under ideal conditions where
the knowledge base is built and used by the same knowledge engineers. Successive
experiments measure ASKME’s level of performance under increasingly realistic
conditions. Ultimately in the final experiment, I measure ASKME’s level of per-
formance under conditions in which the knowledge base is independently built by
subject matter experts and its users are a different group of novice users who are un-
familiar with the knowledge base. Results from various experiments show ASKME
to work well on different knowledge bases and to answer a broad range of questions
that were posed by novice users in a variety of domains.
6.2 Future Work
The implementation of ASKME demonstrates that a system consisting of a restricted
English, a domain-neutral ontology, and a set of mechanisms to handle a small set
of well-known question types can be effective at helping users achieve a high level of
performance at querying unfamiliar knowledge bases. I next discuss some issues that
must be addressed for ASKME to be extended into a production system. Finally, I
describe a use of ASKME beyond the particular application of answering AP-like
questions.
6.2.1 Identifying unstated assumptions
Besides missing knowledge or bad question interpretations, another reason that
ASKME fails to answer questions is its failure to identify relevant information in
the knowledge base for problem solving. This happens when assumptions, expected
by the knowledge base being queried, are not stated by the user or captured by
ASKME processing. These assumptions often relate information in a question to a
problem-solving model or are default values. Examples include relating the height
of a building to the initial-y-position of a Fall from the building or installing 0
meters as the value for the initial-position of a Move starting from rest [85].
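As a sketch of the kind of rule involved (the rule content restates the examples
just given; the code structure itself is hypothetical):

    # If a Move is described as starting from rest and no initial position is
    # given, install 0 meters as the default initial-position, per the example
    # above; an analogous rule would map a building's height to the
    # initial-y-position of a Fall from the building.
    def install_defaults(move: dict) -> dict:
        if move.get("starts-from-rest") and "initial-position" not in move:
            move["initial-position"] = 0.0  # meters
        return move

    print(install_defaults({"starts-from-rest": True}))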
It is a difficult task for novice users to recognize missing assumptions or
to steer ASKME into using specific information in the knowledge base. Doing so
requires detailed knowledge of the internal workings of the system and the contents
of the knowledge base being queried. I plan to improve the question-answering
performance of ASKME by investigating mechanisms for ASKME to perform an
automatic failure analysis when it fails to answer a question [79, 97]. The analysis
will query different problem-solving mechanisms (e.g., the deductive reasoner, the
analogical reasoner, or the equation solver) to identify promising sub-goals to help
users install necessary assumptions in their question formulations and to guide the
problem-solving process.
6.2.2 Sanity Checking
Ideally, ASKME will have the ability to check the sanity of an answer. This can be
in the form of simple heuristics to quickly check if an answer makes sense. More
elaborate approaches involve querying large commonsense knowledge bases [69] or
integrating a domain-specific qualitative reasoner to gather “ballpark” answers [89].
Another approach is to query the Internet for evidence to determine if the derived
answer is reasonable [7].
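A minimal sketch of the simple-heuristics variant, with illustrative bounds that
are assumptions rather than ASKME rules:

    # Range heuristic: flag numeric answers that fall outside a plausible
    # interval for the quantity being asked about.
    PLAUSIBLE = {
        "speed_m_per_s": (0.0, 343.0),  # assume sub-sonic AP-style problems
        "mass_kg": (1e-30, 1e6),
    }

    def sane(quantity: str, value: float) -> bool:
        low, high = PLAUSIBLE[quantity]
        return low <= value <= high

    print(sane("speed_m_per_s", 15.2))   # True: plausible
    print(sane("speed_m_per_s", 9.8e6))  # False: likely a unit or model error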
6.2.3 Debug Tool
Helping users determine why ASKME returned an incorrect answer will enable them
to improve the knowledge base and fix problematic question formulations. I have
developed a prototype debugging facility to help knowledge engineers debug the
causes for incorrect answers. The prototype debugger helps the user to:
1. Observe the pieces of knowledge selected by ASKME
2. Examine why a piece of knowledge is selected or rejected by ASKME
3. Examine how a piece of knowledge is used by ASKME
4. Explore the question representation examined by ASKME
5. Repair the question representation examined by ASKME
Helping novice users debug the cause of incorrect answers will require the
debugger to clearly show the domain and control knowledge used by ASKME.
Towards this end, I plan to enhance the debugger to support the common human
problem solving strategy of decomposing a problem into simpler parts and then
solving these parts individually [83, 91], allowing novice users to specify subgoals
and inspect the results in order to check the inter-
mediate problem solving steps. Ideally, the proposed debug tool will help ASKME
users to gain a deeper understanding of why questions fail to answer correctly and
will reduce the time required to debug an incorrectly answered question.
6.2.4 Episodic Memory and Transfer Learning
Humans make use of past experiences to improve both performance and compe-
tence. The problem-solving experience and assumptions identified in solving other
questions provide guidance when new questions are attempted. I believe leverag-
ing past problem-solving experience can improve scalability and allow ASKME to
answer additional questions that it could not answer previously. A generic episodic
memory has been integrated into ASKME to improve problem solving performance
by learning and transferring control knowledge in solving AP-like physics questions
[108]. A separate AP physics question answering system by [63] demonstrated learn-
ing and transfer of domain knowledge to answer unseen questions by relating them
to previously answered questions. The results from both efforts are encouraging
and merit further investigation on approaches to identify the types of control and
domain knowledge that can be learned and transferred when ASKME attempts a
variety of problems in different domains.
6.2.5 Machine Reading Application
Machine reading automatically constructs a knowledge base by processing English
text [36]. Due to the size and complexity of automatically generated knowledge
bases, it may be difficult or tedious for users to understand the contents and orga-
nization of the learned knowledge well enough to query them. ASKME can be used
to pose and receive answers to questions with automatically generated knowledge
bases. I have integrated ASKME into the machine reading system described in [10].
The results have been encouraging. In the future, I plan to further improve and
evaluate the performance of ASKME in a variety of machine reading applications.
Bibliography
[1] ASD Simplified Technical English. Technical report, Aerospace and Defence
Industries Association of Europe, 2005. Specification ASD-STE100.
[2] Liane Acker and Bruce W. Porter. Extracting viewpoints from knowledge
bases. In National Conference on Artificial Intelligence, pages 547–552, 1994.
[3] Jan S. Aikins. Prototypical knowledge for expert systems. Artificial Intelli-
gence, (20):163–210, 1983.
[4] I. Androutsopoulos, G.D. Ritchie, and P. Thanisch. Natural language inter-
faces to databases – an introduction. Journal of Natural Language Engineer-
ing, 1(1):29–81, 1995.
[5] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common?
Extracting semantics from Wiki content. Lecture Notes in Computer Science,
4519:503, 2007.
[6] S Auer, C Bizer, J Lehmann, G Kobilarov, R Cyganiak, and Z Ives. DBpedia:
A Nucleus for a Web of Open Data. In In Sixth International Semantic Web
Conference, Busan, Korea, pages 11–15. Springer, 2007.
[7] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead,
and Oren Etzioni. Open information extraction from the web. In Proceedings
of the 20th International Joint Conference on Artificial Intelligence, 2007.
[8] Ken Barker, Bruce Porter, and Peter Clark. A library of generic concepts for
composing knowledge bases. In Proceedings of the First International Confer-
ence on Knowledge Capture, 2001.
[9] Ken Barker, Shaw Yi Chaw, James Fan, Bruce Porter, Dan Tecuci, Peter
Yeh, Vinay K. Chaudhri, David Israel, Sunil Mishra, Pedro Romero, and Pe-
ter E. Clark. A question-answering system for AP chemistry: Assessing KR&R
technologies. In Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the Ninth International Conference, 2004.
[10] Ken Barker, Bhalchandra Agashe, Shaw-Yi Chaw, James Fan, Noah Fried-
land, Michael Glass, Jerry Hobbs, Eduard Hovy, David Israel, Doo-Soon Kim,
Rutu Mulkar, Sourabh Patwardhan, Bruce Porter, Dan Tecuci, and Peter Z.
Yeh. Learning by reading: A prototype system, performance baseline and
lessons learned. In Proceedings of the Twenty-Second National Conference on
Artificial Intelligence, 2007.
[11] John A. Bateman, Renate Henschel, and Fabio Rinaldi. Generalized Upper
Model 2.0: documentation. Technical report, GMD/Institut fur Integrierte
Publikations- und Informationssysteme, Darmstadt, Germany, 1995.
[12] I.L. Beck. Questioning the Author: An Approach for Enhancing Student En-
gagement with Text. Order Department, International Reading Association,
800 Barksdale Road, PO Box 8139, Newark, DE 19714-8139, 1997.
[13] Abraham Bernstein and Esther Kaufmann. GINO - A Guided Input Natural
Language Ontology Editor. In Proceedings of the Fifth International Semantic
Web Conference, pages 144–157. Springer, November 2006.
[14] B.S. Bloom et al. Taxonomy of educational objectives, handbook I: Cognitive
domain, 1956.
[15] Ron J. Brachman and James G. Schmolze. An overview of the KL-ONE
knowledge representation system. Cognitive Science, 9:171–216, 1985.
[16] T.L. Brown, T.E. Brown, H.E. LeMay, B.E. Bursten, C. Murphy, and
P. Woodward. Chemistry: The Central Science. Pearson Education Inter-
national, 2008.
[17] B. G. Buchanan and E.H. Shortliffe. Rule-Based Expert Systems: The MYCIN
Experiments of the Stanford Heuristic Programming Project. Addison-Wesley,
Reading, MA, 1984.
[18] Alan Bundy, George Luger, Martha Palmer, and Robert Welham. Mecho:
Year one. In Proceedings of the Second AISB Conference, 1976.
[19] Alan Bundy, L. Byrd, George Luger, Chris Mellish, R. Milne, and Martha
Palmer. Mecho: a program to solve mechanics problems. Technical report,
Department of Artificial Intelligence, Edinburgh University, 1979. Working
paper 50.
[20] J. Burger, C. Cardie, Vinay Chaudhri, R. Gaizauskas, S. Harabagiu, David
Israel, C. Jacquemin, C.Y. Lin, S. Maiorano, G. Miller, et al. Issues, tasks
and program structures to roadmap research in question & answering (Q&A).
Document Understanding Conferences Roadmapping Documents, 2001.
[21] N.A. Campbell and J.B. Reece. Biology. Pearson Education International,
6th edition, 2001.
[22] B. Chandrasekaran. Towards a taxonomy of problem solving types. AI mag-
azine, 4(1):9, 1983.
[23] Vinay Chaudhri. Failure analysis report for the refinement phase. Technical
report, SRI International, 2009.
[24] Vinay Chaudhri and Richard Fikes. Question answering systems: Papers from
the 1999 fall symposium. Technical report, AAAI, 1999. FS-98-04.
[25] Vinay Chaudhri, Bonnie John, Sunil Mishra, John Pacheco, Bruce Porter,
and Aaron Spaulding. Enabling Experts to Build Knowledge-bases from Sci-
ence Textbooks. In Proceedings of the Fourth International Conference on
Knowledge Capture, 2007.
[26] P. Clark and P. Harrison. Boeing's NLP System and the Challenges of Semantic
Representation. In Semantics in Text Processing. STEP 2008 Conference
Proceedings, Venice, Italy. Citeseer, 2008.
[27] Pete Clark and John Thompson. Why is it hard to understand original english
questions? Technical report, Boeing Phantom Works, 2009. Working note 32.
[28] Pete Clark, Ken Barker, Bruce Porter, Vinay Chaudhri, Sunil Mishra, and
Jerome Thomere. Enabling domain experts to convey questions to a machine:
a modified, template-based approach. In Proceedings of the Second Interna-
tional Conference on Knowledge Capture, 2003.
[29] Peter Clark and Bruce Porter. KM - The Knowledge Machine: Ref-
erence manual. Technical report, University of Texas at Austin, 1998.
http://www.cs.utexas.edu/users/mfkb/km.html.
[30] Peter Clark, Phil Harrison, Tom Jenkins, John Thompson, and Rick Wojcik.
Acquiring and using world knowledge using a restricted subset of English.
In Proceedings of the 18th International FLAIRS Conference (FLAIRS’05),
2005.
[31] Peter Clark, Shaw-Yi Chaw, Ken Barker, Vinay Chaudhri, Phil Harrison,
James Fan, Bonnie John, Bruce Porter, Aaron Spaulding, John Thompson,
and Peter Z. Yeh. Capturing and Answering Questions Posed to a Knowledge-
Based System. In Proceedings of the Fourth International Conference on
Knowledge Capture, 2007.
[32] Paul R. Cohen, Robert Schrag, Eric K. Jones, Adam Pease, Albert Lin,
Barbara Starr, David Gunning, and Murray Burke. The DARPA high-
performance knowledge bases project. AI Magazine, 19(4):25–49, 1998.
[33] Halo 2 Evaluation Committee. The large-scale evaluations of Halo 2: Guiding
questions and recommended procedures, 2005.
[34] M. Conner. What a reference librarian should know. The Library Journal, 52
(8):415–418, 1927.
[35] EN Efthimiadis. Query expansion. Annual review of information science and
technology, 31:121–187, 1996.
[36] Oren Etzioni, Michele Banko, and Michael J. Cafarella. Machine reading. In
Proceedings of the Twenty-First National Conference on Artificial Intelligence.
AAAI Press, 2006.
[37] Brian Falkenhainer and Kenneth D. Forbus. Compositional modeling: finding
the right model for the job. Artificial Intelligence, 51(1-3):95–143, 1991.
[38] James Fan and Bruce W. Porter. Interpreting loosely encoded questions. In
Proceedings of the Nineteenth National Conference on Artificial Intelligence.
AAAI Press, 2004.
[39] James Fan, Ken Barker, and Bruce W. Porter. The knowledge required to
interpret noun compounds. In Proceedings of the Eighteenth International
Joint Conference on Artificial Intelligence. Morgan Kaufmann, 2003.
[40] James Fan, Ken Barker, and Bruce Porter. Indirect anaphora resolution as
semantic path search. In Proceedings of the Third International Conference on
Knowledge Capture, pages 153–160, New York, NY, USA, 2005. ACM Press.
[41] C. Fellbaum. WordNet: An Electronical Lexical Database. The MIT Press,
Cambridge, MA, 1998.
[42] D. Fensel, E. Motta, F. Van Harmelen, V.R. Benjamins, M. Crubezy,
S. Decker, M. Gaspari, R. Groenboom, W. Grosso, M. Musen, et al. The
unified problem-solving method development language UPML. Knowledge
and Information Systems, 5(1):83–131, 2003.
[43] Richard Fikes and Adam Farquhar. Distributed repositories of highly expres-
sive reusable ontologies. IEEE Intelligent Systems, 14(2):73–79, 1999.
[44] Noah Friedland, Paul Allen, Gavin Matthews, Michael Witbrock, Jon Cur-
tis, Blake Shepard, Pierluigi Miraglia, Angele Jurgen, Steffen Staab, Eddie
Moench, Henrik Oppermann, Dirk Wenke, David Israel, Vinay Chaudhri,
Bruce Porter, Ken Barker, James Fan, Shaw-Yi Chaw, Peter Yeh, Dan Tecuci,
and Peter Clark. Project Halo: Towards a Digital Aristotle. AI Magazine,
2004.
[45] Norbert E. Fuchs, Uta Schwertel, and Rolf Schwitter. Attempto Controlled
English (ACE) language manual, version 3.0. Technical report, University of
Zurich, 1999.
[46] David Genest and Michel Chein. An experiment in document retrieval using
conceptual graphs. In ICCS ’97: Proceedings of the Fifth International Con-
ference on Conceptual Structures, pages 489–504, London, UK, 1997. Springer-
Verlag.
[47] Douglas C. Giancoli. Physics: Principles with Applications. Prentice Hall, 5th
edition, 1998.
[48] Arthur C. Graesser and N.K. Person. Question asking during tutoring. Amer-
ican Educational Research Journal, 31(1):104, 1994.
[49] Arthur C. Graesser, N. Person, and J. Huber. Mechanisms that generate ques-
tions. Questions and information systems, pages 167–187, 1992.
[50] Arthur C. Graesser, Yasuhiro Ozuru, and Jeremiah Sullins. What is a good
question? In M. McKeown, editor, Festscrift for Isabel Beck. Erlbaum, Mah-
wah, NJ, 2009.
[51] B. F. Green, A. K. Wolf, C. Chomsky, and K. Laughery. Baseball: An auto-
matic question answerer. In B. J. Grosz, K. Sparck Jones, and B. L. Webber,
editors, Natural Language Processing, pages 545–549. Kaufmann, Los Altos,
CA, 1986.
[52] Thomas R. Gruber. A translation approach to portable ontology specifications.
Knowledge Acquisition, 5(2):199–220, 1993.
[53] Nicola Guarino, Claudio Masolo, and Guido Vetere. Ontoseek: Content-based
access to the web. IEEE Intelligent Systems, 14(3):70–80, 1999.
[54] SM Harabagiu, SJ Maiorano, and MA Pasca. Open-domain question answer-
ing techniques. Natural Language Engineering, 1:1–38, 2002.
[55] P. Harrison and M. Maxwell. A new implementation of GPSG. In Proc. 6th
Canadian Conf on AI, pages 78–83, 1986.
[56] E. Hovy, U. Hermjakob, and C.Y. Lin. The use of external knowledge in
factoid QA. NIST SPECIAL PUBLICATION SP, pages 644–652, 2002.
[57] T. W. C. Huibers, Iadh Ounis, and Jean-Pierre Chevallet. Conceptual graph
aboutness. In ICCS ’96: Proceedings of the 4th International Conference on
Conceptual Structures, pages 130–144, London, UK, 1996. Springer-Verlag.
[58] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, San Mateo, CA,
1993.
[59] Doo-Soon Kim and Bruce Porter. KLEO: A Bootstrapping Learning-by-
Reading System. In AAAI’09 Spring Symposium on Learning by Reading
and Learning to Read, 2009.
[60] Karen Kipper. VerbNet: A broad-coverage, comprehensive verb lexicon. PhD
thesis, University of Pennsylvania, 2005.
[61] R.I. Kittredge. Sublanguages and controlled languages. The Oxford Handbook
of Computational Linguistics, pages 430–447, 2003.
[62] Matthew Klenk. Using Analogy to Overcome Brittleness in AI Systems. PhD
thesis, Northwestern University, Evanstown, IL, 2009.
[63] Matthew Klenk and Ken Forbus. Measuring the level of transfer learning by
an AP Physics problem-solver. In Proceedings of the Twenty-Second National
Conference on Artificial Intelligence. AAAI Press, 2007.
[64] Kevin Knight and Steve K. Luk. Building a large-scale knowledge base for
machine translation. In AAAI ’94: Proceedings of the twelfth national confer-
ence on Artificial intelligence (vol. 1), pages 773–778, Menlo Park, CA, USA,
1994. American Association for Artificial Intelligence.
[65] K.L. Kwok, L. Grunfeld, N. Dinstl, and M. Chan. TREC 2001 question-
answering, web and cross language track experiments using PIRCS. In In-
formation Technology: The Tenth Text Retrieval Conference, TREC, NIST
Special Publication 500-250, 2001.
[66] G. Lakoff and M. Johnson. Metaphors We Live By. University of Chicago
Press, Chicago, 1980.
[67] D.B. Leake. Case-based reasoning: Experiences, lessons and future directions.
MIT Press Cambridge, MA, USA, 1996.
[68] Wendy Lehnert. The Process of Question Answering. PhD thesis, Yale Uni-
versity, New Haven, CT, 1977.
[69] Douglas B. Lenat and R. V. Guha. Building Large Knowledge-Based Systems.
Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1989.
[70] Mark T. Maybury. Question answering: An introduction. In New Directions
in Question Answering, pages 3–18. AAAI Press, 2004.
[71] John McDermott. R1: An expert in the computer systems domain. In Proceed-
ings of the First National Conference on Artificial Intelligence, pages 269–271.
AAAI Press, 1980.
[72] David McDonald, Alice Leung, David Getty, and Brett Benyo. Halo Phase II
Evaluation: Final report. Technical report, BBN Technologies, 2009.
[73] O. Medelyan and C. Legg. Integrating Cyc and Wikipedia: Folksonomy meets
rigorously defined common-sense. In Proceedings of Wikipedia and AI work-
shop at the AAAI-08 Conference. Chicago, US, July, volume 12, 2008.
[74] O. Medelyan, D. Milne, C. Legg, and I.H. Witten. Mining meaning from
Wikipedia. International Journal of Human-Computer Studies, 2009.
[75] Cynthia Matuszek, Michael Witbrock, Robert C. Kahlert, John Cabral,
Dave Schneider, Purvesh Shah, and Doug Lenat. Searching for common
sense: Populating Cyc from the web. In Proceedings of the 20th National
Conference on Artificial Intelligence, 2005.
[76] D. Moldovan, S. Harabagiu, M. Pasca, Rada Mihalcea, R. Goodrum, R. Girju,
and Vasile Rus. Lasso: A tool for surfing the answer net. NIST SPECIAL
PUBLICATION SP, pages 175–184, 2000.
[77] P.B. Mosenthal, H. Hall, and L. Green. Understanding the strategies of docu-
ment literacy and their conditions of use. Journal of Educational Psychology,
88(2):314–332, 1996.
[78] Erik T. Mueller. Natural Language Processing with ThoughtTreasure. Signi-
form, New York, USA, 1998.
[79] Ken Murray. Basic Problem Solver Module in AURA. Private communication,
2008.
[80] S H Myaeng and A Lopez-Lopez. Conceptual graph matching: A flexible
algorithm and experiments. Journal of Experimental and Theoretical Artificial
Intelligence, (4):107–126, 1992.
[81] V. Nastase and M. Strube. Decoding Wikipedia categories for knowledge
acquisition. In Proceedings of the AAAI, volume 8, 2008.
[82] Allen Newell and G. Ernst. The search for generality. In Proc. IFIP Congress
65, pages 17–24, 1965.
[83] Allen Newell and Herbert A. Simon. Human Problem Solving. Prentice-Hall,
Englewood Cliffs, NJ, 1972.
[84] NJ Nilsson. Principles of artificial intelligence, Tioga Pub. Co., Palo Alto,
CA, 1980.
[85] Gordon S. Novak. [Halo] Rules. Private communication, 2007.
[86] Gordon S. Novak. Computer understanding of physics problems stated in
natural language. American Journal of Computational Linguistics, 1976.
[87] Gordon S. Novak and William C. Bulko. Understanding natural language
with diagrams. In Proceedings of the Eighth National Conference on Artificial
Intelligence, 1990.
[88] Gordon S. Novak and Won H. Ng. Rule engine. Private communication, 2006.
[89] Praveen Paritosh. The heuristic reasoning manifesto. In Proceedings of the
20th International Workshop on Qualitative Reasoning, 2006.
[90] Aarati Parmar. The representation of actions in KM and Cyc. Technical
report, Stanford University, 2001.
[91] George Polya. How to solve it. Princeton University Press, New Jersey, USA,
1945.
[92] J. Pomerantz. A linguistic analysis of question taxonomies. Journal of the
American Society for Information Science and Technology, 56(7):715–728,
2005.
[93] S.P. Ponzetto and M. Strube. Deriving a large scale taxonomy from Wikipedia.
In Proceedings of the 22nd National Conference on Artificial Intelligence,
pages 1440–1445, 2007.
[94] Jonathan Poole and J. A. Campbell. A novel algorithm for matching con-
ceptual and related graphs. In In G. Ellis et al eds, Conceptual Structures:
Applications, Implementation and Theory, pages 293–307. Springer-Verlag,
LNAI, 1995.
[95] P Procter. Longman Dictionary of Contemporary English, 1978.
[96] Peter J. Pym. Simplified English and machine translation. Technical report,
Perkins Engines UK, 1990.
[97] Jeff Rickel and Bruce Porter. Automated modeling for answering prediction
questions: selecting the time scale and system boundary. In AAAI’94: Pro-
ceedings of the twelfth national conference on Artificial intelligence (vol. 2),
pages 1191–1198, Menlo Park, CA, USA, 1994. American Association for Ar-
tificial Intelligence.
[98] W.P. Robinson and S.J. Rackstraw. A question of answers, volume 1. Rout-
ledge & Kegan Paul Books, 1972.
[99] W.P. Robinson and S.J. Rackstraw. A question of answers, volume 2. Rout-
ledge & Kegan Paul Books, 1972.
[100] Roger C. Schank. Conceptual Dependency: A Theory of Natural Language
Understanding. Cognitive psychology, 3(4):552–631, 1972.
[101] Roger C. Schank. Explanation patterns: Understanding mechanically and cre-
atively. Lawrence Erlbaum Associates, 1986.
[102] Robert Schrag, M. Pool, Vinay Chaudhri, R. C. Kahlert, J. Powers, Paul R.
Cohen, J. Fitzgerald, and Sunil Mishra. Experimental evaluation of subject
matter expert-oriented knowledge base authoring tools. Technical report, In-
formation Extraction and Transport, Inc., August 2002. Proceedings of the
2002 PerMIS Workshop, August 13-15, 2002, NIST Special Publication 990,
pp. 272-279.
[103] Rolf Schwitter. English as a formal specification language. In DEXA ’02: Pro-
ceedings of the 13th International Workshop on Database and Expert Systems
Applications, pages 228–232, Washington, DC, USA, 2002. IEEE Computer
Society.
[104] Rolf Schwitter, Kaarel Kaljurand, Anne Cregan, Catherine Dolbear, and Glen
Hart. A comparison of three controlled natural languages for OWL 1.1. In
OWL: Experiences and Directions (OWLED), 2008.
[105] Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and
Wan Li Zhu. Open mind common sense: Knowledge acquisition from the
general public. In On the Move to Meaningful Internet Systems, 2002 -
DOA/CoopIS/ODBASE 2002 Confederated International Conferences DOA,
CoopIS and ODBASE 2002, pages 1223–1237, London, UK, 2002. Springer-
Verlag.
[106] John F. Sowa. Conceptual Structures: Information Processing in Mind and
Machine. Addison-Wesley, 1984.
[107] F.M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowl-
edge. In Proceedings of the 16th international conference on World Wide Web,
pages 697–706. ACM New York, NY, USA, 2007.
[108] Dan G. Tecuci. A Generic Memory Module for Events. PhD thesis, The
University of Texas at Austin, Austin, TX, 2007.
[109] Teknowledge Corporation. Rapid Knowledge Formation project, 2002.
http://reliant.teknowledge.com/RKF.
[110] John Thompson and Peter Clark. Guide for CPL users, version 7.5. Technical
report, The Boeing Company, 2006.
[111] M. Manuela Magalhaes A. Veloso. Learning by analogical reasoning in general
problem-solving. PhD thesis, Carnegie Mellon University, Pittsburgh, PA,
USA, 1992.
[112] Charles A. Verbeke. Caterpillar fundamental english. Training and Develop-
ment Journal, 27(2):36–40, February 1973.
[113] E.M. Voorhees. The TREC question answering track. Natural Language En-
gineering, 7(04):361–378, 2002.
[114] Vulcan Inc. Project Halo, 2003. http://projecthalo.com.
[115] Daniel S. Weld, Raphael Hoffmann, and Fei Wu. Using Wikipedia to bootstrap
open information extraction. SIGMOD Rec., 37(4):62–68, 2008.
[116] E. N. White. International language for servicing and maintenance. University
of Wales - Institute of Science and technology, 1974.
[117] Wikipedia. Wikipedia, the free encyclopedia, 2009. URL
http://en.wikipedia.org/.
[118] William A. Woods, R.N. Kaplan, and B.N. Webber. The Lunar Sciences Nat-
ural Language Information System: Final Report. Technical report, Bolt
Beranek and Newman Inc., 1972. BBN Report 2378.
[119] M. Montes-y-Gomez, A. Gelbukh, A. Lopez-Lopez, and R. Baeza-Yates. Flex-
ible comparison of conceptual graphs. In Proceedings of the 12th Interna-
tional Conference and Workshop on Database and Expert Systems Applica-
tions, pages 102–111. Springer, 2001.
[120] Peter Z. Yeh, Bruce W. Porter, and Ken Barker. Using transformations to
improve semantic matching. In Proceedings of the Second International Con-
ference on Knowledge Capture, 2003.
[121] Peter Z. Yeh, Bruce W. Porter, and Ken Barker. A unified knowledge based
approach for sense disambiguation and semantic role labeling. In Proceedings
of the Twenty-First National Conference on Artificial Intelligence, 2006.
[122] C. Zirn, V. Nastase, and M. Strube. Distinguishing between instances and
classes in the Wikipedia taxonomy. Lecture Notes in Computer Science, 5021:
376, 2008.
Vita
Shaw Yi Chaw was born in Singapore on December 6, 1977. He graduated from the
National University of Singapore with his Bachelor’s degree in Computer Science in
2002. Shaw Yi began his graduate studies in the Computer Sciences Department of
the University of Texas at Austin in 2003. He successfully graduated with a Ph.D.
degree in 2009. Shaw Yi will join IBM T.J. Watson Research Center to work on a
project whose goal is to beat the human champion in the Jeopardy! quiz show.
Permanent Address: 263, Bishan Street 22
#24-261
Singapore 570263
Singapore
This dissertation was typeset with LaTeX2e by the author.

LaTeX2e is an extension of LaTeX. LaTeX is a collection of macros for TeX. TeX is
a trademark of the American Mathematical Society.