
PROSPECTS FOR KNOWLEDGE-BASED CUSTOMIZATION OF NATURAL LANGUAGE QUERY SYSTEMS

FRED J. DAMERAU Thomas J. Watson Research Laboratory, P.O. Box 218, Yorktown Heights, NY 10598, U.S.A.

(Received 20 January 1988; accepted in final form 20 April 1988)

Abstract: This article discusses the potential sources of knowledge for customizing transportable natural language query systems and provides a rough quantification of the importance of each source. The major potential knowledge source would be a very large, sophisticated dictionary. Inferences from database content are much less important. The human database expert is the only source for a considerable amount of the required information.

1. INTRODUCTION

One of the recent trends in the design of natural language database query systems has been the provision for portability of a query system from one database to another without intervention from system designers. This has been accomplished by providing a base system that is then tailored to a particular database by a human database expert, using a tool provided with the system. A number of such systems have been built (references to a representative set are given below). Although these systems differ greatly in their design, in part because of differences in the design of the target query system, the kinds of information that must be provided to a natural language query system by a database expert are very similar across all systems. These include, but are not limited to:

1. A name for each entity described in the database
2. A name for each attribute of each entity
3. Natural language equivalents for coded data values
4. Connection of the natural language expressions to the database constructs
5. Structural properties of the database

Some short quotations from relevant literature on building domain knowledge into natural language query systems show the commonality of approach quite clearly:

* From discussion of the LDC (Layered Domain Class) system developed at Duke University:

The initial interaction between a user and LDC, which involves telling the system about a new domain, consists of a knowledge-acquisition session with the preprocessor, which we call “Prep.” In particular, Prep asks for (1) the names of each type of “entity” (object) of the domain; (2) the nature of the relationships among entities; (3) the English words that will be used as nouns, verbs, and modifiers; and (4) morphological and semantic properties of these new words. ([2], p. 93)

* From a description of the IRUS (Information Retrieval Using the RUS parser) system, developed at BBN:

There are three kinds of domain dependent semantic knowledge: Domain Model. What kinds of things will be talked about in the domain? What generalizations of the classes will be useful? What are the relations between classes? . . . Interpretability . . . What are the selectional restrictions, the kind of thing that should fill each syntactic slot in a phrase? . . . Predication. How should the constituents be combined to form the interpretation of a phrase? ([1], p. 624)

* From the TQA (Transformational Question Answering) system, developed at IBM Research:

The application-specific portion of the lexicon in TQA has a number of different kinds of entries. Obviously, the English equivalents of column names and their synonyms are part of the lexicon. In addition to these, it is sometimes also necessary to include column values and their synonyms in the dictionary. . . . If the general dictionary does not indicate that there is an associated verb, the customization program will ask if there is one. . . . It asks, using examples, what type of verb this is (transitive, etc.) and then asks what columns can provide arguments for the allowable set of cases for the verb type. . . . ([8], pp. 171-174)

* From discussion of the TEAM (Transportable English database Access Medium) system, developed at SRI International:

The acquisition component of TEAM is responsible for gathering information from the DBE (data base expert) about the structure of the database, the words that refer to objects in the database, and the relations between them. This information is incorporated into the lexicon and into the conceptual and database schemata. . . . The linguistic knowledge the DIALOGIC needs must be inferred from answers to questions that tap a layman’s linguistic competence without recourse to the terminology of a trained linguist. ([12], p. 574)

* From discussion of the USL (User Specialty Languages) system, developed at the IBM Scientific Center in Heidelberg:

. . . the User Specialty Languages System assumes a view of data in which each noun, verb, or adjective addresses a relation. These relations are normally defined as virtual relations, or views, on the user’s base relations. The columns in these virtual relations or views have standard role-names that correspond to the cases, prepositions, or adverbials that are governed by the word as used in the application. . . . The words that shall address the relations are defined via a prompting routine. . . . If users are defining a noun, they are asked for the singular form and for the article. . . . If a verb is being defined, the verb type is elicited. ([17], pp. 13-15)

Research on the systems cited above and other similar ones demonstrates that, at least in favorable test cases, elicitation from database administrators can generate domain dependent query systems from domain independent bases. However, the problem of transportability is by no means solved.

The intent of this article is to identify some of the remaining problems of transportability, their importance, and the prospects for their solution.

Experience with the TQA system has shown that the process of customization is both tedious and error prone. As the foregoing quotations show, database query customization systems all seem to rely heavily on tapping not only the database knowledge but also the grammatical knowledge of the experienced database expert, rather than encoding that knowledge into the customization system. Database experts, however, are not usually grammar experts and may find it difficult to understand what the system requires. Given the current emphasis on knowledge-based systems, it is worth investigating what sorts of knowledge would aid the customization task, how it might be structured, and where it might be obtained.

2. PRELIMINARIES: INTRODUCTION TO DATABASE QUERY AND THE TQA SYSTEM

As illustrative examples will be drawn from the TQA system, a brief description of the system appears appropriate. The main components of the TQA system are a preprocessor, a transformational parser [13], a Knuth-style semantic interpreter [11,14], and an SQL-to-English translator (Fig. 1).

The preprocessor performs input tokenization and dictionary lookup. The parser is separable into four phases: (1) application of a set of string transformations for local rearrangement [16], (2) context-free parsing to produce surface structures, (3) application of the main transformational component to produce canonical underlying structures, and (4) application of database-oriented transformations to produce structures that reflect the content of the database. The interpreter transforms the canonical sentence representation produced by the grammar first into a logical form (an expression in the domain relational calculus [9]), and then into an SQL expression. The SQL-to-English translation component provides feedback in English to users as to how the system has understood the original query. An example of the TQA answer panel is shown in Fig. 2.

All of the component programs of TQA are table driven. That is, the parser, interpreter, etc., do not change from application to application but the contents of the dictionaries, system tables, and in a few cases grammar rules differ to some extent from database to database. Database-specific extensions to all of these files are produced by a customization program during a dialogue with a database administrator and integrated by a compilation program into the TQA base system. Examples of portions of a customization dialogue are given as follows.

3. KNOWLEDGE SOURCES

As an initial step in determining the relative importance of the kinds of information needed by a natural language query system, three categories of knowledge source were considered:

1. Information that can only come from the database administrator doing customization. This includes what tables (relations) are in the database interface now being constructed (to simplify discussion, an underlying relational database is assumed), what the names of the entities are that these tables describe, by what natural language expressions the column names (attributes) are referred to, what the displayed titles should be for each column, how data values should be formatted, etc. This information should be relatively easy for a database expert to supply, and will not be discussed further.

2. Information which can come from suitably complete dictionaries. This includes part-of-speech, argument structure for nouns and verbs, related words and synonyms, and the like. The dictionary category also includes some information that might be classed as world knowledge (e.g., whether natural language equivalents for database field values belong to some semantic category such as human, organization, etc., since such information is often represented in dictionaries). It is information of this type that may be particularly difficult for a database expert to supply. No existing computer-usable dictionary contains all of this information for a significantly large subset of English vocabulary. In what follows, a subcategory of information that is obtainable from current dictionaries is identified.

3. Information that can be read from or deduced from the database catalogues (or data dictionary) and the database contents. This includes candidate keys for each table (i.e., the set(s) of columns that are unique in each row of a table), subset information, domain information (i.e., which columns have values drawn from the same underlying set of values), and the like. Obtaining some of this information by database processing may be very expensive computationally and might require verification by the database expert.

[Figure: block diagram of the TQA processing pipeline. Recoverable labels: Dictionary; Lexical Trees; Surface Grammar; Transformational (Inverse) Transformations; Underlying Structure(s); Semantic Interpreter; Attribute Grammar; Data Base Description; SQL Expression(s); English Paraphrase; Answer.]

Fig. 1. TQA natural language processor.

4. POTENTIAL CONTRIBUTION FROM EACH KNOWLEDGE SOURCE

It is desirable to estimate how much each of the knowledge sources listed above might contribute to the solution of the customization problem. In TQA, as each database is customized, a record is maintained by the customization tool of every request by the system to the user and of the user’s response. This record serves a number of purposes including system debugging, rerun of a customization with minor changes, and others. For present purposes, it permits quantification, of a sort, of the value of each potential knowledge source. That is, each question asked of the user can be identified as a request for dictionary information, grammatical information, database structure information, etc. and classified into the categories: full dictionary (i.e., including complete information on subcategorization, part of speech, usage, etc.), current dictionary (i.e., only part of speech and perhaps semantic class and a few other pieces of information), database inference, or necessary user input. Types of dictionary information that can be requested from the database expert include required arguments and their database reference, Fig. 3; morphology, Fig. 4; part of speech, Fig. 5; and synonyms, Fig. 6.

Question:

what parts are red

Interpretation:

Find the parts whose color is red. Show the part numbers and types of the parts.

SELECT DISTINCT A.PNO, A.TYPE FROM TQASQL.PARTS A WHERE A.COLOR = 'RED';

Part   Type
----   -----
P1     NUT
P4     SCREW
P6     COG

Fig. 2. Example of a TQA answer panel.

Information obtainable from current dictionaries includes synonyms, parts-of-speech, and similar elementary information. Potential database inferences include keys, Fig. 7; domains, Fig. 8; and padding, Fig. 9.

Necessarily supplied by the database expert are entity names, Fig. 10; print headings, Fig. 11; cover terms for coded values, Fig. 12, and similar data.

The number of interactions of each type can be determined for those databases for which the trace record is still available. (Unfortunately, the trace record is not timestamped, so it is not possible to estimate the time spent in customization for each category.) An interaction is taken to be a user response to a panel requesting a piece or pieces of information. These interactions are not all of the same level of complexity. Some require only that a key be depressed, whereas others might require typing half a dozen or more pieces of information into a form. In the absence of any principled way to normalize interactions, all responses are given the same weight. There is some justification for this, in that each presented panel is intended to determine a single piece of information (e.g., the subject of a verb, instead of the complete argument structure of the verb all at one time). (The verb subject might, however, optionally come from several columns, so more than one piece of data might be necessary in response.) A summary of interactions in each of the categories listed above for five sample databases is shown in Fig. 13.
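The equal-weight counting described above can be sketched in a few lines of Python. This is only an illustration: the trace format (one panel identifier per user response) and the sample trace are hypothetical, and the panel-to-category mapping is read off Figs. 3-12, with the placement of the morphology panel under "full dictionary" being an assumption. Treating "current dictionary" as a separate label rather than a subcategory is a simplification.

```python
from collections import Counter

# Panel identifiers as they appear in Figs. 3-12, mapped to knowledge-source
# categories. The mapping for PAN28 (morphology) is an assumption.
PANEL_CATEGORY = {
    "PAN261": "full dictionary",     # argument reference (Fig. 3)
    "PAN28": "full dictionary",      # morphology (Fig. 4), assumed category
    "PAN69": "current dictionary",   # part of speech (Fig. 5)
    "PAN751": "current dictionary",  # synonyms (Fig. 6)
    "PAN98": "database inference",   # table key (Fig. 7)
    "PAN85": "database inference",   # domain (Fig. 8)
    "PAN521": "database inference",  # padding character (Fig. 9)
    "PAN95": "user input",           # entity name (Fig. 10)
    "PAN92": "user input",           # display heading (Fig. 11)
    "PAN13": "user input",           # cover terms (Fig. 12)
}


def summarize(trace):
    """Count interactions per category, giving every response equal weight.

    `trace` is a hypothetical list of panel identifiers, one per user response.
    Returns (counts, shares), where shares are fractions of all interactions.
    """
    counts = Counter(PANEL_CATEGORY.get(panel, "unclassified") for panel in trace)
    total = sum(counts.values())
    shares = {cat: n / total for cat, n in counts.items()} if total else {}
    return counts, shares


if __name__ == "__main__":
    # Made-up trace for illustration only; not data from the article.
    demo_trace = ["PAN95", "PAN92", "PAN69", "PAN98"]
    counts, shares = summarize(demo_trace)
    for category, n in counts.items():
        print(f"{category}: {n} ({shares[category]:.0%})")
```

Because the trace is not timestamped, a count of this kind says nothing about the time spent per category, which is the limitation noted in the text.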

Databases DB1 and DB2 were constructed simply for the purposes of testing natural language query systems. They are small but interesting toys, the first on parts and suppliers, the second on programs, modules, and programmers. DB3 is a real database of planning data [7], DB4 is one large table from a real personnel file, and DB5 is the organization telephone book. The numbers in the table give the number of interactions in each category. The subcategory “Current Dictionary” indicates the number of interactions that could be eliminated by use of a large dictionary containing information that could, in principle, be obtainable from existing dictionaries today (i.e., part of speech, synonyms, and proper names but not detailed argument structure or semantic class). Obviously, these numbers would differ from system to system, but given the similarity between systems, it seems unlikely that order of magnitude differences are involved.

TQA Customization    Subject Column    PAN261

Column: SNO    Table: ZS    Verb: supply    Class: Transitive

You assigned 'supply' to the class of transitive verbs, having the question patterns:

What Xs supply/supplied Y?
What Ys are/were supplied by X?

Identify below the column(s) that can correspond to the subject or agent X:

Column    Table
SNO       ZS
sname     ZS

Fig. 3. Request for argument reference.

TQA Customization    Verb Form Verification    PAN28

Column: SNO    Table: ZS    Verb: supply

The two forms 'supplied' and 'supplied' are assumed to be the past tense and past participle, respectively, of the verb 'supply'. The corresponding forms for the verb 'take' are 'took' (as in 'I took') and 'taken' (as in 'I have taken').

Correct the forms shown below, if necessary, then press ENTER.

supplied    supplied

Fig. 4. Request for morphological information.

As can be seen from Fig. 13, nearly half of the required information can only come from the database expert. At first glance, this is quite discouraging. However, although there are at present no available data, it seems likely that this information is the kind that is easiest for the database expert to supply, as it is directly within his or her field of competence. That is, this information is of a kind which the expert should, in principle, be able to supply easily (i.e., column name equivalents, names for code tables, and the like). One can speculate that the effort in supplying this information is less than is indicated by counting the number of interactions instead of measuring their duration. Even if this is wrong, substantial improvements in query system customization are possible if the other two categories of information can be supplied from general knowledge sources or from inference on existing knowledge sources.

TQA Customization    Grammatical Category Selection    PAN69

Column: SNO    Table: ZS    Sample Value: S3

The values in this column may themselves be English words, or may be abbreviations or codes that can either be used as words or be referred to by corresponding English expressions.

Mark the grammatical category that applies to these words or expressions:

_ 1. Noun
_ 2. Adjective
_ 3. Verb (or form of verb)
_ 4. Both Noun and Verb
_ 5. Other

Fig. 5. Request for part of speech.

TQA Customization    Other Entity Name    PAN751

Column: SNO    Table: ZS    Previous Entity Equivalent: supplier

Processing of the English equivalent 'supplier' for the table entity has been completed. If another name for the entity should be included in the dictionary, enter it below:

provider

Fig. 6. Request for synonyms.

TQA Customization    Key Identification    PAN98

Table: ZS

Select the column or columns representing the key to this table:

x SNO
_ SNAME
_ STATUS
_ CITY

Fig. 7. Request for table key column(s).

TQA Customization    Domain Definition    PAN85

Column: SNO    Table: ZS

Do you wish to define a domain in connection with this column?

_ Yes
x No

Fig. 8. Determining if more columns are over this domain.

TQA Customization    Padding Character Verification    PAN521

Column: CITY    Table: ZS    Sample Value: 'ATHENS  '

The padding character identified for this column is displayed below, enclosed in quotation marks. Please correct it if necessary.

" "

Fig. 9. Determination of database column padding.

TQA Customization    Entity Naming    PAN95

Table: ZS

This is the beginning of customization for a new table -- ZS.

Enter the primary term for the individual entities described by the rows of this table. Use the singular form.

supplier

Fig. 10. Eliciting a unique name for each entity.

TQA Customization    Column Display Heading    PAN92

Column: SNO    Table: ZS

What heading should be displayed for this column in output tables and reports?

supplier number

Fig. 11. Eliciting report headings.

TQA Customization    Cover Term Inclusion    PAN13

Column: SNO    Table: ZS

Do any sets of values from this column have corresponding English cover terms that should be included in the dictionary? (For example, a set of distinct job codes for various secretarial levels could be referred to collectively by the cover term 'secretary'.)

_ Yes
x No

Fig. 12. Determining if there are English expressions for sets of values.

Fig. 13. Proportion of information for customization obtainable from each category of knowledge source.

4.2 Database inference

About 10 percent of the required information is available from the database itself, either in its tables or in the system catalog (data dictionary). Some of this information is directly represented and some can be derived from inferences, both inductive and deductive, on explicitly present data. Consider the sample database in Fig. 14 ([9], p. 97).

[Figure: three tables, titled Supplier Table (ZS), Part Table (ZP), and Shipment Table (ZSP).]

Fig. 14. Suppliers and parts tables [9].

Induction on the character string representations of the values in the supplier number and part number fields allows us to derive recognition rules for the entries in these columns so that they need not be stored in the application dictionary [8]. Similarly, but not illustrated here, suitable programming can determine in most cases what pattern (i.e., year-month-day, month-day-year) is being used for the storage of date values.
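As a rough illustration of this kind of induction, the following Python sketch generalizes a set of sample column values into a regular expression by comparing their runs of letters and digits. It is not the procedure used in TQA [8]; the function names and the sample values are hypothetical, and the generalization strategy (keep a run literally when it is constant, otherwise generalize it to its character class) is an assumption chosen for brevity.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map a character to a regex character class (ASCII letters and digits only)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # anything else is treated as a literal


def runs(value):
    """Split a value into (class, text) runs, e.g. 'S12' -> letter run 'S', digit run '12'."""
    return [(cls, "".join(chars)) for cls, chars in groupby(value, key=char_class)]


def induce_pattern(values):
    """Return a regex matching every sample value, or None if their shapes differ."""
    shapes = [runs(v) for v in values]
    class_seqs = {tuple(cls for cls, _ in shape) for shape in shapes}
    if len(class_seqs) != 1:
        return None  # no single shape; a human would have to decide
    parts = []
    for i, cls in enumerate(next(iter(class_seqs))):
        texts = {shape[i][1] for shape in shapes}
        if len(texts) == 1:
            parts.append(re.escape(texts.pop()))  # constant run: keep it literally
        else:
            lo, hi = min(map(len, texts)), max(map(len, texts))
            quant = f"{{{lo}}}" if lo == hi else f"{{{lo},{hi}}}"
            parts.append(cls + quant)
    return "^" + "".join(parts) + "$"


if __name__ == "__main__":
    # Illustrative supplier numbers in the style of the sample value 'S3'.
    pattern = induce_pattern(["S1", "S2", "S3", "S4", "S5"])
    print(pattern)                          # ^S\d{1}$
    print(bool(re.match(pattern, "S9")))    # True: unseen value is recognized
    print(bool(re.match(pattern, "P1")))    # False: a part number is rejected
```

A rule induced this way recognizes values that were never seen in the samples, which is what allows the entries to stay out of the application dictionary; it can also overgeneralize, so verification by the database expert remains advisable.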

Certain other necessary information can also be obtained by deduction. For example, a rule “KEY,” in some suitable rule language (e.g., PROLOG) might tell us that a set of columns, Col1, is a key for table T1 which was created by C1, if I1 is a unique index over the columns in Col1. The set of columns might have only one member, for tables with a single column key. Single column keys are important because they normally tell us that a table describes some database entity, rather than a relationship between entities. Database management system processing can tell us which other columns in the database have values that are subsets of these key columns. In general, these will be foreign keys ([9], p. 250) and therefore in the same domain as the given key. In some cases, particularly when values are short codes and the same codes occur with different meaning in different columns, this deduction might be wrong, and it should therefore be verified by the database expert. Another rule might tell us that if a table contains a single column key, and there is another column in this table that is in the same domain as the key, the two columns must refer to different concepts. In the case of TQA, information that two columns over the same domain refer to different concepts is important for generating correct English equivalents from SQL expressions. In general, it means that columns, even though they are over the same domain, will have different natural language referring expressions. An example would be a table having columns for employee number and manager number. Both are over the domain of employee numbers, but the concepts and therefore the referring expressions are different.
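The two deductions just described (a unique index implies a key; a column whose values are a subset of a key column is a likely foreign key) can be sketched against a catalog as follows. The article suggests a rule language such as PROLOG; this sketch uses Python with the standard `sqlite3` module instead, purely for illustration. The table and column names follow Figs. 7 and 14, while the rows, the index name `I1`, and the helper function names are hypothetical.

```python
import sqlite3


def unique_index_keys(conn, table):
    """Return the column sets of `table` covered by a unique index (candidate keys)."""
    keys = []
    # PRAGMA index_list rows start with (seq, name, unique, ...).
    for _seq, index_name, is_unique, *_rest in conn.execute(
        f"PRAGMA index_list({table})"
    ).fetchall():
        if is_unique:
            # PRAGMA index_info rows are (seqno, cid, column name).
            cols = [row[2] for row in conn.execute(f"PRAGMA index_info({index_name})")]
            keys.append(tuple(cols))
    return keys


def values_are_subset(conn, table, column, key_table, key_column):
    """True if every non-null value of table.column also occurs in key_table.key_column."""
    sql = (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE {table}.{column} IS NOT NULL AND {table}.{column} NOT IN "
        f"(SELECT {key_table}.{key_column} FROM {key_table})"
    )
    return conn.execute(sql).fetchone()[0] == 0


def build_demo():
    """Build a tiny in-memory database; the rows are made up for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ZS (SNO TEXT, SNAME TEXT, STATUS INTEGER, CITY TEXT);
        CREATE UNIQUE INDEX I1 ON ZS (SNO);
        CREATE TABLE ZSP (SNO TEXT, PNO TEXT, QTY INTEGER);
        INSERT INTO ZS VALUES ('S1', 'Smith', 20, 'London'), ('S2', 'Jones', 10, 'Paris');
        INSERT INTO ZSP VALUES ('S1', 'P1', 300), ('S1', 'P2', 200), ('S2', 'P1', 300);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo()
    keys = unique_index_keys(conn, "ZS")
    print(keys)  # [('SNO',)]
    # A single-column key suggests the table describes an entity.
    print(any(len(k) == 1 for k in keys))
    # ZSP.SNO is a subset of ZS.SNO, so it is a candidate foreign key
    # that should still be confirmed by the database expert.
    print(values_are_subset(conn, "ZSP", "SNO", "ZS", "SNO"))
```

The subset test scans the data, which is the computationally expensive part mentioned earlier, and a positive result is only evidence of a shared domain, not proof.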

4.3 Dictionaries

The exact contents of “suitably complete dictionaries” is not at all easy to determine. Consider what needs to be obtained from such dictionaries for the query application. Let us suppose that we have a table of information on suppliers of parts, as above. We must rely on the database expert to give us the name of the entity that the table describes, “supplier.” However, from this piece of information alone, we want the dictionary to provide closely related words, like the verb “supply,” and more distantly related words like “provide,” “send,” “ship,” and so on. Moreover, for each of these words we need complete information on its argument structure. For example, consider a portion of a possible case-grammar-like entry for “supply,” Fig. 15. (No commitment to a case frame representation is implied by the example. The particular set of semantic and case frame tags is illustrative only. Notice that this information would translate easily into a semantic net representation, a predicate calculus representation, etc. Structures like this seem to be in the spirit of those used by many researchers engaged in lexical research and particularly in the construction of general purpose dictionaries, cf. Boguraev, p. 12 [4]. None of the systems mentioned above uses a dictionary of exactly this form. The intent of such a dictionary would be to provide maximal information, so that any existing or proposed system could extract and reformat some portion of the full entry for its own purposes. This goal might, of course, be unrealizable.)

Although there is obviously more information about the concept “supply” and its related words than is shown here that is necessary for a sophisticated understanding system to have, we already have exceeded what we can hope to find in existing dictionaries. Consider simply the category of “semantic domain” in the previous entry. In a report from one of the most active groups investigating the utility of existing dictionaries, we find: “Out of the 55,000 entries in the dictionary, 18,000 are marked as having specialized subject codes, with an average of 1.3 subject codes per word” ([15], p. 75), where “subject codes” correspond to what were called semantic domain identifiers. Finding the complicated patterns of argument structures that are required is even more problematic ([5], p. 8). This is not to imply that the dictionary researchers are wasting their time; quite to the contrary, whatever can be gathered from existing compilations will be helpful. However, we cannot hope that automatic processing of existing dictionaries will provide all the needed information.

It does seem likely that a very large lexicon could, even at present, be constructed that would provide at least the common parts of speech (i.e., noun, verb, adjective, etc.), related terms and synonyms, identification of proper names, identification of place names, and perhaps some other things, to a reasonably high degree of completeness. There still remains the question, at the moment unexplored, as to how this general information can be specialized to a particular application without introducing unwanted synonyms and selectional patterns, given that there is substantial lack of agreement on the operational definition of even a common term like “synonymy.”

Concept: SUPPLY
    X supplies Y to Z from W
    agent: human/organization
    patient: physical object
    goal: human/organization/place
    [...]: money
    source: place
    semantic domain: commercial, inventory
    . . .

supply
    POS: verb
    synonyms: provide, ship, . . .

supplier
    Concept: supply
    supplier of Y
    of: patient
    role: agent
    POS: common noun
    . . .

supply
    Concept: supply
    supply of Y
    of: patient
    role: patient
    POS: common noun
    . . .

Fig. 15. Example of concept dictionary entry for supply.
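To make the idea of extracting a portion of a full entry concrete, the entry of Fig. 15 can be held in a small data structure from which a particular query system pulls only what it needs. The sketch below is one possible encoding in Python, not a format used by any of the systems discussed; the class names, field names, and helper functions are hypothetical, and only a subset of the slots in Fig. 15 is represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Concept:
    name: str
    pattern: str          # e.g. "X supplies Y to Z from W"
    roles: dict           # case role -> set of admissible semantic classes
    domains: tuple = ()   # semantic domain identifiers


@dataclass
class Word:
    form: str
    pos: str
    concept: Concept
    role: Optional[str] = None   # for nouns: the case role the noun itself fills
    synonyms: tuple = ()


# A subset of the slots shown in Fig. 15.
SUPPLY = Concept(
    name="SUPPLY",
    pattern="X supplies Y to Z from W",
    roles={
        "agent": {"human", "organization"},
        "patient": {"physical object"},
        "goal": {"human", "organization", "place"},
        "source": {"place"},
    },
    domains=("commercial", "inventory"),
)

LEXICON = [
    Word("supply", "verb", SUPPLY, synonyms=("provide", "ship")),
    Word("supplier", "common noun", SUPPLY, role="agent"),
    Word("supply", "common noun", SUPPLY, role="patient"),
]


def admissible(concept, role, semantic_class):
    """Selectional restriction check: may `semantic_class` fill `role` of `concept`?"""
    return semantic_class in concept.roles.get(role, set())


def nouns_for_role(lexicon, concept, role):
    """Nouns of `concept` that name the filler of `role` (e.g. 'supplier' for agent)."""
    return [
        w.form
        for w in lexicon
        if w.pos == "common noun" and w.concept is concept and w.role == role
    ]


if __name__ == "__main__":
    print(admissible(SUPPLY, "agent", "organization"))   # True
    print(admissible(SUPPLY, "patient", "place"))        # False
    print(nouns_for_role(LEXICON, SUPPLY, "agent"))      # ['supplier']
```

A system that needs only selectional restrictions would call something like `admissible`, while one that needs related vocabulary would read `synonyms` and the noun entries; the same full entry serves both, which is the theory-neutral use the text has in mind.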

5. DISCUSSION

If dictionary input is not reliably obtainable from database experts, and if large, complex dictionaries are not likely to be available soon, it may be necessary to rethink parts of the design of query systems. One possibility might be to incorporate a learning component into the query system itself. A large amount of the information to come from dictionaries consists of selectional restrictions, Fig. 16.

These numbers are, again, counts of panels seen by the database expert during the cus- tomization process. By this measure, half or more of the information that might come from a dictionary regards selectional restrictions. If the query system dictionary lacks these restrictions, the effect will be increased ambiguity coming from the parser (i.e., more que- ries will have more than one parse). A certain number of input queries to a database sys-

Fig. 16. Percenr of dictionary information for selectional restrictions.

Page 13: Prospects for knowledge-based customization of natural language query systems

Natural language query systems 663

tern are and will be inherently ambiguous (i.e., will have more than one valid interpretation in the database). In the parts and suppliers case, the query “How many parts are there?” has two readings, corresponding to “How many part types are there?”, answerable by counting the number of distinct PNOs in the PART table, and “What is the number of parts?”, corresponding to summing the QTY field in the SP table ([10], p. 18). The TQA system, therefore, has a provision for asking the user which interpretation is meant before a database access is made. (Other systems must have a similar facility.) Consider now the query “What are the locations of parts weighing 10 grams?” In the absence of selectional restriction information, this would also have two readings, corresponding to “locations weighing 10 grams” and “parts weighing 10 grams.” (Johnson [10], p. 14, discusses how TQA uses selectional restrictions to resolve the attachment problem.) If the query system were designed to ask the user to identify not only the desired reading, but also impossible readings, and fed this information back into the lexicon, it should be possible over time to collect the necessary selectional restrictions for the vocabulary actually required by users. (Note that the user must be careful to mark readings that are truly impossible, and not simply presently unwanted. Because of the possibility of mistakes, it would probably be unwise to introduce a selectional restriction on the basis of a single example.) The learning process is obviously not perfectly straightforward. In the previous example, values from both PNO and PNAME are permissible as subjects for “weight.” A single input or small set of inputs might not reveal this. Also, permissibility of certain arguments may be dependent on the presence or absence of other arguments.
Nonetheless, this way of getting at argument patterns does appear to deserve further research, although many questions regarding its efficacy remain open. Being heavily restricted, it should be a more tractable problem than the general language learning problem, for example [3]. Of course, it may be that users will not be willing to spend time on an activity that does not directly and immediately contribute to problem solution.
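
The feedback loop proposed above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not a description of TQA: the class name, the judgment labels, the argument-type labels "part" and "location", and the threshold of three rejections are all hypothetical. The sketch records which readings a user chose or marked impossible, ignores readings that were merely unwanted, and adopts a restriction only after repeated rejections.

```python
from collections import Counter


class RestrictionLearner:
    """Collects user judgments on (predicate, argument type) pairs.

    All names here are hypothetical; the threshold value is an assumption.
    """

    def __init__(self, min_rejections=3):
        self.min_rejections = min_rejections
        self.rejections = Counter()  # pair -> number of "impossible" marks
        self.attested = set()        # pairs that occurred in a chosen reading

    def record(self, predicate, arg_type, judgment):
        """Store one judgment: 'chosen', 'impossible', or 'unwanted'."""
        pair = (predicate, arg_type)
        if judgment == "chosen":
            self.attested.add(pair)
        elif judgment == "impossible":
            self.rejections[pair] += 1
        elif judgment != "unwanted":
            raise ValueError(f"unknown judgment: {judgment!r}")
        # 'unwanted' readings are possible but not meant, so they add no evidence.

    def is_restricted(self, predicate, arg_type):
        """True once the pair has been marked impossible often enough."""
        pair = (predicate, arg_type)
        if pair in self.attested:
            return False  # a chosen reading overrides earlier rejections
        return self.rejections[pair] >= self.min_rejections

    def filter_readings(self, readings):
        """Drop readings whose (predicate, argument type) pair is restricted."""
        return [r for r in readings if not self.is_restricted(*r)]


# Two attachments for "What are the locations of parts weighing 10 grams?"
readings = [("weigh", "location"), ("weigh", "part")]

learner = RestrictionLearner(min_rejections=3)
learner.record("weigh", "part", "chosen")
learner.record("weigh", "location", "impossible")
print(learner.filter_readings(readings))  # one rejection: both readings kept

learner.record("weigh", "location", "impossible")
learner.record("weigh", "location", "impossible")
print(learner.filter_readings(readings))  # three rejections: only ("weigh", "part")
```

The sketch treats each predicate and argument pair independently, so it does not capture the case where the permissibility of one argument depends on the presence or absence of others.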

6. CONCLUSION

From Fig. 13, it is clear that even extensive, sophisticated knowledge sources would still leave a considerable amount of information to be supplied by the database administrator. It is also clear that a major gain in ease of customization could come from the availability of large, rich, theory-neutral lexicons (although considerable work might be necessary to apply such lexicons in individual applications). Hopefully, current research on methods for constructing such dictionaries will be vigorously pursued. Otherwise, projects to improve database access, and many other natural language computer interface systems, will find themselves either tapping grammatical knowledge from linguistically naive users, as at present, or mounting a large effort to provide a system-specific lexicon for each new application, or, in favorable cases, incorporating a learning component into the system delivered to end users.

Acknowledgment-This article has been greatly improved by discussions with Eric Mays, Stanley Petrick, and Warren Plath, and comments of the anonymous referees. Remaining problems are, naturally, my own.

REFERENCES

1. Bates, M.; Moser, M.G.; Stallard, D. The IRUS transportable natural language database interface. In: Kerschberg, Larry, editor. Expert Database Systems, Proceedings from the First International Workshop. Menlo Park, CA: The Benjamin Publishing Co.; 1986; 617-630.
2. Ballard, B.W.; Tinkham, N.L. A phrase-structured framework for transportable natural language processing. Computational Linguistics, 10(2): 81-96; 1984.
3. Berwick, R.C. The Acquisition of Syntactic Knowledge. Cambridge, MA: The MIT Press; 1985.
4. Boguraev, B.K. The definitional power of words. In: Wilks, Yorick, editor. Theoretical Issues in Natural Language Processing-3, Position Papers. Computing Research Laboratory, New Mexico State University, Las Cruces, NM; 1987; 11-15.
5. Byrd, R.J. Dictionary systems for office practice. IBM Research Report RC11872; May 1986.
6. Byrd, R.J.; Calzolari, N.; Chodorow, M.S.; Klavans, J.L.; Neff, M.S.; Rizk, O.A. Tools and methods for computational lexicology. Computational Linguistics, to appear.
7. Damerau, F.J. Operating statistics for the transformational question answering system. American Journal of Computational Linguistics, 7: 30-42; 1981.
8. Damerau, F.J. Problems and some solutions in customization of natural language database front ends. ACM Transactions on Office Information Systems, 3(2): 165-184; 1985.
9. Date, C.J. An Introduction to Database Systems, 4th edition. Reading, MA: Addison-Wesley Publishing Co.; 1986.
10. Johnson, D.E. Design of a robust, portable natural language interface grammar. IBM Research Report RC10867; December 1984.
11. Knuth, D.E. Semantics of context-free languages. Math. Syst. Theory, 2(June): 127-145; 1968.
12. Martin, P.; Appelt, D.; Pereira, F. Transportability and generality in a natural-language interface system. Proc. Eighth International Joint Conference on Artificial Intelligence, 1983 August 8-12; Karlsruhe, West Germany; 573-581; 1983.
13. Petrick, S.R. A recognition procedure for transformational grammars. Ph.D. dissertation. Cambridge, MA: Massachusetts Institute of Technology; 1965.
14. Petrick, S.R. Semantic interpretation in the REQUEST system. In: Computational and Mathematical Linguistics, Proceedings of the International Conference on Computational Linguistics, pp. 585-610. Pisa, Italy; 1977.
15. Walker, D.E.; Amsler, R.A. The use of machine-readable dictionaries in sublanguage analysis. In: Grishman, R.; Kittredge, R., editors. Analyzing Language in Restricted Domains, pp. 69-83. Hillsdale, NJ: Lawrence Erlbaum Associates; 1986.
16. Plath, W.J. String transformations in the REQUEST system. American Journal of Computational Linguistics, Microfiche 8; 1974.
17. Zoeppritz, M. Syntax for German in the User Specialty Languages System. Tuebingen, West Germany: Max Niemeyer Verlag; 1984.