
Text Searching Retrieval of Answer-Sentences and Other Answer-Passages*

Some new text searching retrieval techniques are described which retrieve not documents but sentences from documents and sometimes (on occasions determined by the computer) multi-sentence sequences. Since the goal of the techniques is retrieval of answer-providing documents, "answer-passages" are retrieved. An "answer-passage" is a passage which is either answer-providing or "answer-indicative," i.e., it permits inferring that the document containing it is answer-providing. In most cases answer-sentences, i.e., single-sentence answer-passages, are retrieved. This has great advantages for screening retrieval output.

Two new automatic procedures for measuring closeness of relation between clue words in a sentence are described. One approximates syntactic closeness by counting the number of intervening "syntactic joints" (roughly speaking, prepositions, conjunctions and punctuation marks) between successive clue words. The other measure uses word proximity in a new way. The two measures perform about equally well.

The computer uses "enclosure" and "connector words" for determining when a multi-sentence passage should be retrieved. However, no procedure was found in this study for retrieving multi-paragraph answer-passages, which were the only answer-passages occurring in 6% of the papers.

In a test of the techniques they failed to retrieve two answer-providing documents (7% of those to be retrieved) because of one multi-paragraph answer-passage and one complete failure of clue word selection. For the other answer-providing documents they retrieved at all recall levels with greater precision than SMART, which has produced the best previously reported recall-precision results.

The retrieval questions (mostly from real users) and documents used in this study were from the field of information science. The results of the study are surprisingly good for retrieval in such a "soft science," and it is reasonable to hope that in less "soft" sciences and technologies the techniques described will work even better. On this basis a dissemination and retrieval system of the near future is predicted.

JOHN O'CONNOR

Center for Information Science, Lehigh University, Bethlehem, Pennsylvania 18016

* This work was supported by the National Science Foundation under grant GN-28805.

Introduction

More and more machine-readable bibliographic data bases which include abstracts are becoming available. Moreover, at least one operational retrieval system already searches full texts of documents by computer (1). Such full text retrieval will become more widely feasible as more machine-readable text becomes available as a byproduct of manuscript typing or printing, and as computer and memory costs continue decreasing.

This urgently raises the question: "What are the best techniques to use for retrieval by text searching?"

Various procedures which have been known for more than ten years are widely used, and usually produce, at best, about 50% recall and 50% precision, with higher recall (if available at all) accompanied by lower precision. Several other text searching techniques, such as relevance feedback and statistical association, have been developed during the last ten years, but have not produced better recall-precision results.

Retrieval precision of 50% is less than desirable when the total output contains anywhere from ten to more than 100 documents, as is often the case in operational situations. For example, in the MEDLARS tests by Lancaster, the average number of documents retrieved per request, and at an average precision of 50%, was 150 (8). Further, high recall is sometimes important; e.g., for writing a review, gathering information of certain kinds to test a hypothesis, or determining what has been reported about the health, or environmental safety, of a particular substance or process. In such cases many relevant documents might be retrieved, and the precision of much less than 50%, which now usually accompanies high recall, means that a large number of unwanted documents will also be retrieved. Thus great improvement in precision at both moderate (50%) recall and high recall is important for users.

The study reported here was undertaken to develop new text searching retrieval procedures which will produce much better recall-precision results than do existing text searching techniques, preferably precision of at least 80% for recall of 50% to 100%. To the extent that this was not achieved, the study also undertakes to describe the retrieval failures as a basis for further work.

Procedures

The study was concerned with retrieval of one kind of relevant documents - "answer-providing" documents. Therefore, something will be said here about documents of that kind.

It is commonly said that a document retrieval system should retrieve "relevant" documents. But there are a number of different ways in which a document can be relevant. For instance, a retrieval system user with a definite question might infer an answer to his question from a document, or judge that the document helps toward finding an answer, or have his attention shifted by the document away from his original question. Alternatively, a user might not have a definite question but want to learn "about a subject" or "browse," and thus find a document relevant because he judges it about that subject or he is interested in it. It is commonly assumed that any document retrieval system should be able to retrieve most or all of these different kinds of relevant documents. But better document retrieval systems are more likely to be developed if some research and development for the purpose investigates these different kinds of retrieval separately.

Consider retrieval of the first kind of relevant document. More precisely specified, that retrieval function is the following: for any input question, the output is a set of documents, from each of which an answer to that question can be inferred. Such documents might be called "answer-providing" for the question. An example of an answer-providing document, for the information science question: "What techniques have been used for representing relations between terms in a document index set?" is a document containing the following passage:

In the communicable disease literature, for example, over 3,000 documents have been indexed by means of the telegraphic abstract cum semantic code techniques.

An answer-providing document contains at least one answer-providing passage (long or short), i.e., a passage from which an answer to the input question can be inferred. An example is the passage quoted above. An answer-providing document might also contain at least one "answer-indicative passage," i.e., a passage which is not answer-providing but from which it can be inferred that the document is answer-providing. An example for the question about relational indexing quoted above is the following passage:

In this section we shall describe the methods we have used for expressing relations between descriptors assigned to a document.

A passage which is either answer-providing or answer-indicative will be called an answer-passage. For further discussion of answer-providing documents see (3-6).

A set of question and answer-providing document pairs was available from an earlier investigation (4). There were 18 questions, with an average of 3.7 answer-providing documents each in a corpus of 82 documents. The document corpus was one that had been used in the SMART experiments involving information science materials (6), and six of the questions were also from those SMART experiments. While those six questions were invented, the other twelve questions were real, from Lehigh faculty and graduate students (". . . a question which has come up in the course of your work that you think might be answered by some document in the collection"). The answer-providing documents were found by two information scientists who independently read the corpus exhaustively for every question and then resolved all disagreements by discussion (4). The answer-providing documents in this corpus present an interesting challenge for retrieval, since in about 60% of the answer-providing documents the answer-passages do not include any part of the title, abstract, headings or figures, and in about 30% of the documents the answer-passages are confined to a single paragraph of the main text.

From the 18 available questions nine were randomly selected for use in the "development phase" of the study reported here. The purpose of the development phase was to develop text searching techniques that will produce better recall-precision results than do present methods. The other nine questions were reserved for the "test phase," to test the techniques developed. The development phase questions are those numbered 9, 12, 16, 17, 24, 25, 28, 29, 30 in Appendix B of (4). There were a total of 37 answer-providing documents for these nine questions. For the test phase questions (10, 13, 14, 15, 19, 21, 23, 26, 27) there were a total of 29 answer-providing documents.

Only man-machine text searching techniques were considered in the study, specifically those in which the human participation consists of constructing a search formulation, including selection of "clue words," for input to the computer (as opposed to inputting the natural language question as in fully automatic text searching). This has the advantage that a thesaurus for computer use does not have to be constructed. It should be noted that human construction of a search formulation for a particular question amounts to developing those parts of a thesaurus for machine use needed for that question, including, perhaps, adapting them to the context of the question. All search formulations in the study were constructed by the author.

The investigation was concerned to develop methods for computer search of document texts - in response to an input search formulation - which output not whole documents (or references to them), but particular passages from documents. Retrieval of document passages has two advantages over retrieval of whole documents as units: 1) for a correctly retrieved document it presents the user directly with a document passage of interest to him; and 2) for a falsely retrieved document it permits quick rejection of the false retrieval since the particular passage or passages which led the computer to retrieve it are immediately presented for screening. It might be noted that it appears much more feasible to retrieve document passages by text searching than by retrieval which uses prior human indexing of documents; for the latter would require that an indexer assign each term not simply to a document but rather to particular passages in the document, which might greatly increase indexing costs. Passage retrieval has already been the subject of some experimentation (7,8) and is used in the operational text searching retrieval system LITE (1). This study was intended to advance the state-of-the-art of passage retrieval.

The retrieval methods developed produce a ranked list of passages as a retrieval output. An answer-providing document can be said to be retrieved at the point in the ranked list at which the first answer-passage from that document appears. Any non-answer-passage from any document which is ranked higher can be said to be "falsely retrieved." However, a non-answer-passage from an answer-providing document A is only counted as a false retrieval (with respect to any lower ranked answer-providing document B) if it appears above every answer-passage from document A, since once an output screener has identified document A as answer-providing he need examine no other passages from A for retrieval purposes.

Development Phase

ANSWER-SENTENCES AND MULTI-SENTENCE ANSWER-PASSAGES

A retrieval output needs to be screened by the user or a surrogate to distinguish the correct from the incorrect retrievals. When the output consists of document passages these should be the shortest possible, consistent with such screening, in order to save human screening time (and possibly computer time as well). For this reason techniques were developed which retrieve single sentences in most cases, only retrieving multi-sentence passages when the constituent sentences are computer-determined to be "connected" (by techniques described later in this article).

This development was helped by the fact that in 80% of the cases an answer-providing document contained at least one answer-passage consisting of a single sentence. About half of these sentences were answer-providing and the other half were answer-indicative. Such sentences will be called "answer-sentences." Of the answer-indicative sentences, 70% either occurred in paragraphs which were answer-providing or else were titles or abstract sentences of papers containing many answer-providing passages. Thus an answer-indicative sentence would not only permit identifying a document as answer-providing, but would usually lead readily to an answer-providing passage in the document.

SEARCH FORMULATIONS

A search formulation for a retrieval question was humanly constructed, as noted earlier, using the procedures and aids described below.

If a question contained negative or redundant phrases these were removed. For example, the question, "How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests?", was transformed into: "How can data be retrieved automatically?". For each specific content word in the question (e.g., "data," "retrieved" and "automatically") a list of "clue words" (words, stems and phrases) was selected. The search formulator also specified an "obligatory match" condition, stating for which clue word lists a sentence must contain matches in order to be retrieved. For example, for the automatic data retrieval question a sentence had to contain a "data" clue word and a "retrieval" clue word or else a "data" clue word and an "automatic" clue word.
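As an illustration of this kind of search formulation, the following is a minimal Python sketch (not part of the original study, whose procedures were partly manual) of how clue word lists and an obligatory match condition might be represented and tested against a sentence. The clue word lists shown are invented for the example.

import re

# Illustrative clue word lists for "How can data be retrieved automatically?"
search_formulation = {
    "data":      ["data", "fact", "information", "structur"],
    "retrieval": ["retriev", "search", "request"],
    "automatic": ["automatic", "computer", "program", "machine"],
}

# Obligatory match: a "data" clue word together with either a "retrieval"
# or an "automatic" clue word.
obligatory_match = [("data", "retrieval"), ("data", "automatic")]

def matched_lists(sentence, formulation):
    """Return the names of the clue word lists matched by the sentence (stem match)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {name for name, clues in formulation.items()
            if any(w.startswith(c) for c in clues for w in words)}

def satisfies(sentence, formulation, obligatory):
    hits = matched_lists(sentence, formulation)
    return any(all(name in hits for name in combo) for combo in obligatory)

print(satisfies("The facts are retrieved automatically by the program.",
                search_formulation, obligatory_match))    # True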

Several aids to selecting clue words were used. These are described in this section.

Early in the development phase the author scanned an alphabetical frequency list of all the word types occurring in the document corpus and assigned each word to one or more of the following categories:


information science
computers
subjects applied in information science, e.g., mathematics, linguistics
subjects to which information science is applied, e.g., chemistry, medicine, etc.
time-terms, e.g., age, annual
value-terms, e.g., achieve, advantage
not otherwise categorized, e.g., abandon, above-mentioned; most proper names (but not, e.g., Boole)

To select clue words for a given question only words in the categories judged appropriate were examined. For example, for the question, "How can data be retrieved automatically?", the information science word list was scanned for "data" and "retrieved" and the computer word list for "automatically." However, for the question word, "diagnosis," in, "What future is there for automatic medical diagnosis?", no categories were appropriate. Rather than scan the four thousand entries in the "not otherwise categorized" list, thesauri and glossaries were used to find clue words for this question word, in ways described a bit later. The same thing was true of the question words "drawing" and "structures" in the question, "What methods are there for encoding, automatically matching and/or automatically drawing structures extended in two dimensions, like the structural formulas for chemical compounds?"

For two answer-providing documents a clue word necessary for high-ranked retrieval was on a scanned list but had not been selected as a clue word. Both cases involved the stem "structure. . ." as a clue word for "data" in the question on automatic data retrieval, i.e., "structures" such as chemical structures are a kind of data. Each of the two documents was answer-providing for the question only if the latter could be understood to accept answers prefaced by, "for the special case of data which are structures" (see 4, p. 314).

For four other answer-providing documents a clue word needed for high-ranked retrieval was not on the lists scanned. In three of these cases the clue word was a phrase, for example, "process text" for the question word "automatically," and "information center" for the question word "retrieval." A phrase could not appear in a categorized word list because those lists were selected from a list of single words. In the remaining case, for the question, "What criteria have been developed or suggested for the evaluation of information retrieval and/or dissemination systems?," the answer-sentence was:

There have always been in the information market the multitudes who demand an immediate direct answer to what is, at least superficially, a simple question.

And the needed clue word “demand” (for the question phrase, “evaluation criteria”) was not included in the search formulation. The “evaluation criteria”-“demand” relationship is that a demand for a certain kind of response rather than another implies use of a criterion for the value of systems which produce responses.

One other answer-providing document was low-ranked in retrieval because the search formulation lacked a necessary clue word. In this case, for the question about retrieval evaluation criteria, the only answer-sentence in the document was: "The basic problem of information retrieval is the representation of knowledge in a form which can be matched by subsequent representations of enquiries." From the sentence can be inferred the retrieval system evaluation criterion that a retrieval system should represent "knowledge" (in storing it) so it can be matched by subsequent representation of retrieval requests. It does not appear that a clue word approach could satisfactorily retrieve this passage, or the document containing it.

Because selection of clue words by scanning categorized word lists was not expected to work perfectly, some thesauri were also used to find clue words. Both information science thesauri and more general ones were employed; they are listed in Appendix A. However, in no case was a needed clue word not obtained from the categorized word list found by means of a thesaurus. In addition, if thesaurus clue words alone had been used in search formulations, seven answer-passages, and the corresponding documents which were high-rank retrieved by clue words selected from the categorized lists, would not have been retrieved. Some examples of necessary clue words not found by thesaurus are (question word followed by clue word in each case): weighting (of search terms)-levels, meaning (of index terms)-scope, retrieval-questions. On the other hand, thesauri led to necessary clue words for "diagnosis," "drawing" and "structures," for which there were no appropriate categorized word lists to scan. Thus the thesauri were a useful, though incomplete, supplement to the categorized word lists, but could not replace them without significant loss.

A number of information science glossaries and several general dictionaries were also used as a source of clue words; they are listed in Appendix B. A question word for which clue terms were to be found was looked up in the glossary or dictionary. Each word or phrase beginning with that question word (or its right truncated stem) was found and its definition(s) scanned for clue words. For example, the needed clue word "pattern" for the question word "structure" was found from the glossary entry for "structural information," "Specifying the number of independently variable features or degrees of freedom of a pattern." (Appendix B, Casey). As another example, the necessary clue word "analy . . ." for the question word "diagnosis" was obtained from the following dictionary definition of "diagnosis": "2a. an analysis of the nature of something" (Appendix B, American Heritage Dictionary).

As these examples suggest, the glossaries and dictionaries were a satisfactory supplement to the categorized word lists for those question words for which there were no appropriate categorized word lists to scan. However, they did not otherwise lead to any needed clue words not found by means of the categorized word lists. In addition, if the clue words found by means of the glossaries and dictionaries had been the only ones used in search formulations, 12 answer-providing documents would not have been retrieved that were retrieved by means of the categorized word list clue words. These 12 documents are the seven that would be missed by the thesaurus clue words alone and five others. (Examples of needed question word-clue word matches involved in the latter are: (retrieval) query-request, data-content, and medical-electrocardiograms.) Thus the glossaries and dictionaries were, like the thesauri, a useful though incomplete supplement to the categorized word lists but could not replace them without significant loss.

After a clue word was selected, its contexts in a concordance to the document corpus were examined. Specifically, the word immediately to the left of the clue word and the word to its immediate right were examined. This is essentially equivalent to the use of a "compact KWIC," described by Choueka et al., as an aid to constructing search formulations for text searching retrieval (8). This procedure led to discovery of contexts in which a normally good clue word was not a good clue word. For example, "program" was a good clue word for "automatic," but not in such contexts as "graduate program," "educational program" or "television program." As another example, "fact" was a good clue word for "data" in the question about automatic data retrieval but not in the context, "in fact." When a poor context, such as "graduate program," was found for a good clue word such as "program," to the clue word "program" in the search formulation was added the restriction, "not 'graduate program.'" This procedure reduced false retrievals by about 35%, at no cost in correct retrieval.
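A small sketch of how such context restrictions might be applied automatically is given below (the restriction lists shown are illustrative, not the ones actually used in the study): a clue word occurrence is ignored when its immediate left or right neighbor matches a recorded bad context.

# Illustrative context restrictions on clue words, of the kind found by
# scanning a concordance: "program" does not count as a clue word for
# "automatic" in contexts such as "graduate program".
BAD_CONTEXTS = {
    "program": {"left": {"graduate", "educational", "television"}, "right": set()},
    "fact":    {"left": {"in"}, "right": set()},
}

def clue_word_occurrences(words, clue):
    """Yield positions where the clue word occurs outside its known bad contexts."""
    for i, w in enumerate(words):
        if not w.startswith(clue):
            continue
        bad = BAD_CONTEXTS.get(clue, {"left": set(), "right": set()})
        left = words[i - 1] if i > 0 else ""
        right = words[i + 1] if i + 1 < len(words) else ""
        if left in bad["left"] or right in bad["right"]:
            continue
        yield i

words = "in fact the graduate program stores facts".split()
print(list(clue_word_occurrences(words, "program")))   # [] - only a bad context
print(list(clue_word_occurrences(words, "fact")))      # [6] - "in fact" at position 1 is excluded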

Some question words occurred in more than one question (neglecting inflectional variations), for example, "retrieve" and "automatic." In each of these cases an independent clue word list was formed each time the word occurred in a question. However, if the clue word lists developed for first occurrences of question words had been used for subsequent occurrences of those question words in later questions, instead of independently developed clue word lists, then 70% of the documents high-rank retrieved by the independently developed clue word lists would have been high-rank retrieved. As an example, if the clue word list for "retrieved" in, "How can data be retrieved automatically?", had been used for "retrieval" in the question about retrieval system evaluation, the necessary question word-clue word match, "retrieval"-"request," would have been missed. However, if the question words "query," "search" and "retrieve" were treated as equivalent for purposes of counting recurring question words, and similarly for "automatic" and "computer," then the 70% figure above would be changed to 85%. As an illustration, referring to the "retrieval"-"request" example mentioned above, "request" had been a clue word for the question word "search" in a question about weighting search terms. Using clue word lists from earlier questions in this way did not increase false retrieval.

About 25% of the false retrievals were caused by ambiguous clue words. For example, for the question, "What future is there for automatic medical diagnosis?," a false retrieval was the sentence, "The Department of Pathology and the Department of Physics of the University of Texas M.D. Anderson Hospital and Tumor Institute desired literature retrieval programs in the fields of radiobiology, radiation physics, . . ." The only clue word for "automatic" in this sentence is "programs," which in this sentence does not mean computer programs. As another example, for the question, "How can data be retrieved automatically?," a false retrieval was the sentence, "These citations are retrieved to answer search questions. . . ." The only clue word for "data" in this sentence was "answer." However, in this sentence "answer" means simply "respond to," not "answer" in the sense of a question-answering system as opposed to a document retrieval system. Most of the false retrievals that would otherwise have been caused by ambiguous clue words were prevented by the concordance scan for bad contexts of clue words which was described earlier.

MEASURING CLUE WORD RELATIONS BY COUNTING SYNTACTIC JOINTS

In this section and the next only retrieval of answer-sentences will be considered. Multi-sentence answer-passages will be discussed later.

For any question the search formulation input to the computer is a set of clue word lists and an obligatory match specification. The simplest requirement for a sentence to be retrieved is that it contain clue words satisfying the search formulation. However, this produced too many false retrievals from the document corpus, as expected. Various word-proximity measures of relation among clue words were tried, but none worked well. That a syntactic measure of relation would work better was suggested by such examples as the following, for the question, "What procedures have been used or suggested for weighting (W) terms (t) in search (S) specifications?" (clue words and obligatory matches indicated by lower and upper-case letters respectively):

(answer-sentence) This permits searching (S) the collection under either or both levels (W) of any descriptor (t).

(non-answer-sentence) When you have questions (S) involving many descriptors (t), or when the descriptors (t) have many abstract numbers (W) posted under them, . . .

Syntactic analysis of natural language by computer is usually expensive. However the following syntactic measure of relation should be relatively modest in computer cost: measure closeness of relation between successive clue words in a sentence by the number of intervening “syntactic joints,” where syntactic joints are prepositions, conjunctions and punctuation marks.

Some prepositions and conjunctions were omitted from the syntactic joint list because of their frequent use with other meanings, e.g., "like" because of its frequent use as a verb, "that" because of its frequent use as a determiner. "And," "or" and "either" were omitted from the conjunction list because of the frequent occurrence of such constructions as "scientists and engineers," "tape or print." A phrasal preposition, e.g., "in addition to," counted as one syntactic joint. Colons were not included in the punctuation marks counted as syntactic joints because a colon used to introduce a list functions more as a "breath pause" than as a separation of concepts. The complete list of syntactic joints is given in Appendix C.

Several refinements of the syntactic joint measure were used.

1. Material within parentheses, and the parentheses, may be (but need not be) skipped in determining syntactic joint distance between clue words.

2. One pair of commas or dashes and the material between them may be (but need not be) skipped in determining syntactic joint distance between clue words. This was to make some allowance for nested expressions set off by commas.

3. A conjunction in the first seven words of a sentence, and any immediately adjacent commas, were not counted as syntactic joints, e.g., in "The search terms, however, . . ."

4. A concatenation of syntactic joints, e.g., a comma immediately followed by a conjunction, counted as only one syntactic joint.
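To make the measure concrete, the following Python sketch counts syntactic joints between two clue word positions in a tokenized sentence. The preposition and conjunction sets shown are abbreviated stand-ins for the full list of Appendix C; refinement 4 above (a run of adjacent joints counts once) is implemented, while the optional skipping of refinements 1 and 2 is left out.

PREPOSITIONS = {"of", "in", "on", "by", "with", "from", "under", "between"}
CONJUNCTIONS = {"but", "because", "although", "when", "while", "if"}
PUNCTUATION  = {",", ";", "(", ")"}          # colons deliberately excluded

def is_joint(token):
    t = token.lower()
    return t in PREPOSITIONS or t in CONJUNCTIONS or t in PUNCTUATION

def joints_between(tokens, i, j):
    """Count syntactic joints between token positions i < j; a concatenation
    of adjacent joints counts as a single joint (refinement 4)."""
    count, in_run = 0, False
    for t in tokens[i + 1: j]:
        if is_joint(t):
            if not in_run:
                count += 1
            in_run = True
        else:
            in_run = False
    return count

tokens = ("This permits searching the collection under either "
          "or both levels of any descriptor".split())
# Clue word positions: searching (2), levels (9), descriptor (12).
print(joints_between(tokens, 2, 9), joints_between(tokens, 9, 12))    # 1 1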

Using the syntactic joint measure of relation between clue words, the following procedure was developed for assigning a score to a sentence which contains clue words satisfying the search formulation:

1. Find in the sentence the longest sequence (if any) of clue words which both satisfies the search formulation and has at most one syntactic joint between each successive pair of clue words. Clue words in any part of the sentence skipped in accordance with preceding conditions 1 or 2 may not be included in the sequence.

2. On the basis of properties of this sequence assign to the sentence a score with the following components:

2.1. A "sentence bonus" S if the sequence covers all of the sentence except, at most, a total of four words at the beginning and end of the sentence; in that case the whole sentence is the sequence.

2.2. The total number of clue words in the sequence.

2.3. The ratio of number of clue words to number of all words in the sequence. (In counting for (2.2) and (2.3), the clue words and other words in any part of the sentence which were skipped in accordance with preceding conditions 1 or 2 are counted.)

The score for a sentence is that of its highest score sequence, which will be called "the score sequence" of the sentence. In representing the score for a sentence it is convenient to write not the ratio (2.3) but rather the denominator of that ratio. For example, the score for the answer-sentence, "This permits searching the collection under either or both levels of any descriptor.", would be written S,3,13. In addition, if there is no sentence bonus score, nothing is written to represent that fact.

One answer-providing document in the development phase contained no answer-sentence which would be assigned a score by the procedure described in the preceding paragraph, but did contain an answer-sentence which satisfied the search formulation. For the question, "What future is there for automatic (A) medical (M) diagnosis (D)?," with an obligatory match to M and (A or D), the answer-sentence was, "A Medical (M) and Health (M) Related Sciences Thesaurus has been compiled as an indexing guide and entered on computer (A) for updating and periodic print-outs." The sentence is an answer-sentence on one interpretation of the question (see 4, p. 314). To allow for the possibility of successive needed clue words being separated by more than one syntactic joint, as in this example, the scoring procedure described in the preceding paragraph was extended in the following way. Let Ji = one less than the number of syntactic joints between the ith and (i+1)st clue words in a sequence, and let Ji = 0 if there are no syntactic joints between the two clue words. Let J, the sum of all Ji's for a sequence, be included in the score of that sequence. A sequence of the kind described in the preceding paragraph has J = 0, and this is indicated by simply not including a J value in the score written for that sequence. The score of the answer-sentence quoted above is J=1,3,17.
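Continuing the earlier sketch (and reusing joints_between and tokens from it), the following Python fragment scores one given clue word sequence: it computes J, the sentence bonus S, the number of clue words, and the clue word ratio. The search for the best-scoring sequence satisfying the formulation is omitted here, so the clue word positions are simply supplied; the treatment of the bonus case (the whole sentence counts as the sequence) follows the description above, and the rank_key function anticipates the ranking rule described in the next paragraph.

def score_sequence(tokens, positions, bonus_margin=4):
    """Score the clue words at the given token positions: (J, S, clue count, clue ratio)."""
    J = 0
    for a, b in zip(positions, positions[1:]):
        # Ji is one less than the joint count, and 0 when there are no joints.
        J += max(0, joints_between(tokens, a, b) - 1)
    uncovered = positions[0] + (len(tokens) - 1 - positions[-1])
    S = 1 if uncovered <= bonus_margin else 0
    seq_words = len(tokens) if S else positions[-1] - positions[0] + 1
    return (J, S, len(positions), len(positions) / seq_words)

print(score_sequence(tokens, [2, 9, 12]))
# (0, 1, 3, 0.23...): in the paper's notation, S,3,13

def rank_key(score):
    """Ordering for retrieval output: ascending J, then descending S,
    clue word count and clue word ratio."""
    J, S, n, ratio = score
    return (J, -S, -n, -ratio)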

By means of the scoring procedure described, sentences can be ranked for retrieval output. Sentences with J = 0 precede those with J = 1, which precede those with J = 2, etc. Ties in J value are broken in favor of sentences with the sentence bonus S. Ties in J and S values are broken by number of clue words in the sequence. Ties in all the preceding values are broken by the ratio of number of clue words to number of all words in the sequence. As an example, consider the answer-sentence and the non-answer-sentence quoted several paragraphs earlier. Since they have scores of S,3,13 and J=1,4,12 respectively, the former is ranked much above the latter.

The scoring and ranking procedures described above can be extended to sentences which do not satisfy the search formulation but do contain some clue words, in the following way. Rank below all sentences which satisfy the search formulation, first all sentences which lack one obligatory match, then all sentences which lack two obligatory matches, etc. (if the search formulation specifies more than two obligatory matches). Sentences which lack the same number of obligatory matches are ranked by the scores of those sequences in them which come the closest to satisfying the obligatory matches. Answer-sentences which do not satisfy search formulations, such as the seven described in the previous section which lacked some necessary clue words, can be included in the ranked passage output by this procedure. In particular, by this ranking procedure those sentences were each outranked by from 300-900 falsely retrieved sentences. The score for such a sentence will be written, e.g., not-R,3,9, meaning that the sentence lacks the obligatory match to an R term, and the best scoring sequence which otherwise satisfies the search formulation has a score of 3,9.

It should be noted that the syntactic joint measure and scoring procedure described in the preceding paragraphs have not been programmed, because they evolved throughout most of the investigation. Their performance was determined by manually simulated computer processing of the sentences satisfying search formulations. The latter were found and their clue words marked by computer.

Only 25% of the false retrievals were caused by ambiguous clue words, as noted in the previous section. The other 75% of false retrievals were caused by incorrect relations of clue words within a sentence. An example, for the question, "What procedures have been used or suggested for weighting (W) terms (t) in search (S) specifications?," is the sentence, "There are applications, however, where retrieval (S) systems using a modest number (W) of subject terms (t) will prove valuable." In such a case the clue words are incorrectly related within the sentence in the sense that an answer to the question, or a statement that the document is answer-providing, cannot be inferred from the sentence. The variety of relations among clue words within a sentence which do permit such inferences is illustrated by the answer-sentences quoted earlier, by examples in (3, Appendix A) and by the following pairs of questions and answer-sentences:

What future is there for automatic medical diagnosis? - In recent years the trend in modern medical information handling projects has been toward rapidly expanding experimental uses of computers as follows: . . . 4. in "pilot" or "demonstration" studies pointing toward eventual actual computer assisted diagnosis.

How can data be retrieved automatically? - The object of Project LOGOS is to construct English-like languages exemplifying these devices but supplied with an algorithm which will translate them back into ordinary logical symbolism for use in machine deduction or other forms of content retrieval.

What criteria have been developed or suggested for the evaluation of information retrieval and/or dissemination systems? - (document sentence) How may the effectiveness of an information retrieval system be measured?

Further prevention of false retrieval caused by incorrect relations of clue words requires analysis of the variety of kinds of correct clue word relationships. Such a study might better be pursued with questions and documents from a less "soft" science than information science, and perhaps with questions of a more restricted kind. The author is currently conducting such an investigation as part of a study of text searching retrieval of biomedical answer-passages in response to questions about cardiovascular effects of drugs.

It might be noted that more refined syntactic analysis is not the answer to the problem discussed in the preceding paragraph. For example, for the question, "What procedures have been used or suggested for weighting (W) terms (t) in search (S) specifications?," consider the falsely retrieved sentence, "To allow for this deviation, our dissemination program is geared to P factors based on the number (W) of terms (t) in the profile (S) . . ." The clue words in this sentence are very closely related syntactically, but they are incorrectly related.

On the other hand, some further refinement of syntactic analysis can help in some cases. For example, for the question, "What future is there for automatic medical diagnosis?," consider the answer-sentence, "A Medical (M) and Health (M) Related Sciences Thesaurus has been compiled as an indexing guide and entered on computer (A) for updating and periodic print-outs." The syntactic joint measure assigns a J = 1 rather than J = 0 score to this sentence, failing to recognize the syntactic closeness of "has been" and "entered" because of the prepositional phrase, "as an indexing guide," which is dependent on the other component of the compound verb. Such cases and some analogous ones might be satisfactorily handled by adding to the syntactic joint measure the further refinement that a verb cancels the closest preceding syntactic joint if the latter is a preposition. This condition could also help in cases where the subject and the object of a sentence are incorrectly separated by the syntactic joint measure because of prepositional phrases dependent on the subject. How much additional false retrieval would result from this condition would depend partially on how few false identifications of verbs were produced by the verb identification program used. There was not time to investigate the matter in this study.


WORD-PROXIMITY APPROXIMATION TO THE SYNTACTIC JOINT MEASURE

Answer-sentences and non-answer-sentences satisfying search formulations were further studied to try to develop an adequate word-proximity measure of clue word relation, since such a measure might cost still less in computer processing than the syntactic joint procedure.

It was found that, almost always, a successive pair of clue words in the score sequence for a sentence which were separated by, at most, one syntactic joint were also separated by, at most, seven intervening words. The only exceptions were one pair of clue words in each of three different answer-sentences, separated by eight, nine and thirteen words respectively. The thirteen word case was the only answer-sentence with a syntactic joint value J greater than zero (J=1), which was quoted in the previous section (beginning: "A Medical and Health Related Sciences Thesaurus . . ."). The nine word case was, for the weighted search term question, ". . . the requestor (S) may specify that certain of his queries (S) will produce references which are either more or less likely (W) to be relevant to his interests." The eight word case was the sentence quoted earlier which involved the clue word "demand" (section before last, sixth paragraph), with the eight words lying between the R clue words "answer" and "question".

These facts suggested a scoring procedure like that using the syntactic joint measure, except that the Ji score for the distance between the ith and (i+1)st clue words in a sequence is seven less than the number of words between the two clue words, and Ji = 0 if there are fewer than seven words between the two clue words. To avoid confusion, the Ji and J scores calculated in this way will be called Wi and W (for word proximity) scores. The W scores of the three answer-sentences referred to in the preceding paragraph are W = 1, W = 2 and W = 6.
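A sketch of this word-proximity analogue, following the definition just given (Wi is the number of intervening words less seven, with a floor of zero):

def W_score(positions):
    """Sum of Wi over successive clue word positions in a tokenized sentence."""
    W = 0
    for a, b in zip(positions, positions[1:]):
        intervening = b - a - 1
        W += max(0, intervening - 7)
    return W

print(W_score([2, 9, 12]))    # 0: six and two intervening words, both under seven
print(W_score([0, 14]))       # 6: thirteen intervening words, as in the J=1 example above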

Use of this measure instead of the syntactic joint measure did not increase false retrieval on the average (see section on Performance Figures, ninth paragraph), but it did change, somewhat, which sentences were falsely retrieved. An example of a non-answer-sentence with a W = 0 score and a J > 0 score is the sentence quoted in the previous section, second paragraph. An example of a non-answer-sentence with a J = 0 score and W > 0 score is the following, for the weighted search term question: "It is also anticipated that other departments will wish to participate in this information retrieval (S) system and that their specialized terminology (e.g., chemotherapy) can be added (W) with a minimum (W) of effort . . ."

The score of one answer-sentence (actually the sentence in an answer-paragraph which gave the paragraph its score - see next section) was significantly increased by use of the word-proximity measure instead of the syntactic joint measure. The question involved was, "What programmer's (C) techniques have been used or suggested for organizing and/or searching a word (W) dictionary (L) for text-processing by computer (C)?," and the sentence was, "After the text (W) is sorted (C) in alphabetic order, a special program (C) phase compares the remaining words (W) with a special list (L) and eliminates them from the text (W)." By syntactic joint measure the score of this sentence is 4,16, but by word-proximity measure it is S,6,26. This change reduced the number of falsely retrieved sentences ranked above the sentence from 34 to three.

RETRIEVING MULTI-SENTENCE ANSWER-PASSAGES BY ENCLOSURE

Eight of the 37 development phase answer-providing documents contained no answer-sentences, but only multi-sentence answer-passages. In five of the eight cases a multi-sentence answer-passage was contained within a single paragraph. After examination of two of these passages, techniques were developed which retrieved them and two of the other three single-paragraph passages. Those retrieval techniques are described in this section and the next.

One answer-providing paragraph consisted of four sentences, the first and fourth of which satisfied the search formulation, with scores of 3,6 and 4,16. This suggested the following simple procedure for selecting a sequence of sentences within a paragraph. If at least half of the sentences in a sentence sequence within a paragraph satisfy the search formulation with J = 0 (or W = 0) scores, including the first and last sentences of the sequence, and there is no larger sentence sequence in the same paragraph which includes that sequence and has the same property, then retrieve that sequence and assign it the score of its highest score sentence, e.g., 4,16. This procedure will be referred to as retrieving a multi-sentence passage by "enclosure." The sentences in a sequence selected by enclosure which satisfy the search formulation will be called "enclosing" sentences.
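A minimal sketch of the enclosure rule, assuming it has already been determined which sentences of a paragraph satisfy the search formulation with a J = 0 (or W = 0) score; for simplicity the sketch returns the longest qualifying sequence rather than every maximal one.

def enclosed_sequence(satisfies):
    """satisfies: one boolean per sentence of the paragraph.  Returns the
    (start, end) sentence indices of the longest sequence whose first and
    last sentences satisfy the formulation and in which at least half of
    the sentences do, or None if there is no such sequence."""
    n = len(satisfies)
    best = None
    for start in range(n):
        if not satisfies[start]:
            continue
        for end in range(n - 1, start - 1, -1):
            if not satisfies[end]:
                continue
            length = end - start + 1
            if 2 * sum(satisfies[start:end + 1]) >= length:
                if best is None or length > best[1] - best[0] + 1:
                    best = (start, end)
                break     # longest qualifying end for this start found
    return best

# The four-sentence paragraph above: sentences 1 and 4 satisfied the
# formulation, so the whole paragraph is selected (two of four is half).
print(enclosed_sequence([True, False, False, True]))    # (0, 3)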

Enclosure retrieved two multi-sentence answer-passages and part of a third (the rest of which was retrieved by a technique described in the next section). Enclosure also increased false retrieval by 5%, because in a few cases it changed false retrieval of a single sentence into false retrieval of a passage containing several sentences.

RETRIEVING MULTI-SENTENCE ANSWER-PASSAGES BY CONNECTOR WORDS

There are certain kinds of words whose occurrence early in a sentence is good evidence of close connection with a preceding sentence or sentences. Some examples are "these," "instance," "additional," "accordingly," "however," "at," "on the other hand." As these examples indicate, such words may be pronouns, nouns, adjectives, adverbs, conjunctions, prepositions or phrases.


Lists of such words and phrases were derived from several earlier discussions (9,10,11) and from examination of documents involved in this study. The list and further details are in Appendix D. Such words and phrases will be called "connector words" or "connectors." It should be noted that in Appendix D connectors are grouped as "Prepositions," "Conjunctions," etc., only for convenience of reading. In using these words as connectors the computer only has to recognize their presence, not determine their syntactic roles.

Some connector words are good evidence of inter-sentence connection only if they occur at the beginning of a sentence, e.g., prepositions, while others are good evidence even if they occur a bit later in the sentence, e.g., pronouns. Some crude estimates were made for various words and various classes of words of how early in the sentence a word must occur to be counted as a connector. These estimates were also helped by study of the references cited above. The estimates are given in Appendix D. As examples, a preposition must occur as the first word of a sentence and there must also be a comma, at most, seven words from the beginning of the sentence, ordinal number adjectives such as "second" must occur within the first four words of the sentence, and some words such as "further," "consequently" and "however" must occur within the first seven words of the sentence.

A sequence of sentences within the same paragraph will be said to be "connected" if every sentence in the sequence (except perhaps the first) contains a connector word sufficiently near its beginning (in the sense described above). A connected sequence of sentences can be given the score of its highest-score sentence and thereby assigned a place in the ranked retrieval output. However, use of this procedure would have increased the number of falsely retrieved sentences by 100% because many falsely retrieved sentences were linked to other sentences by connector words. Therefore the following further condition was specified for retrieving a connected sequence of sentences: each sentence in the sequence must contain at least two stem-distinct clue words (e.g., not "index" and "indexer") separated by, at most, one syntactic joint. Such a sentence sequence will be said to be "connected with respect to the search formulation," i.e., with respect to the clue words of the search formulation. With the addition of this requirement, the increase in number of falsely retrieved sentences caused by connector words was changed from 100% to 10%. Requiring a word proximity of, at most, seven intervening words between the clue words rather than a separation by, at most, one syntactic joint had the same effect.
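The two tests involved (a connector word near the sentence start, and two stem-distinct clue words within one syntactic joint of each other) might be sketched as follows. The connector list and position limits shown are abbreviated stand-ins for Appendix D, the preposition-plus-comma rule for sentence-initial prepositions is omitted, and joints_between() is the function from the earlier sketch.

CONNECTOR_LIMITS = {       # connector word -> latest (1-based) position at which it counts
    "these": 7, "this": 7, "however": 7, "further": 7, "consequently": 7,
    "instance": 7, "accordingly": 7, "additional": 7,
    "second": 4, "third": 4,
}

def has_connector(tokens):
    """True if some connector word occurs sufficiently near the sentence start."""
    for pos, tok in enumerate(tokens, start=1):
        limit = CONNECTOR_LIMITS.get(tok.lower())
        if limit is not None and pos <= limit:
            return True
    return False

def connected_wrt_formulation(tokens, clues):
    """clues: list of (token position, stem) pairs for the sentence's clue words.
    True if some pair of stem-distinct clue words is separated by at most one joint."""
    for i, (a, sa) in enumerate(clues):
        for b, sb in clues[i + 1:]:
            if sa != sb and joints_between(tokens, min(a, b), max(a, b)) <= 1:
                return True
    return False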

One multi-sentence answer-passage and part of another were retrieved by retrieving sentence sequences connected with respect to the search formulation. In the latter case the rest of the passage was retrieved by enclosure, as noted in the previous section. Although it was not required for the combined case just referred to, the enclosure technique described in the previous section can be modified in the following way to allow for connected sentences. A sentence counts as an enclosing sentence if it satisfies the search formulation and has a J = 0 score or if it is connected with respect to the search formulation to an enclosing sentence. This modification did not increase the number of falsely retrieved sentences.

FURTHER ON SINGLE-PARAGRAPH ANSWER-PASSAGES

The single-paragraph answer-passage not retrieved by the techniques described in the preceding two sections was an answer-passage for the question, "What criteria have been developed or suggested for the evaluation of information retrieval and/or dissemination systems?" Within that passage, the sentence, "Several criteria are utilized to evaluate these systems," would be an answer-sentence if "these systems" were replaced by "these dissemination systems," which is what the former expression meant in this context. Moreover, there was a computer-determinable relation between "these systems" and the expression, "dissemination of information systems, (SDI-1 and 2)," which occurred eight sentences earlier in the paragraph. Specifically, the only other occurrences of "system" or "systems" between these two expressions were each preceded by "the" or "these." The computer-determinable relation, expressed in general terms, is the following: if an occurrence E2 of an expression is immediately preceded by "the," "this," "these," "that" or "those," then the nearest earlier occurrence (if any) E1 of that expression (discounting plural differences) which is not immediately preceded by one of these words is to be regarded as being part of an antecedent expression to which E2 refers. Use of this idea for passage retrieval requires specifying which part of the context of E1 should be "read into" E2, and this needs some investigation. For instance, in the "systems" example above, "reading in" the context of E1 as far to the left and right as prepositions or punctuation marks would (because of the "of" to the left and the comma to the right of the word "information") fail to "read into" the context of E2 the needed clue word "dissemination" or "SDI". There was not time during this study to investigate the point thoroughly. It is being considered further in the author's current study of text searching retrieval of biomedical answer-passages.
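A sketch of this antecedent relation in isolation (the determiner test and the crude plural handling are as described above; the example text is invented for illustration, not taken from the corpus):

DETERMINERS = {"the", "this", "these", "that", "those"}

def normalize(word):
    w = word.lower()
    return w[:-1] if w.endswith("s") else w      # discount plural differences, crudely

def antecedent_position(words, i):
    """For a word at position i preceded by a determiner above, return the index of
    the nearest earlier occurrence of the same word not so preceded, else None."""
    if i == 0 or words[i - 1].lower() not in DETERMINERS:
        return None                               # not an anaphoric occurrence
    target = normalize(words[i])
    for j in range(i - 1, 0, -1):
        if normalize(words[j]) == target and words[j - 1].lower() not in DETERMINERS:
            return j
    return None

words = ("several dissemination of information systems were built and "
         "these systems were evaluated").split()
print(antecedent_position(words, 9))              # 4, the earlier "systems"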

The answer-passage retrieved by the connected sentences technique warrants some further discussion. It was, for the retrieval evaluation question, the following:

Despite Hensley's (1) warning of implementation difficulties, many organizations have joined in a stampede to develop mechanized Selective Dissemination (R) of Information (SDI(R)) or Information Retrieval (IR(R)) systems. If one assumes that (based on existing techniques and practices) all of these fledgling systems could be designed to produce results of a uniformly high quality (E), it follows that speed and cost (E) become the significant (E) measures (E) of system effectiveness (E).

The words "systems" and "system" in the second sentence of this passage mean dissemination and retrieval systems, and that sentence would be an answer-sentence and would satisfy the search formulation (obligatory matches for E and R) if either of those meanings could be made explicit. The procedure described in the preceding paragraph of this section would be applicable to "these fledgling systems," if "fledgling" could somehow be ignored. This might be satisfactorily accomplished by means of a list of "non-subject-matter" adjectives but the idea needs study. It is not clear how to make "system" in "system effectiveness" more explicit. It should be noted that if "fledgling systems" could be replaced by "dissemination and retrieval systems," the sentence containing that expression would be an answer-sentence, but obligatory clue words in it would not be close, being separated by two syntactic joints and eleven words. On the other hand, if "system effectiveness" could be replaced by "dissemination and retrieval system effectiveness" the sentence would have values of J = 0, W = 0.

As it stands, the second sentence of the quoted passage has a score of not-R,5,14 and is connected to the first sentence (which contains two clue words separated by one syntactic joint, "Retrieval" and "IR") by the sentence connectors "one" and "these." Thus this two-sentence answer-passage is retrieved as a unit. However, because its score (not-R,5,14) is low, it is outranked in the passage output for the question by about 400 falsely retrieved sentences. It should be noted that sequences of two sentences which jointly satisfy a search formulation were studied early in the investigation, but the subject was not pursued because of the problem of a reasonable inter-sentence measure of clue word relation, and because retrieving such two-sentence sequences seemed rarely to be helpful. However, this approach should not be permanently excluded from consideration.

MULTI-PARAGRAPH ANSWER-PASSAGES

There were three multi-paragraph answer-passages. No procedure for retrieving any of them was found during the study.

In one case, a single sentence would be an answer-sentence if the word "terms" in it were "index terms," which is what it meant in that context. However, that sentence would have to be accompanied by some part of the preceding five paragraphs of the document to make this clear.

For the question, "How can data be retrieved automatically?," the first sentence of a multi-paragraph answer-passage described storage of data on punched cards. Two paragraphs later occurred several sentences of which each described data being retrieved. It would be clear to a human reader that the data recorded on punched cards were the data being retrieved, therefore presumably retrieved automatically.

In another answer-providing document for the same question, the title and some later paragraphs of the document described automatic retrieval procedures. A sentence in the second paragraph of the paper indicated that "entities" whose retrieval was being considered included sentences, which can be regarded as "data" in the sense of the question. It would be clear to a human reader that the retrieval of these entities was to be accomplished by the automatic retrieval procedures referred to in the title and described later in the document.

USE OF CITED TITLES

The answer-passage described in the first paragraph of the section before last could be retrieved by an ad hoc procedure. The sentence, which would be an answer-sentence if its expression "these systems" were explicitly "these dissemination systems," cited a reference whose title began "Selective Dissemination. . . ." An ad hoc way of relating the title of the cited reference to the citing sentence is to permit the former to be adjoined at the beginning or the end of the latter (in this case, at the beginning), at a cost of one syntactic joint (or perhaps three words, if word-proximity is used as a measure of relation), and then to treat the combination as a single sentence. This procedure assigned a score of 4,10 to the sentence-reference combination. It caused no additional false retrieval in the development phase, perhaps because the documents in the corpus have few references (an average of 3.5 per paper). The procedure will be called "reference adjacency."

In one answer-providing document a cited title had a much higher score than the answer-sentence citing it. This suggested that each title in a cited reference be given a score, that the score be attributed to any citing sentence (unless the sentence has a higher score by itself or by reference adjacency), and that the retrieval output consist of the citing sentence plus the cited reference. This procedure will be called "reference scoring." It caused no additional false retrieval in the development phase.
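The two ad hoc uses of cited titles can be sketched as follows; this is an illustration added here, not the published program. The scoring interface and the comparison function are assumptions, and only the one-joint (or roughly three-word) charge for adjoining a title is taken from the text above.

# A minimal sketch, assuming a clue-word relation measure supplies scores
# and a comparison "better" that says which of two scores outranks the other.
def reference_adjacency(citing_sentence, cited_title):
    """Treat 'cited title + citing sentence' as a single sentence; the ';'
    supplies the one extra syntactic joint charged for the join."""
    return cited_title + " ; " + citing_sentence

def reference_scoring(sentence_score, cited_title_score, better):
    """Attribute the cited title's score to the citing sentence unless the
    sentence already has the better score by itself (or by adjacency)."""
    if better(cited_title_score, sentence_score):
        return cited_title_score
    return sentence_score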

PERFORMANCE FIGURES

The development phase results suggest the following man-machine retrieval procedures.

1. Humanly construct a search formulation with the aid of categorized word lists, and either thesauri or glossaries and dictionaries in cases where no categorized word lists are appropriate.

2. Automatically score sentences, in relation to the search formulation, using either the syntactic joint or word-proximity measure of clue word relation.

3. Automatically score multi-sentence passages by means of enclosure and connected sentence techniques.

4. Automatically score sentence-reference combinations by reference adjacency and reference scoring.

5. Automatically output passages in descending order of score.
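The five steps above can be pictured as the following pipeline; this is a sketch added for illustration, not the author's system. The helper functions named here (score_sentence, score_passages, score_citations, rank_key) are assumptions standing in for the procedures described elsewhere in the paper, and the search formulation of step 1 is assumed to have been built by hand.

# A minimal sketch of the man-machine retrieval procedure, steps 2-5.
def retrieve(sentences, citations, search_formulation,
             score_sentence, score_passages, score_citations, rank_key):
    # Step 2: score every sentence against the search formulation.
    scored = {i: score_sentence(s, search_formulation)
              for i, s in enumerate(sentences)}
    # Step 3: rescore multi-sentence passages (enclosure, connected sentences).
    scored.update(score_passages(sentences, scored, search_formulation))
    # Step 4: rescore sentence-reference combinations
    # (reference adjacency and reference scoring).
    scored.update(score_citations(sentences, citations, scored, search_formulation))
    # Step 5: output passages in descending order of score; rank_key turns the
    # paper's compound scores into something sortable -- another assumption.
    return sorted(scored.items(), key=lambda kv: rank_key(kv[1]), reverse=True)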

Performance figures for these retrieval procedures in the development phase will be given in this section. The options involved in the procedures are to use either thesauri or glossaries as supplements to the categorized word lists, and either syntactic joints or word-proximity. The results for the thesaurus and syntactic joint combination will be given first. It should be noted that all results suppose that the search formulations lack the clue words needed for high-rank retrieval of seven documents, which they are described as lacking in the section on Search Formulations, fifth through seventh paragraphs.

Before presenting the performance figures, the performance measure used will be explained. Suppose there are three answer-providing documents for a question, and in the output from a retrieval procedure the highest-ranked answer-passage from one document is outranked by one falsely retrieved sentence, the highest-ranked answer-passage from the second by nine more falsely retrieved sentences, and the highest-ranked answer-passage from the third by 23 additional falsely retrieved sentences. Then in screening output, a recall of 33% is achieved at a cost of screening one falsely retrieved sentence per answer-providing document retrieved. A recall of 67% is achieved at a cost of screening five falsely retrieved sentences per answer-providing document retrieved. And a recall of 100% costs 11 falsely retrieved sentences per answer-providing document retrieved. By interpolation, the costs for the ten recall levels of 10%, 20%, . . . 100% are (rounded to integers) 1,1,1,2,3,4,5,7,9,11 falsely retrieved sentences per answer-providing document retrieved. If screening costs at the ten recall levels have been calculated in this way for each of a set of retrieval questions and a particular retrieval procedure, the screening costs for the different questions at each recall level can then be averaged, to give average screening costs at recall levels of 10%, 20%, . . . 100% for that set of questions and that retrieval procedure.
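The screening-cost measure just described can be computed as in the following sketch, which is an illustration added here rather than the paper's own program. Simple linear interpolation between the achieved recall points is assumed; the paper's exact rounding convention is not spelled out, so the rounded values may differ slightly from the published row.

# A minimal sketch of the screening-cost measure, under the assumptions above.
def screening_costs(false_before_each_answer_doc, levels=range(10, 101, 10)):
    """false_before_each_answer_doc[i] = falsely retrieved sentences ranked
    above the (i+1)-th answer document's best passage, beyond those already
    ranked above earlier answer documents."""
    n = len(false_before_each_answer_doc)
    cumulative, points = 0, []
    for i, extra in enumerate(false_before_each_answer_doc, start=1):
        cumulative += extra
        recall = 100.0 * i / n
        points.append((recall, cumulative / i))   # cost per answer doc retrieved
    costs = []
    for level in levels:
        # the cost at the first achieved recall point applies to all lower levels
        prev_r, prev_c = 0.0, points[0][1]
        for r, c in points:
            if level <= r:
                frac = (level - prev_r) / (r - prev_r)
                costs.append(round(prev_c + frac * (c - prev_c)))
                break
            prev_r, prev_c = r, c
    return costs

# The worked example above: false counts of 1, 9 and 23 give
# [1, 1, 1, 2, 3, 4, 6, 7, 9, 11], agreeing with the text except for
# rounding at one level.
print(screening_costs([1, 9, 23]))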

For seven development phase questions, excluding temporarily the two questions for which some answer-passages were not retrieved (see Multi-Paragraph Answer-Passages), the average screening costs at recall levels of 10%, 20%, . . . 100% were 5,5,5,5,7,11,14,16,18,19 falsely retrieved sentences per answer-providing document retrieved.

The best recall-precision results so far reported for document retrieval by text searching appear to be those of SMART, including average precision values running from 90% to 40% for recall levels of 10% to 100% (12). Therefore, to provide a comparison of the passage-retrieval results described above with document retrieval results, searches of the document corpus for the development phase questions were run on the SMART system.* That SMART retrieval option was used which in SMART experiments on this corpus had produced the best recall-precision results, i.e., at recall levels of 10%, 20%, . . . 100%, precision values of 90%, 80%, 70%, 70%, 60%, 50%, 50%, 40%, 40%, 40%, 30%. That option involves some human pre-search modification of the question (e.g., removing negative phrases) followed by SMART automatic searching of full documents using a thesaurus (13). For the seven development phase questions discussed above the SMART recall-precision results were the following: for recall levels of 10%, 20%, . . . 100%, precision values of 50%, 50%, 50%, 50%, 40%, 40%, 40%, 40%, 30%, 30%. The SMART performance can also be described in another way which makes it more directly comparable to the passage retrieval performance described above. For recall levels of 10%, 20%, . . . 100%, the average screening costs for SMART retrieval were 3,3,3,4,5,5,5,5,5,5 falsely retrieved documents per answer-providing document retrieved. Comparing these latter results with those given for passage retrieval in the preceding paragraph, it can be seen that, depending on recall level, passage retrieval requires screening one to four falsely retrieved sentences for every falsely retrieved document that would have to be screened if document retrieval were used. Thus, for the seven questions under discussion, passage retrieval is superior in terms of screening costs at every recall level.
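The "one to four sentences per falsely retrieved document" comparison is simply the ratio of the two screening-cost series quoted above; the following lines (an added illustration, not from the paper) make the arithmetic explicit.

# Ratio of passage-retrieval to SMART screening costs at the ten recall levels.
passage = [5, 5, 5, 5, 7, 11, 14, 16, 18, 19]   # false sentences per answer doc
smart = [3, 3, 3, 4, 5, 5, 5, 5, 5, 5]          # false documents per answer doc
ratios = [round(p / s, 1) for p, s in zip(passage, smart)]
print(ratios)   # ranges from about 1.2 to 3.8, i.e. one to four sentences per document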

However it should be noted that for one question, the question about retrieval evaluation, at recall levels of 60% to 100% the screening cost per answer-providing document retrieved ranged from 40 to 80 falsely retrieved sentences, while for SMART document retrieval the cost ranged from two to four falsely retrieved documents. If the average retrieved document (containing about 70 sentences) can be screened for being correctly or falsely retrieved by examining fewer than 20 sentences, then for this question document retrieval performed better than passage retrieval. Most of the inadequacies of passage retrieval for this question were caused by clue word problems (see section on Search Formulation, sixth-seventh paragraphs). In a less "soft" science than information science clue words might be more tractable.

The two development phase questions not included in the results given above only permitted retrieval up to 70% recall, because they involved the unretrieved multi-paragraph answer-passages (see section on Multi-Paragraph Answer-Passages). For these two questions, at recall levels of 10%, 20%, . . . 70%, passage retrieval costs were 8,9,10,10,27,51,47 falsely retrieved sentences per answer-providing document retrieved. The costs of SMART document retrieval for these two questions at recall levels of 10%, 20%, . . . 100% were 4,3,4,6,5,5,5,5,5,6 falsely retrieved documents per answer-providing document retrieved.

* The author wishes to thank Gerard Salton, Robert Crawford, and Barbara Galaska of the Department of Computer Science, Cornell University, for their cooperation in providing these SMART search results.


Thus at recall levels of 10%-40%, passage retrieval clearly costs less in false retrieval, since it requires screening two to three falsely retrieved sentences for every falsely retrieved document that would have to be screened if document retrieval were used. At 50%-70% recall, passage retrieval is less costly in false retrieval if it costs less to screen five to ten falsely retrieved sentences than one falsely retrieved document. (The decline in passage retrieval "precision" performance from 10%-40% recall to 50%-70% recall was caused by clue word problems.) At recall levels of 80%-100% the SMART document retrieval was superior, of course, for these two questions, since it retrieved documents that the passage retrieval did not, because techniques for multi-paragraph answer-passages were not developed in the present study.

For completeness it should be noted that the passage retrieval for all nine development phase questions at recall levels of 10%, 20%, . . . 70% retrieved at costs of 6,6,6,6,11,20,22 falsely retrieved sentences per answer-providing document retrieved, while SMART retrieved at costs of 3,3,3,4,5,5,5 falsely retrieved documents per answer-providing document retrieved. Thus for these questions and recall values, passage retrieval requires screening two to five falsely retrieved sentences for every falsely retrieved document that would have to be screened if document retrieval were used.

If word-proximity instead of syntactic joints had been used as a measure of clue word relation, the figures given above would be changed in the following ways. For the seven questions permitting 100% recall, the average screening costs at recall levels of 10%, 20%, . . . 100% were 1,1,1,1,3,8,12,15,18,22 falsely retrieved sentences per answer-providing document retrieved. For all nine questions the average screening costs at recall levels of 10%, 20%, . . . 70% were 2,3,3,3,9,19,21 falsely retrieved sentences per answer-providing document retrieved. Both these sets of figures are much better than those for syntactic joints at recall levels of 10%-40%. The reason is that the one answer-providing document for which the word-proximity score was much higher than the syntactic joint score (see section on Word-Proximity Approximation to the Syntactic-Joint Measure, last paragraph) was the only answer-providing document for that question, so that the results for that question were much improved at all recall levels, while the answer-providing documents ranked lower for several other questions by word-proximity affected the results only at higher recall levels. However, even at higher recall levels the average screening costs for the word-proximity measure were better than, or approximated, the costs for the syntactic joint measure.

If glossaries and dictionaries were used, instead of thesauri, to search for clue words in cases where no categorized word lists were appropriate, the results given above would not be changed.

Test Phase

PERFORMANCE FIGURES

The retrieval procedures summarily described in the previous section were applied to the test phase questions (except that it was unnecessary to supplement the categorized word lists with use of thesauri or glossaries for any question). The results are described below.

When the syntactic joint measure of clue word relation was used, for the nine test phase questions the average screening costs at recall levels of 10%, 20%, . . . 100% were 10,9,9,10,12,13,16,23,-,- falsely retrieved sentences per answer-providing document retrieved. The -'s for 90% and 100% recall were caused by an unretrieved multi-paragraph answer-passage for one question, which limited recall for that question to 80%, and an unretrieved answer-sentence (no clue words in the search formulation) for another question, which limited recall for that question to 80%. The screening costs for eight questions and for seven questions were respectively 11,10,10,11,13,15,18,25,35,- and 10,10,10,11,12,14,19,27,39,49 falsely retrieved sentences per answer-providing document retrieved.

When the word-proximity measure of clue word relation was used instead of the syntactic-joint measure, the results were approximately the same, with word-proximity screening costs about 10% less at recalls of 80%-100%, and about 10% more at lower recall levels. One answer-sentence with a much higher score by word-proximity than by syntactic joints caused the first of these effects, and a general tendency of word-proximity to permit a bit more false retrieval than syntactic joints caused the second.

SMART searches were also made for the test phase questions to provide a comparison between document retrieval and passage retrieval. For all nine test phase questions, for recall levels of 10%, 20%, . . . 100%, the average screening costs for SMART retrieval were 5,5,4,4,5,6,7,7,7,7 falsely retrieved documents per answer-providing document retrieved. Thus, for recall levels of 10%-80%, passage retrieval required screening two to three falsely retrieved sentences for every falsely retrieved document that would have to be screened if document retrieval were used. Consequently, at those recall levels passage retrieval was superior to document retrieval in screening costs. However, at recall levels of 90% and 100%, the passage retrieval techniques so far developed did not retrieve for one and for two questions respectively.

For the seven questions for which the passage retrieval techniques retrieved at recall levels up to 100%, SMART's screening costs were 5,5,6,6,6,7,8,8,8,7. Thus for those questions passage retrieval required screening two to seven falsely retrieved sentences for every falsely retrieved document that would have to be screened if document retrieval were used. Finally, for the eight questions for which the passage retrieval procedures retrieved up to 90% recall, SMART's screening costs were 5,5,5,6,6,7,8,8,8,7. For this question set and the two kinds of retrieval the ratio of falsely retrieved sentences to falsely retrieved documents ranged from two to three.

SPECIFIC RESULTS

As noted in the section above, one answer-sentence was unretrieved because it contained no clue word in the search formulation. The question involved was, "If given a reference to an article published in Russian, how can I find whether this article has been translated and/or republished in English, and if so, where?" The answer-sentence was, "The first objective is met by three semimonthly abstract journals . . . and International Aerospace Abstracts. . . ." This sentence is an answer-sentence for the question if the latter can be understood as accepting an answer prefaced by, "for the special case of an article covered in International Aerospace Abstracts" (see 4, p. 314). The word "International" in this sentence is a possible clue word for the question word "translated," and was on one of the categorized word lists scanned, "Foreign languages and geographic regions," but was not selected.

Another answer-passage was outranked by about 600 falsely retrieved sentences because the second of its two sentences lacked an obligatory clue word. The question involved was, "What possibilities are there for verbal communication between computers and humans, that is, communication via the spoken word?" The answer-passage was the following:

Another interesting possibility is audio interrogation and reply, involving telephone lines and tape recorders. Of course, any combination of these modes is possible -- a store might be interrogated orally and a reply obtained visually, etc.

If in this passage the ambiguous word "store" is understood to mean computer store, then the passage is answer-providing for the question (see 4, pp. 314-5). The search formulation specified as obligatory the presence of a clue word for "computer," but the only possible clue word in the second sentence of the passage is "store," which is not a very reasonable clue word for searching a document collection much concerned with storage (not necessarily in computer files) of information. Thus the score of the second sentence of the passage was not-C,1,1 (for "oral"). The first sentence of the passage is connected to the second by means of "Of" and "these" in the second sentence and the clue words "audio" and "telephone" in the first.

One other answer-sentence lacked an obligatory match because the clue word, "reword," for the question word "negotiating" (of retrieval requests) was not in the search formulation. It was not in the search formulation because it was not on the categorized word list for "Information Science" which had been scanned. The score of the answer-sentence was not-N,5,13. However, since the total frequency of all N ("negotiating") clue words was about 100, only 23 sentences outranked it by satisfying the search formulation and 11 more by otherwise having a higher score.

Some question words, such as "index," "automatic," "weighting," had already occurred in development phase questions. If in each such case the union of the development phase clue word lists for that question word had been used in the test phase (treating "automatic" and "computer" as equivalent, and "search," "query" and "retrieval" as equivalent), the test phase results described in the previous section would be unchanged. This suggests that a transition can be made, for recurring question words, from man-machine text searching to fully automatic text searching.
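The clue-word reuse idea in the preceding paragraph might be organized as in the following sketch, which is an added illustration rather than a procedure from the study. The store, the equivalence table, and all function names here are assumptions.

# A minimal sketch: hand-built clue word lists are accumulated per question
# word (with a few words treated as equivalent), so that a recurring question
# word can be expanded automatically in later questions.
EQUIVALENT = {"computer": "automatic", "query": "search", "retrieval": "search"}

def canonical(question_word):
    return EQUIVALENT.get(question_word, question_word)

def accumulate(clue_word_store, question_word, clue_words):
    """Union a hand-built clue word list into the store."""
    clue_word_store.setdefault(canonical(question_word), set()).update(clue_words)

def expand(clue_word_store, question_word):
    """Automatically supply clue words for a recurring question word."""
    return clue_word_store.get(canonical(question_word), set())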

Concerning clue word relation measures, also as noted in the last section, one answer-sentence had a much higher score by word-proximity than by syntactic joints. The question involved was, “What procedures have been used or suggested for weighting (W) terms (t) in index (N) sets for documents?” The answer-sentence was the following:

The role (N) indicator, delta (or asterisk on the 407 listing), affixed to a keyword (N) indicates that that keyword (N) is not included for retrieval as a significant (W) keyword (N), hence should not be permuted, and is included only for cross-reference or coordinative keyword (N) purposes.

The syntactic joint score of this sentence was 3,16, with the score sequence beginning at "significant." However, the word-proximity score for the sentence was 4,14, with the score sequence beginning at the first occurrence of "keyword" in the sentence. The syntactic joint score would be 5,28 if the interrupting prepositional phrase, "for retrieval," could be skipped, but it is not clear by what general procedure this could be done.
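The two clue-word relation measures being compared here can be sketched as follows; this is an added illustration, not the study's program, and it does not reproduce the compound scores quoted above. The tiny joint set, the tokenization, and the treatment of punctuation are simplifications, and the tokens are adapted from the quoted example sentence.

# A minimal sketch: count intervening syntactic joints (prepositions,
# conjunctions, punctuation; Appendix C) versus intervening words
# between two clue-word positions.
JOINTS = {"of", "to", "for", "in", "on", "by", "as", "at", "with",
          ",", ";", "(", ")"}

def joints_between(tokens, i, j):
    """Number of syntactic joints strictly between token positions i and j."""
    return sum(1 for t in tokens[i + 1:j] if t.lower() in JOINTS)

def words_between(tokens, i, j):
    """Number of intervening words (the word-proximity measure)."""
    return sum(1 for t in tokens[i + 1:j] if t.isalnum())

tokens = ("the role indicator , affixed to a keyword , indicates that "
          "that keyword is not included for retrieval as a "
          "significant keyword").split()
i = tokens.index("keyword")        # an earlier clue word occurrence
j = tokens.index("significant")    # a later clue word
print(joints_between(tokens, i, j), words_between(tokens, i, j))   # prints: 3 11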

Another answer-sentence had a W=0 score and a J=1 score, although the total clue word frequencies involved were so low that this difference in score did not make for a significant difference in output rank. The answer-sentence was, for the Russian translation question quoted at the beginning of this section, the following:

Another group of four important Soviet (R) physics journals, also available in complete translation (T), was deliberately excluded. . .

(This sentence is an answer-sentence only if the question will accept an answer beginning, "For the special case of an article in a Russian physics journal published in complete English translation" (see 4, p. 314).)

Two other answer-sentences had J=1 scores and also had W=1 and W=3 scores.

One multi-paragraph answer passage was not retrieved, as noted in the previous section. The question involved was, "What techniques have been used or suggested for the representation in index sets for documents of relations between terms?" One paragraph of the passage described some techniques for representing relations between words, and another passage six paragraphs earlier indicated that the words were document index terms.

Four multi-sentence answer passages were retrieved, three by connected sentences and one by a combination of connected sentences and enclosure.

Discussion

The results described above are surprisingly good, considering the undisciplined language, style and inference patterns in information science. It is reasonable to hope that in less "soft" sciences and technologies the retrieval techniques described will work even better.

This permits predicting what a dissemination and retrieval system for such fields will be in the near future. Its files will consist of:

One or more externally supplied machine-readable bibliographic data bases which include abstracts, such as Engineering Index, Air Pollution Abstracts, Water Resources Abstracts, Excerpta Medica, INSPEC, etc.;

Descriptions of current research and development, such as the Smithsonian Science Information Exchange files (available in machine-readable form);

Descriptions of "who knows what," such as the Library of Congress Directories of Information Resources in the United States (available in machine-readable form).

A retrieval request to the system will be in question form, for output of answer-passages from the files. Retrieval requests in topic form can also be accepted (each specific content word in the topic request being represented by a clue word list), for output of "topic passages" (not a well-defined concept).

Search formulations will be constructed off-line and/or on-line, by the user and/or by a surrogate. (First results in the author’s current study of text searching retrieval of biomedical answer-passages are that a surrogate with only modest biomedical knowledge can use a medical dictionary as both a source of information and a guide to terminology, and construct high-recall, good-precision search formulations.)

Retrieval in a broad sense consists of two phases: 1) request negotiation = obtaining a stated retrieval request which best represents the subject interest ("information need") of the user; and 2) searching = finding records which satisfy the stated request (e.g., answer-passages for a stated request). The techniques described in this paper, and all text searching and index searching techniques, are "searching" procedures in this sense. Any negotiation technique can be attached to the front end of a searching technique. In particular, the retrieval system of the near future being described here may include as its front end whatever negotiation procedures are locally preferred.

The output of the retrieval system will consist of sentences, and some multi-sentence answer-passages. Some of the latter may simply take the form of single sentences slightly expanded by the computer ("augmented sentences"). As an example from the author's biomedical passage retrieval study, a sentence would be an answer-sentence if it specified that its results were for humans, and the computer can augment the sentence with the annotation "in humans" because it finds humans and no other species (e.g., rats, dogs, etc.) named in the title, summary or figures of the paper.
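The "augmented sentence" idea in the biomedical example above might look like the following sketch, added here for illustration only. The species vocabulary, the plain substring test, and the annotation format are all assumptions, not details of the author's study.

# A minimal sketch: annotate a sentence with "in humans" only when humans,
# and no other species, are named in the title or summary of the paper.
SPECIES = {"humans", "rats", "dogs", "mice"}

def augment(sentence, title, summary):
    text = (title + " " + summary).lower()
    named = {s for s in SPECIES if s in text}
    if named == {"humans"}:                      # humans and no other species
        return sentence.rstrip(".") + " [in humans]."
    return sentence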

Though abstracts are short compared to full papers, the output of a single sentence from an abstract, rather than a five or six sentence abstract as a unit, will be a great improvement because it will directly present the user with information of interest to him, e.g., methodological specifics in the fifth sentence of an abstract, and will also permit the most efficient screening out of false retrievals. The advantages of this selective power will be even greater for retrieval of sentences from documents ten sentences and more long, such as the descriptions in the Smithsonian Science Information Exchange files and the Library of Congress Directories of Information Resources.

Retrieval output will be on-line and/or off-line. If it is on-line, the viewer will have the option of inspecting the context of any passage (by CRT or by coordinated microform). The source of a passage (bibliographic reference plus location by page and line numbers) might accompany it automatically or be available for viewing as an option. If output is off-line, each passage will be accompanied by its source and (depending on the economics of text outputting) some portion of its context. Inspection of the context is important when the answer-sentence is answer-indicative, and may be desired for other reasons (e.g., to see what methodology gave the results) when the answer-sentence is answer-providing.

References

1. Judge Advocate General Law Review, 14 (No. 1): (Winter). Special issue on LITE (Legal Information Thru Electronics) (1972).

2. LANCASTER, F., Evaluation of the MEDLARS Demand Search Service, National Library of Medicine, Bethesda, Maryland (1968).

3. O'CONNOR, J., "Answer-Providing Documents: Some Inference Descriptions and Text-Searching Retrieval Results," Journal of the American Society for Information Science, 21 (No. 6): 406-414 (1970).

4. - , "Some Independent Agreements and Resolved Disagreements About Answer-Providing Documents," American Documentation, 20 (No. 4): 311-319 (1969).

5. - , "Retrieval of Answer-Providing Documents," American Documentation, 19 (No. 4): 381-386 (1968).

6. SALTON, G., Automatic Information Organization and Retrieval, New York, McGraw-Hill, pp. 316-318 (1968).


7. SWANSON, D., "Research Procedures for Automatic Indexing," in Machine Indexing, Washington, D.C., American University, pp. 281-304 (1962).

8. CHOUEKA, Y. et al., "Full Text Document Retrieval: Hebrew Legal Texts," Proceedings of the Symposium on Information Storage and Retrieval, New York, Association for Computing Machinery, pp. 61-79 (1971).

9. OLNEY, J., Some Patterns Observed in the Contextual Specialization of Word Senses, Report TM-1393, Santa Monica, California, System Development Corporation (1963).

10. WATERHOUSE, V., "Independent and Dependent Sentences," International Journal of American Linguistics, 29 (No. 1): 45-54 (1963).

11. RUSH, J. et al., "Automatic Abstracting and Indexing. II. Production of Indicative Abstracts by Application of Contextual Inference and Syntactic Coherence Criteria," Journal of the American Society for Information Science, 22 (No. 4): 260-274 (1971).

12. SALTON, G. (Ed.), Information Storage and Retrieval, Report No. ISR-13, Dept. of Computer Science, Cornell University, Ithaca, New York. Appendix A, p. 17 (1967).

13. KEEN, E., "An Analysis of the Documentation Requests," in Information Storage and Retrieval, Report No. ISR-13, Ed. by G. Salton, Dept. of Computer Science, Cornell University, Ithaca, New York. Section X (1967).

Appendix A: Thesauri

1. KEEN, E., et al. Appendices L-N of Report of an Information Science Index Languages Test. Dept. of Information Retrieval Studies, College of Librarianship Wales, Aberystwyth, 1972.

2. SCHULTZ, CLAIRE. Thesaurus of Information Science Terminology, revised edition, Communication Service Corporation, Washington, D.C., 1968.

3. ADI Thesaurus used in SMART experiments. Copy provided by G. Salton.

4. Thesaurus of ERIC Descriptors, second edition, Washington, D.C., Office of Education, 1969.

5. Thesaurus of Engineering and Scientific Terms. Washington, D.C., U.S. Dept. of Defense, 1967.

6. Roget’s International Thesaurus, New York, Crowell, 1962.

Appendix B: Glossaries and Dictionaries

1. CASEY, F. Compilation of Terms in Information Sciences Technology, Federal Council for Science and Technology, Washington, D.C., PB 193 346, 1970.

2. STOLK, H. Glossary of Documentation Terms, Advisory Group for Aerospace Research and Development, NATO, n.d.

3. EVANS, R. Tutorial Glossary of Documentation Terms, American Documentation Institute, Washington, D.C., 1966.

4. KOROTKIN, A. et al. Indexing Aids, Procedures and Devices, General Electric, Glossary, pp. 66-74, 1965.

5. IBM. Index Organization for Information Retrieval, IBM, White Plains, New York, Glossary, pp. 45-59, 1961.

6. THOMPSON, D. Glossary of STINFO Terminology, Office of Aerospace Research, Washington, D.C., 1964.

7. WAGNER, F., "A Dictionary of Documentation Terms," American Documentation, 11: 102-119 (1960).

8. TAUBE, M. and H. WOOSTER (Eds.). Information Storage and Retrieval, New York, Columbia University Press, Terminological Standards, pp. 1-16. 1958.

9. PERRY, J. and A. KENT (Eds.). Documentation and Information Retrieval, New York, Reinhold, Glossary, pp. 136-160. 1957.

10. MACK, J. and R. TAYLOR. "A System of Documentation Terminology," in Documentation in Action, ed. by J. Shera, New York, Reinhold, pp. 15-28. 1958.

11. American Heritage Dictionary of the English Language. New York, Houghton Mifflin, 1969.

12. American College Dictionary, New York, Random House, 1962.

Appendix C: Syntactic Joints

PREPOSITIONS about, across, after, against, along, among, around, as, at, before, behind, below, beneath, beside, besides, between, beyond, by, concerning, considering, despite, down, during, except, for, from, in, inside, into, of, off, on, onto, out, over, per (not per cent), regarding, since, through, throughout, to, toward, under, until, up, upon, via, with, within, without

PHRASAL PREPOSITIONS

as against, as between, as compared with, as distinct from, as distinguished from, as far as, as far back as, as for, as opposed to, as to, at the cost of, at the hands of, at the instance of, at the peril of, at the point of, at the risk of, because of, beyond the reach of, by dint of, by (the) help of, by means of, by order of, by reason of, by the aid of, by virtue of, by way of, face to face with, for fear of, for lack of, for the benefit of, for the ends of, for the purpose of, for the sake of, for want of, from among, from behind, from below, from beneath, from between, from beyond, from in front of, from lack of, from off, from out (of), from over, from under, hand in hand with, in accordance with, in addition to, in advance of, in agreement with, in back of, in behalf of, in the interest of, in between, in case of, in common with, in company with, in comparison with (to), in compliance with, in conflict with, in conformity with, in consequence of, in consideration of, in contrast with (to), in course of, in default of, in disregard of, in (the) face of, in favor of, in front of, in fulfillment of, in lieu of, in obedience to, in opposition to, in order that, in place of, in point of, in preference to, in process of, in proportion to, in pursuance of, in pursuit (quest) of, in re (concerning), in recognition of, in reference to, in regard to, in relation to, in respect to (of), in reply to, in return for, in spite of, in support of, in the case of, in the event of, in the matter of, in the middle (midst) of, in the name of, in the presence of, in the place of, in the way of, in token of, in view of, inside of, on account of, on behalf of, on pain of, on the face of, on the occasion of, on the part of, on the point of, on the pretense of, on the score of, on the side of, on the strength of, on (the) top of, out of, outside of, out of regard for, out of respect for, over against, side by side with, so far as, so far from, through lack of, to the order of, under cover of, under pain of, up against, with a view to, with an eye to, with reference to, with respect to, with regard to, with the exception of, with the intention of, with the object of, with the purpose of, with the view of, within reach of, without regard to

CONJUNCTIONS

after, although, as, because, before, but, for, however, if, neither, nor, since, so, than, though, unless, when, whenever, where, whereas, whereby, wherever, whether, while

PUNCTUATION MARKS

semicolon, comma, open and close parentheses, braces, brackets, dashes

Appendix D: Connector Words

The number following a word indicates within how many words of the beginning of a sentence it must occur to be a connector. If this number is the same for all words in a group it is only given at the beginning of the group.

PREPOSITIONS

All 1. In addition, a comma must occur at most seven words from the beginning of the sentence. about, across, against, along, among, around, at, behind, below, beneath, beside, besides, between, beyond, by, concerning, considering, despite, down, during, except, from, in, inside, into, of, off, on, onto, out, over, per (not per cent), regarding, through, throughout, to, toward, under, until, up, upon, via, with, within, without

CONJUNCTIONS All 1, except “however” which is 7. after, although, and, as, because, before, but, for, however, neither, nor, or, since, so, though, unless, when, where, whereas, whereby, whether, while

PRONOUNS both 7, each 7, either 7, every 4, he 7, her 7, herself 7, him 7, himself 7, it 7 (not “it is necessary”, “it is beside the point”, “it has been determined”, “it” followed by “that” with at most three intervening words), its 7, itself 7, one 4 (not one followed by a noun), other 4, others 4, our 7, she 7, that 1, then 7, themselves 4, these 7, they 7, this 7, those 7, us 7 (not “let us”), what 4, which 4, who 4, whose 4

ADVERBS accordingly 4, actually 7, also 7, anyway 7, clearly, consequently 7, even 4 (not "break even"), finally 7, furthermore 7, hence 7, instead 7, moreover 7, nevertheless 7, next 4, now 4, presently 4, respectively 4, similarly 7, still 7, this 7, therefore 7, thus 7, together 4, too 7, yet 4

ADJECTIVES All 7 above, additional, another, final, former, further, future, latter, resulting, same, similar, such (not "such as", "such a" and "that" with at most three intervening words)

ORDINAL NUMBERS All 4 first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, tenth, eleventh, . . . 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., . . .

NOUNS All 7. This is a very incomplete list. case (not "special case", "upper case", "lower case", "case history", "case report"), example(s), instance(s)

PHRASES All 7. This is a very incomplete list. as a result, by comparison, by contrast, by the time, for example, for illustration, for that reason, if not, in addition, in contrast, in fact, in other words, in question, in that case, in the first place, of course, on the other hand, to that end, under consideration
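The connector-word test defined at the head of this appendix can be sketched as follows; this is an added illustration, not part of the original appendix. Only a few entries from the lists above are included, phrase connectors and the parenthesized exclusions (e.g. "let us", "break even") are omitted, and the function names are assumptions.

# A minimal sketch: a word counts as a connector only if it occurs within its
# stated distance of the start of the sentence; a preposition connector must
# be the first word and a comma must occur within the first seven words.
CONNECTOR_LIMITS = {"however": 7, "these": 7, "this": 7, "one": 4,
                    "that": 1, "first": 4, "thus": 7}
PREPOSITION_CONNECTORS = {"of", "in", "by", "from", "with"}   # all limit 1

def sentence_is_connected(words):
    lowered = [w.lower().strip(",.;") for w in words]
    # prepositions: limit 1, plus a comma somewhere in the first seven words
    if lowered and lowered[0] in PREPOSITION_CONNECTORS:
        if any("," in w for w in words[:7]):
            return True
    # all other connector words: within their individual limits
    for position, word in enumerate(lowered, start=1):
        limit = CONNECTOR_LIMITS.get(word)
        if limit is not None and position <= limit:
            return True
    return False

# e.g. the second sentence of the "audio interrogation" passage discussed above:
print(sentence_is_connected(
    "Of course, any combination of these modes is possible.".split()))   # True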
