Information Processing & Management Vol. 18, No. 4, pp. 167-178, 1982. Printed in Great Britain. Pergamon Press Ltd.

THESAURUS-BASED AUTOMATIC BOOK INDEXING

MARTIN DILLON† School of Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, U.S.A.

(Received for publication 16 February 1982)

Abstract--This paper describes a technique for automatic book indexing. The technique requires a dictionary of terms that are to appear in the index, along with all text strings that count as instances of the term. It also requires that the text be in a form suitable for processing by a text formatter. A program searches the text for each occurrence of a term or its associated strings and creates an entry to the index when either is found. The results of the experimental application to a portion of a book text are presented, including measures of precision and recall, with precision giving the ratio of terms correctly assigned in the automatic process to the total assigned, and recall giving the ratio of correct terms automatically assigned to the total number of term assignments according to a human standard. Results indicate that the technique can be applied successfully, especially for texts that employ a technical vocabulary and where there is a premium on indexing exhaustivity.

1. STATEMENT OF PROBLEM

Despite their importance and the substantial effort and cost of their production, book indexes have not often been studied. Can one generalize about the kinds of topics one should include, the size of the indexing vocabulary, or their density per page? Should the index be hierarchical? What measures of effectiveness are appropriate? These questions are rarely asked, nor are there generally accepted answers to them.

A comparison to document retrieval systems where there is extensive experience with patron queries is suggestive. The primary measures of evaluation used there are precision and recall. For a given patron query, precision refers to the ratio of retrieved relevant documents to the total retrieved, recall to the ratio of retrieved relevant documents to the total sought. A major theoretical difficulty in applying these measures experimentally is the concept of relevance. It is well known that individuals vary in their assessments of what documents ought to be retrieved for a given query. Pertinence, an alternate basis for evaluation, takes this variation into account by considering the individual patron's judgement.

When evaluating a book index, a similar distinction would focus on what ought to be retrieved (indexed) by the same term in the index in response to different reader needs. For example, a reader seeking general information about a term or its definition in a text might be satisfied by a single reference--to that portion of the text where the term is defined, assuming there is one. On the other hand, the need might be for specific information on how a term is applied in relation to a second term scattered throughout the text. A typical device in an index is to highlight the former type of occurrence in some way, by boldface page numbers, for example. In a document retrieval system, the second type of query would be translated into a Boolean "AND" extracting documents that deal with both terms. The user of the book index is reduced to the expedient of scanning the text of each reference. Occasionally a book index will be designed to perform the "AND" function for selected terms, by joining one very frequent term with other less frequent terms, usually in a hierarchy. A biography, for example, usually does this with the entry of the person whose biography it is.

Exhibit 1 presents brief extracts from two book indexes. One is from an advanced text in mathematics[1], the other is from a programming language reference manual[2]. They are not typical in that both are superior to the usual book index. The first employs underscoring to highlight sections of the text where the term is given significant coverage; both are arranged hierarchically.

To what extent is it possible to automate the production of book indexes like these? Because of the increasing availability of automatic text processing systems, and particularly

†Associate Professor.


ADVANCED MATHEMATICS TEXT

condition number 94-96, 98, 113
conditional probability 24-25, 359-60
conditioning 91-94, 96-98, 100,
    124, 137, 156, 168,
    224, 373-75, 377-78
  evaluation 96-98
  measure 96
  of matrix 91-94, 96, 373
  See also condition number; nor-
    malized determinant
consistency test 139, 205
constant scope, method of 203,
    211-16, 225

PL/I LANGUAGE REFERENCE MANUAL

DEFAULT precision
  binary fixed-point data 25, 84, 408
  binary floating-point data 27, 84, 408
  decimal fixed-point data 25, 84, 408
  decimal floating-point data 26, 84, 408

DEFAULT statement 55, 84-87, 420-423
  and standard default attributes 85
  attribute specification 85
  conflicting attributes 85
  in attribute processing 83
  not applied to null parameter
    descriptors 87
DEFINED attribute 38, 386-390
  common errors in overlay defining 262
  for variable in CHECK name-list 367

Exhibit 1. Sample book indexes.

those that format texts, it is a useful and timely question to pose. As more and more book texts become available online, it becomes more likely that automatic retrieval operations now used for document collections will be extended to book texts. If book indexing can be automated, this goal will be realised sooner.

In order to explore this question, an exercise in the automatic indexing of a portion of a text was carried out. The text chosen for the exercise was a technical manual used as a reference for a bibliographic processing system [3]. The text was already machine readable and had no index. In the sections that follow, automatic indexing mechanisms will be discussed. Text processing facilities will be described as they relate to the indexing function. Finally, a technique for automatic book indexing will be presented, along with an experiment in its use.

2. APPROACHES TO AUTOMATIC INDEXING

There are three feasible methods for producing an automatic index for a book, with varying degrees of effectiveness and human involvement. The first, and virtually the sole approach in widespread use for automatic indexing generally, is the use of stemming algorithms to group words that are likely to refer to the same topic, combined with statistical criteria to exclude unwanted words. (See VAN RIJSBERGEN[4] for a general treatment.) A simple device, for example, is to include as indexing terms those words or stems that fall within a certain frequency range in the text, and exclude those which do not. This crude approach is fairly effective in document retrieval systems, especially when coupled with feedback devices. There is no reason to suppose that it could not be used on books when the retrieval process is interactive. Because individual words or stems can express relatively few of the concepts in text, it is very unlikely that it could be effective for producing a printed book index. See Exhibit 1 as a typical example where there are almost no terms formed from a single word.
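The frequency-range device just described can be sketched briefly. In this sketch the thresholds and the tokenization are illustrative assumptions, not parameters prescribed by the literature:

```python
import re
from collections import Counter

def frequency_range_terms(text, low=2, high=10):
    """Keep as candidate index terms the words whose frequency in the text
    falls inside [low, high]; thresholds are illustrative and would be
    tuned for a real collection."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {w for w, n in counts.items() if low <= n <= high}
```

Words that occur too rarely or too often fall outside the band and are excluded, which is the whole of the statistical criterion.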

A second method, which is far superior in principle to the use of stems, is to use a dictionary of phrases derived automatically. There is some evidence to suggest that phrases form a better basis for any indexing environment[5], and if an effective technique were available, it would be far preferable to stems. In the near term, however, it is unlikely that a book index could be based on one because of the difficulty in grouping synonymous phrases, although the possibility should be investigated.

A third method, the one explored here, depends far more on human involvement than the first two. It is similar in principle to ARTANDI'S[6] work referred to in BORKO[7]. Briefly, the technique requires that a dictionary be constructed of all terms that are to appear in the index, along with text strings that will count as instances of the term in the text. A program then searches the text for each occurrence of a term or its associated strings and creates an entry to the index when either is found. The advantages and disadvantages of this technique are considered in greater detail below. The major disadvantage is that the degree of human involvement in the indexing process is high. Its major advantage is that it removes all clerical operations from the indexing process, normalizes it to a large degree and makes explicit the relation between the vocabulary of the text and the index. It promises to be an effective intermediate between a fully automated technique and a purely manual one.
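A minimal sketch of this technique follows. The dictionary entry and the matching test (case-insensitive substring search) are simplifying assumptions; the term name is drawn from the exhibits later in the paper:

```python
def index_entries(lines, thesaurus):
    """Scan text lines; whenever a line contains an index term or one of
    its associated instance strings, record an entry (term, line number).
    Case-insensitive substring matching is assumed for brevity."""
    entries = []
    for lineno, line in enumerate(lines, start=1):
        low = line.lower()
        for term, strings in thesaurus.items():
            if any(s.lower() in low for s in [term] + strings):
                entries.append((term, lineno))
    return entries
```

Each entry pairs a main term with the location at which it (or one of its instance strings) was found, which is all the later formatting step needs.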

3. TEXT PROCESSING ENVIRONMENT

A key to understanding the general approach adopted here is a desire to exploit the index

production facilities of the SCRIPT system [8]. SCRIPT is an example of a generally available text formatting system designed to ease the creation of texts in machine readable form. It is highly sophisticated and one of the best of its kind, containing many facilities such as special commands for table creation and the control and printing of headings of chapters, sections, or subsections of the text. Its primary benefit is formatting the pages and numbering them as they are printed. It also contains some programming features which allow users to create commands required for special operations not included in the official command set.

Among the more powerful and useful of the standard SCRIPT facilities is a mechanism for creating an index to a text. Only a minimal understanding of SCRIPT, both in general and in the means it uses to produce an index, is required here. Interested readers should consult the SCRIPT reference manual for a fuller treatment. Exhibits 2, 3 and 4 are meant to convey the general idea.

Exhibit 2 is an example of formatted text. It was produced through SCRIPT from the text lines in Exhibit 3. Lines beginning with "." in Exhibit 3 are SCRIPT command lines; they

Program method

1. SKED cards are edited and placed into a table, in order by input report number code.

2. Each MARC record read is processed against each SKED card that has a matching input report number code (or a code that specifies all reports).

3. If the MARC record input report number code is zero, it is processed against all SKED cards.

4. Variable field information is edited for uniformity:

   - punctuation is translated or removed;

   - multiple blanks are compressed;

   - lower case letters are transformed to upper case.

Sort Key Edit (SKED) cards

(Portion of page 2. See Exhibit 8)

Exhibit 2. Sample page of formatted text.


.us Program method
.sp 2
1. SKED cards are edited and placed into a table,
in order by input report number code.
.sp
2. Each MARC record read is processed against each SKED
card that has a matching input report number code
(or a code that specifies all reports).
.sp
3. If the MARC record input report number code is zero,
it is processed against all SKED cards.
.sp
4. Variable field information is edited for uniformity:
.in 5
.sp
- punctuation is translated or removed;
.sp
- multiple blanks are compressed;
.sp
- lower case letters are transformed to upper case.
.in
.sp 2
.us Sort Key Edit (SKED) cards

Exhibit 3. Sample of SCRIPT input lines.

contain commands that inform the processor how to format the text, or provide other information necessary to format it. For example, ".us" indicates that a line is to be underscored, and ".in 5" is a command to indent the left margin 5 columns for the lines following, until a subsequent ".in" is encountered.

Exhibit 4 duplicates the SCRIPT lines in 3 with indexing instructions inserted. The command ".IX" informs the processor that the information contained in the line is to be interpreted as an indexing term associated with the text line immediately above it. The action taken by the SCRIPT processor when it encounters ".IX" is to save the indexing term along with its location in the text. At the completion of the processing of the text, a set of commands informs the processor to print the index. A rich variety of possibilities is built into the system

.us Program method

.sp 2
1. SKED cards are edited and placed into a table,
in order by input report number code.
.IX Report number code--input.
.sp
2. Each MARC record read is processed against each SKED
.IX MARC record.
.IX Report number code--input.
card that has a matching input report number code (or a
code that specifies all reports).
.sp
3. If the MARC record input report number code is zero,
.IX MARC record.
.IX Report number code--input.
it is processed against all SKED cards.
.sp
4. Variable field information is edited for uniformity:
.IX Variable field data.
.in 5
.sp
- punctuation is translated or removed;
.sp
- multiple blanks are compressed;
.sp
- lower case letters are transformed to upper case.
.IX Upper and lower case conventions.
.in
.sp 2
.us Sort Key Edit (SKED) cards
.IX Sort key.

Exhibit 4. Input lines with index terms inserted.


for processing terms and for formatting them in the index itself. By and large, they are not of interest here. The only significant fact at this point is that normal use of these facilities would require the insertion of the ".IX" lines while the text is being input, placing the burden entirely on the human indexer. The purpose of the automatic indexing approach described here is to cause these lines to be inserted automatically based on an algorithm that matches a human prepared thesaurus and the unindexed text.
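The insertion step that this approach automates can be sketched as follows. Matching here is plain substring search, a simplification of the stop-word-purged matching used in the actual experiment; the thesaurus entry is illustrative:

```python
def insert_ix_lines(script_lines, thesaurus):
    """Append an '.IX term.' command after each text line that matches a
    thesaurus main term or one of its instance strings, leaving existing
    SCRIPT command lines (those starting with '.') untouched."""
    out = []
    for line in script_lines:
        out.append(line)
        if line.startswith("."):      # do not index command lines themselves
            continue
        low = line.lower()
        for term, strings in thesaurus.items():
            if any(s.lower() in low for s in [term] + strings):
                out.append(".IX " + term + ".")
    return out
```

The output is unindexed SCRIPT source with ".IX" lines inserted, exactly the form a human indexer would otherwise have produced by hand.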

3.1 Automatic book indexes

Automatic book indexing can be described as occurring in three steps. The first step requires

an existing machine readable text in appropriate SCRIPT form and its major product is an indexing thesaurus. In this project, the indexing thesaurus was produced using an existing system designed to provide thesaurus support for both manual and machine indexing of literature for automatic retrieval systems (in fact, a part of the system documented in the text to be indexed [3]). It was applied in this context without modification. The system itself is highly complex but only its broad functions need be described in order for its use to be appreciated here.

The output of the thesaurus production step can be understood through Exhibit 5 which gives a selection of records from the thesaurus developed for this project. The record organization is fairly standard--"mtm" indicates a "main term", that is, one that will appear in the index; "usf" indicates a "used for" term, one that will result in a cross reference in the index to the main term; "scp" indicates a scope note for the term. The one non-standard field is labelled "npu" for non-print term, indicating that the term will not appear in any listing of the thesaurus or index. These are matching text strings, which, along with the main term itself, will cause a section of the text to be indexed by the main term.
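Under the record layout shown in Exhibit 5, one thesaurus record could be parsed into a simple structure like the following; the one-field-per-line layout is an assumption about the underlying file:

```python
def parse_thesaurus_record(lines):
    """Collect the fields of one thesaurus record: 'mtm' (main term),
    'npu' (non-print matching strings), 'usf' (used-for terms) and
    'scp' (scope notes); suffixed labels like 'npu/2' repeat a field.
    Other labels (e.g. 'DID', 'lvl') are ignored in this sketch."""
    record = {"mtm": None, "npu": [], "usf": [], "scp": []}
    for line in lines:
        tag, _, value = line.partition(" ")
        tag = tag.split("/")[0]            # npu/2 -> npu, usf/2 -> usf
        if tag == "mtm":
            record["mtm"] = value.strip()
        elif tag in record:
            record[tag].append(value.strip())
    return record
```

The main term and its "npu" strings together form the matching vocabulary; "usf" entries become cross references.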

The major advantage of the indexing philosophy adopted here, and its major disadvantage as well, is the degree of human involvement in the creation of the indexing vocabulary. The book index results from the automatic matching of the thesaurus to the text. The quality of the book index depends on two major factors. First, and beyond the control of the indexing operations, it depends on the degree to which the vocabulary of the text is amenable to a matching process in its indexing. There is, quite obviously, a ceiling in quality based on textual properties for any index achieved in this manner. For some texts, that ceiling might be quite high, where the vocabulary is technical and relatively invariant; for others, the approach may be wholly impractical (unlikely, in my view). Whatever the ceiling, it is no simple matter to produce an indexing thesaurus capable of achieving it. One major objective for the exercise described here is to open these issues to investigation.

The indexing thesaurus was produced by a team of graduate students in library science. Though generally familiar with data processing, and in particular, bibliographic data processing, they had little experience with the text to be indexed. The procedure used in creating the

DID *****T-3290
lvl 10
mtm MARC record
npu MARC-format record
npu/2 MARC format record
npu/3 MARC input record
scp Machine readable cataloging
usf Output record
usf/2 Input record

DID *****T-3270
lvl 10
mtm MARCIN
npu MARC input file
npu/2 MARC file
scp Machine readable cataloging file
usf MARC-format input file

Exhibit 5. Sample listing of thesaurus records.


indexing thesaurus followed more or less standard approaches to the creation of any thesaurus. It begins with the text, from which a working vocabulary is extracted, using various representations of the text as aids including a KWIC. The working vocabulary is reviewed for synonymous entries, thesaurus terms are selected, cross references are established, scope notes created where necessary, etc. No magic formula exists for this work, and it is likely that the effort that went into the construction of the indexing thesaurus at least approximates the effort that would go into constructing an average index. As anyone who has tried to make a book index recognizes, nothing is easy about it, and nothing was easy about the intellectual decisions that went into this process; more will be said of this in the concluding remarks.

The second step in the indexing process is the automatic matching of the indexing thesaurus to the text. Again, the details are not significant. Many different approaches are possible, most hinging on the degree of latitude allowed in what constitutes a match between a thesaurus entry and a text string. Possibilities are: exact character match; matching of multi-word terms by any string in which all the words occur regardless of order; and purging stop words ("of", "the", "this", etc.) from both text and synonymous strings prior to matching. The method used here was the last, which tends to increase recall at the expense of precision. Exhibits 6 and 7 are

Input and output files
.IX SORTFILE.
Input to the program consists of MARC-format records (ddname MARCIN).
.IX MARC record.
.IX MARCIN.
Records may contain prefixed sort keys from a
previous execution of BPSSKED.
.IX BPSSKED.
The output file consists of MARC-format records containing
348-byte leaders.
.IX Leader.

Exhibit 6. Sample listing of text lines with index terms.

MARCIN.

Input to the program consists of MARC-format records (ddname MARCIN).

on each SKED card that has been found in the MARC file.

- MARC format input file (ddname MARCIN)

//MARCIN DD DUMMY,DCB=BLKSIZE=4404

- MARC input file (ddname MARCIN)

The MARC input file is assumed to be in ascending sequence

at the end of each MARC input file record and for each line

MARCREADCT.

null record counts are included only in MARCREADCT and SORTOUT

Exhibit 7. Sample listing of index terms and text lines.


samples of listings produced at this stage of the process; they were designed to aid in the evaluation of the indexing operation. The first gives the text, in text order, each line accompanied by its associated indexing terms. Exhibit 7 is more useful; it is an alphabetical listing of all terms along with all text lines in which the term was found. Both are used to determine if the process is working as it should.
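The stop-word-purging criterion actually used in the matching step can be sketched as follows. The stop-word list is illustrative, and requiring the purged entry to occur as a contiguous run of words is one reasonable reading of "matching"; the paper does not pin this down:

```python
STOP_WORDS = {"of", "the", "this", "a", "an", "and", "in", "for"}  # illustrative

def purge(s):
    """Lower-case a string and drop stop words, as applied to both the
    thesaurus strings and the text prior to matching."""
    return [w for w in s.lower().split() if w not in STOP_WORDS]

def matches(entry, text_line):
    """True if the purged entry occurs as a contiguous run of purged
    words in the line; relative to exact character matching this raises
    recall at the expense of precision."""
    e, t = purge(entry), purge(text_line)
    return any(t[i:i + len(e)] == e for i in range(len(t) - len(e) + 1))
```

So "method of constant scope" matches "The method of constant scope is used" even though the raw character strings differ.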

Steps 1 and 2 above are iterative, in the sense that indexing inaccuracies can be discovered and corrected short of producing the actual index. These include both clerical errors (where terms are input misspelled or are overlooked), and errors of design in the indexing vocabulary. One example of the latter is the inclusion of a term that appears too frequently; often this can only be discovered after Step 2 has been performed.

Step 3, thanks to SCRIPT, is relatively straightforward. Once the unformatted SCRIPT text has the "correct" indexing terms inserted, the system does the rest. The result is exemplified in Exhibits 8 and 9. Exhibit 9 is an alternative form of the index, where the terms are given a hierarchical arrangement.
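Conceptually, what SCRIPT does in this final step is collect each ".IX" line with the page on which its text falls and merge the references; a rough sketch (this is not SCRIPT's actual machinery):

```python
from collections import defaultdict

def build_book_index(pages):
    """pages: a list of formatted pages, each a list of lines. Collect
    '.IX term.' commands and return each term with its sorted page
    references, as the printed index does."""
    refs = defaultdict(set)
    for pageno, lines in enumerate(pages, start=1):
        for line in lines:
            if line.startswith(".IX "):
                refs[line[4:].strip().rstrip(".")].add(pageno)
    return {term: sorted(ps) for term, ps in sorted(refs.items())}
```

Because pagination is only known at this stage, the page numbers attached to each term cannot be anticipated during indexing, a point that matters below.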

Table 1 gives summaries of the process, with entries included for comparison from the two manually created book indexes cited above. It is difficult to draw general conclusions from these figures. The IBM manual is dense in terms (8.99 unique terms per page) indexed infrequently (1.40 times each); the mathematics text has relatively few terms (2.85 per page) indexed frequently (10.20 times each). The automatic indexing effort resulted in fewer terms (2.12 per page) with a frequency of use falling between the two examples (2.08 times each). Overall, indexing frequency was less in the experiment: 4.41 references per page, compared with 29.13 for the mathematics text and 12.58 for the IBM manual. Whether these differences are attributable to deficiencies in the process or simply due to differences in the texts cannot be determined without a firmer basis for evaluation.
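The derived columns in Table 1 follow directly from the totals. As a quick check, the figures for the automatic indexing experiment (100 pages, 441 references, and a term count consistent with the reported 2.12 terms per page) reproduce the reported ratios:

```python
# Recomputing Table 1's derived figures for the automatic indexing row.
pages, terms, refs = 100, 212, 441

terms_per_page = terms / pages   # 2.12 unique terms per page
refs_per_term = refs / terms     # about 2.08 references per term
refs_per_page = refs / pages     # 4.41 references per page
```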

4. DISCUSSION

Only the most general homilies on evaluation are available in the literature on book indexing, and their value is inversely proportional to their generality--"every index should be tailor-made to fit the book",

MARC record, 1-11, 16,
  18-20, 22, 26-27, 32-35
MARCIN, 2, 32, 35-36
MARCREADCT, 33
Mnemonic tag, 32

Normal mode, 1, 21-22
Null record, 33
NULLFILE, 36
Numeric data, 36-37
Numeric tag, 5, 10, 18, 24-25

Exhibit 8. Sample page from SCRIPT book index.

End statement, 20
End-of-file, 18
Execution log
  EXPLODED, 29-30
  ORIG, 29-30
  OUTPUT, 4, 8-9, 25, 29-30
  SKIPPED, 9, 18, 25, 29-30

F
Field address specification, 15
Field specification, 9, 11, 23
Field type identification, 10
Files
  Dummy, 33
  DUMP, 33
  MARCIN, 2, 29, 32-33
  NULLFILE, 31

Exhibit 9. Sample page from hierarchical SCRIPT book index.

Table 1. Term and reference frequencies for example indexes

                      TOTAL   TOTAL   TERMS/   PAGE    REFS/   REFS/
                      PAGES   TERMS   PAGE     REFS    TERM    PAGE

MATH TEXT               493    1408    2.85   14361    10.20   29.13
IBM MANUAL              492    4422    8.99    6190     1.40   12.58
AUTOMATIC INDEXING      100     212    2.12     441     2.08    4.41

"indexes should cover the complete contents of books", "significant items in the text must appear in the index", "an index must be comprehensive", etc.[9]. A recent attempt at evaluating the index of an encyclopedia discusses some of the problems that must be addressed [10].

Because of the absence of an accepted evaluative methodology, only a cursory evaluation was carried out in this project. Ideally, one would like a sample of user "queries" on which to base an evaluation, or more accurately, terms corresponding to such queries, along with references to the text that would serve as responses. These were not available, nor could a satisfactory set be created. See BENNION[10] for a discussion of this point. As an alternative, portions of the text were indexed manually, with the result considered a standard against which the automatic book index could be judged. (Indexers were instructed to create what in their judgement was a perfect index.) By analogy to the evaluative methodology employed with retrieval systems, it is possible to consider this standard as equivalent to a set of user queries, and the text sections associated with each term as "relevant" responses. An automatic assignment of a term is then "correct" if the same term assignment appears in the human standard.

Measures of precision and recall can then be calculated, with precision giving the ratio of terms automatically assigned that are correct to the total assigned, and recall giving the ratio of terms automatically assigned that are correct to the total number of term assignments in the standard. Typical results, derived from a sample of 25 pages of text, are given in Table 2.
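These two definitions reduce to simple ratios. Computed from the counts reported in Table 2 (95 correct automatic assignments, 15 incorrect ones, 161 assignments in the human standard):

```python
# Precision and recall from the Table 2 counts.
correct, incorrect, standard_total = 95, 15, 161

precision = correct / (correct + incorrect)   # 95/110, reported as .86
recall = correct / standard_total             # 95/161, reported as .59
```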

Not surprisingly, at least in retrospect, recall (.59) tends to be pleasingly high: what is recognized as an indexable concept by human indexers can generally be captured by the automatic process. The inference is that terminology in this environment tends to be explicit and adequately dealt with by character string matching operations.

In the absence of a generally accepted method of evaluating a book index, one must be cautious in drawing conclusions from figures like those in Table 2. (One wonders how book indexing has existed these many years without such a method.) Our purpose in evaluation was largely to uncover hidden problems in the process, and in this we were successful. The most serious problem, and the most challenging for future indexing attempts, is the figure for precision. Though we may suppose that an index may do as much damage by failing to cite a significant portion of the text (a failure in recall), somehow it seems less acceptable that an index should cite portions of the text that are not significant. (Again, a question requiring more research.) Errors in precision are endemic to this type of indexing and are due primarily to a lack of selectivity in the matching process. All occurrences of a matching term in the thesaurus

Table 2. Performance measures for indexing source text

                  CORRECT      INCORRECTLY
                  TERMS        INDEXED

INDEXED              95             15

NOT INDEXED          66             --

TOTAL CORRECT TERMS  161

PRECISION = .86     RECALL = .59

receive entries in the index; many such matches were not considered substantive enough by human indexers to deserve an entry. The sort of thing which happens is exemplified by two terms in Exhibit 10. For each term, the first set of lines are acceptable sources of index entries; the second are likely not.

Proper indexing for the term "constant" would capture only those references in the text that dealt with their manipulation by the control language, not such references as occur in the second set, which are general and not likely to be of interest. References for the term "default" are more numerous, both correct and incorrect. Many of the correct references could be captured by restricting the matching string to "default value", as the example implies, but others would certainly be missed. (It should be noted that indexers disagreed significantly over such terms as are exemplified in Exhibit 10, a phenomenon that is not surprising, nor does it alter the conclusions of this paper.)

Though the problem is serious, in the context of the book index, it is not as serious as it seems from the figures given in Table 2. Term assignments often occur in clusters, with the same term being assigned multiply within a brief span of text. The majority of terms in such a cluster are often judged as non-substantive by human indexers. The unit of analysis in the book is the page and many of these term assignments are collapsed in the final index to a single page reference.
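The collapsing effect described here is easy to see in a sketch: repeated assignments of a term on one page survive as a single page reference, which is why clustered spurious matches hurt the printed index less than the raw precision figure suggests.

```python
def collapse_to_page_references(assignments):
    """Reduce (term, page) assignments to one reference per term per
    page, as the final printed index does."""
    refs = {}
    for term, page in assignments:
        refs.setdefault(term, set()).add(page)
    return {term: sorted(pages) for term, pages in refs.items()}
```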

The second problem resulted from the discrepancy between the indexing unit and the syntactic units that define concepts. The crux of the automatic indexing problem is to associate a term with the correct text line, where "correct" depends on consequences within the SCRIPT environment. Many terms not indexed were missed due to the basic unit of analysis, the SCRIPT line; phrases that spilled over from one line to the next were not recognized as instances of the term. A tradeoff exists between the size of the unit of analysis during automatic indexing, and the unit of reference (the page) when the actual index is created. Whenever a page division occurs between the text string that generated an index entry and the SCRIPT line associated with the index term, an incorrect page reference is created. A large unit of analysis will miss fewer terms, but will more often incorrectly assign terms at the page level. Because page divisions occur after indexing and independently of it, there is no way to anticipate when an assignment will result in an incorrect page reference. It is worth noting that human indexing in the SCRIPT environment must deal with a similar problem.
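One remedy for the line-spillover problem is to match over a window of adjacent lines rather than single lines. The two-line window below is an illustration, not the paper's solution, and it inherits the page-boundary tradeoff just described:

```python
def match_with_window(lines, phrase):
    """Return indices of lines that contain a phrase either alone or when
    joined with the following line, catching phrases that spill across a
    line break (plain substring matching for brevity)."""
    hits = []
    for i, line in enumerate(lines):
        if phrase in line:
            hits.append(i)
        elif i + 1 < len(lines) and phrase in line + " " + lines[i + 1]:
            hits.append(i)
    return hits
```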

The third problem area uncovered by the evaluation had to do with correctable inadequacies in the thesaurus. These were of two sorts. The first was oversights in the set of matching strings for some of the terms. These can be added to the thesaurus as they are discovered, with diminishing returns. A second difficulty was terms that occur too frequently in the text. Typical was the term "default", referred to above as a source of incorrect indexing, which


INDEX TERM: Constant

CORRECTLY INDEXED

"A constant (C) card will print its constants and/or perform..."

"Front and back constant specification." [Section heading]

"SKED field format for a constant." [Section heading]

INCORRECTLY INDEXED

"Overprinting specification causes field and constant data to be printed darker..."

"If right justification is specified...then only one item of data, a constant, a tabulation total,...may be generated."

"...statements are treated as a set of sequential instructions specifying what constant data to generate, which fields to..."

INDEX TERM: Default

CORRECTLY INDEXED

"...the default value for continuation..."

"...determines the default value for page size, vertical..."

"This specification results in special default values for..."

INCORRECTLY INDEXED

"...EXEC statement will cause the appropriate default page..."

"...causes a new paragraph to begin regardless of the default..."

"...the null record permits the default report number."

Exhibit 10. Examples illustrating low precision.

occurred on more than half of the pages. The ideal solution for such a term is to resolve it into a set of narrower terms and represent them in the index hierarchically, with the more general term as the entry. Both of the example indexes use this approach. It is possible to adopt this approach automatically by attaching modifiers to the text strings that cause matches. Special SCRIPT facilities must then be used to represent the hierarchical relationships in the index. (Refer to Exhibit 9, which shows a test of this approach on a section of the text.)
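Attaching modifiers to matching strings can be sketched as follows. The thesaurus fragment, the string-to-modifier pairings, and the helper function are hypothetical illustrations of the idea, not the format actually used in the experiment.

```python
# Hypothetical thesaurus fragment: each matching string carries the
# general index term plus an optional modifier that becomes a subentry
# under that term in the hierarchical index.
thesaurus = {
    "default value":  ("Default", "value"),
    "default format": ("Default", "format"),
    "default":        ("Default", None),   # bare term, no subentry
}

def index_line(line, thesaurus):
    """Return (term, modifier) pairs for a line, longest string first,
    so "default value" is not also counted as a bare "default"."""
    for string in sorted(thesaurus, key=len, reverse=True):
        if string in line:
            return [thesaurus[string]]
    return []

print(index_line("determines the default value for page size", thesaurus))
print(index_line("regardless of the default", thesaurus))
```

The longest-match-first rule is the essential design choice here: it lets a frequent general term like "default" fall through to its narrower, modified forms whenever one of them is present.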

We are left with those terms in the perfect index for which no suitable matching strings can be found. These account for all of the recall failures, once those due to thesaurus oversights and the unit of analysis are removed. They are relatively few, and include such instances as appear in Exhibit 11. In order to capture such terms, a more sophisticated matching algorithm would be required.
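One direction such a more sophisticated algorithm might take is token-level matching that tolerates reordering and fused spellings (so that "sort key" can match "sortkey"). The sketch below is one possible approach under those assumptions, not the algorithm of the experiment, and as the comment notes it trades precision for recall.

```python
import re

def squash(s):
    # Lower-case and drop everything but letters, so "Sort key"
    # compares equal to "sortkey".
    return re.sub(r"[^a-z]", "", s.lower())

def loose_match(term, line):
    """True when every word of the index term occurs in the line,
    comparing space-free forms so that 'sort key' matches 'sortkey'.
    A sketch only: substring tests against the squashed line can
    produce false positives across word boundaries."""
    haystack = squash(line)
    return all(squash(word) in haystack for word in term.split())

# A miss of the kind shown in Exhibit 11 is now recovered...
print(loose_match("Sort key directory",
                  "directory to the contents of the sortkey"))
# ...while an unrelated line still fails to match.
print(loose_match("Variable field", "indicator code requirements"))
```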

A second type of evaluation was carried out with interesting results. Because only a portion of the text was used in the exercise, it was possible to index automatically text that had not been used in developing the indexing thesaurus. Again, the evaluative technique depended upon creating a perfect index, against which the automatic indexing was compared. Table 3 gives the results, the most significant of which is the 0.30 recall. Though not surprising--technical texts, especially manuals, are not redundant, each new section taking up a different portion of the system--this figure is disappointing. What it means, of course, is that the introduction of new terms in the text is relatively constant, and that an indexing thesaurus must be based on the entire text, not selected parts of it. One would wish the world were otherwise, in this and other matters.


TEXT LINE: "directory to the contents of the sortkey"
MISSES: .IX Sort key directory

TEXT LINE: "variable and fixed field data"
MISSES: .IX Variable field

TEXT LINE: "indicator code requirements"
MISSES: .IX Indicator specifications

TEXT LINE: "As each SKED card is entered for processing, a switch is set off"
MISSES: .IX SKED switch

Exhibit 11. Sample of indexing misses.


Table 3. Performance measures for indexing new text.

                    CORRECT TERMS    INCORRECTLY INDEXED

    INDEXED              47                  10

    NOT INDEXED         108                  --

    TOTAL CORRECT TERMS: 155

    PRECISION = .82    RECALL = .30
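The performance measures in Table 3 follow directly from its cell counts, taken here as 47 correct terms indexed, 10 incorrect assignments, and 155 correct terms in all (the counts consistent with the stated totals). The arithmetic can be checked in a few lines:

```python
# Cell counts consistent with Table 3: of 155 terms in the perfect
# index, 47 were assigned automatically; 10 further automatic
# assignments were incorrect.
correct_assigned = 47
incorrect_assigned = 10
total_correct_terms = 155

# Precision: correct assignments over all assignments made.
precision = correct_assigned / (correct_assigned + incorrect_assigned)
# Recall: correct assignments over all terms in the perfect index.
recall = correct_assigned / total_correct_terms

print(round(precision, 2), round(recall, 2))
```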

5. CONCLUSION

We consider the exercise a signal success. The cost in human time is relatively high--the quality of the thesaurus is crucial and not an easy achievement--and likely no less than in the production of a book index by traditional means, but the results are more satisfactory in many ways. First, the thesaurus itself is a concrete product of substantial value. It is capable of refinement and improvement and, in many environments, it is usable for texts other than the one on which it was based. One might consider the example indexes in Exhibit 1 from this vantage. Few refinements would be required for a well-developed thesaurus in mathematics to produce a respectable index for any suitable text. For texts like system manuals, where the vocabulary and the text are constantly evolving, the dictionary becomes an important supplement to the system's documentation effort, considerably easing the processes of updating and re-indexing.

The technique discussed here is likely to be most useful in an environment where the vocabulary is highly specific and a premium is placed on indexing exhaustivity, an environment like the one in which it was applied in this experiment. Moreover, in future text-processing systems, where books may be available for consultation online and boolean retrieval or other automated retrieval techniques are usable, indexing exhaustivity will be a virtue.

Acknowledgements--Many organizations and individuals participated in these experiments and their help is gratefully acknowledged. Among the former were the Institute for Research in Social Science, the Carolina Population Center and the Computation Center at the University of North Carolina, whose technical support and advice were essential. The thesaurus itself was created by Library Science students as part of a course project. The credit is theirs for the successful completion of the project. Gale Shaffer helped in preparing the examples and tables, as well as preparing the SCRIPT file of the paper.

REFERENCES
[1] S. M. PIZER, Numerical Computing and Mathematical Analysis. Science Research Associates (1975).
[2] IBM System/360 Operating System: PL/I (F) Language Reference Manual. IBM (1972).
[3] Bibliographic/MARC Processing System. Carolina Population Center, University of North Carolina at Chapel Hill (1977).
[4] C. J. VAN RIJSBERGEN, Information Retrieval, 2nd Edn. Butterworths, London (1979).
[5] P. H. KLINGBIEL and C. C. RINKER, Evaluation of machine-aided indexing. Inform. Proc. Management 1976, 12, 351-366.
[6] S. S. ARTANDI, Book Indexing by Computer. University Microfilms, Ann Arbor, Mich. (1967).
[7] H. BORKO and C. L. BERNIER, Indexing Concepts and Methods. Academic Press, New York (1978).
[8] Waterloo SCRIPT Reference Manual. University of Waterloo, Waterloo, Ontario, Canada (1979).
[9] L. M. HARROD (Ed.), Indexers on Indexing: A Selection of Articles Published in The Indexer. Bowker, New York (1978).
[10] B. C. BENNION, Performance testing of a book and its index as an information retrieval system. JASIS 1980, July, 264-270.