Computer generation of thesaurus from structured subject-propositions

Infomwrion Pmcessinf & Monagcmml. Vol. 17. pp. l-l 1. 1981 03064573/81/01oMl-llWZ.OO/O

Printed in Great Britain Pergamon Press Ltd.

COMPUTER GENERATION OF THESAURUS FROM STRUCTURED SUBJECT-PROPOSITIONS

F. J. DEVADASON and V. BALASUBRAMANIAN

Documentation Research and Training Centre, Indian Statistical Institute, 31 Church Street, Bangalore 560 001, India

(Received for publication 26 February 1980)

Abstract-Several experiments were conducted at the Documentation Research and Training Centre for generating thesaurus. The work reported in this paper, uses subject headings structured according to postulates and principles for facet analysis, as the input for generating thesaurus. The system uses a coding scheme for augmenting the subject headings to make them suitable for computer manipulation. Once the subject headings are coded and input, the other processes are done automatically. The system has five phases namely, Translation Phase, Term-pair Generation Phase, Coordinate Term-pair Generation Phase, Retranslation Phase and Printing Phase. The system is described briefly, giving the systems flow-chart, inputs and outputs of the different phases and a sample printout of a model thesaurus generated using test data of about 1500 subject-propositions from Leather Technology.

I. INTRODUCTION 1.1 Thesaurus A thesaurus is a list of structured compulsory keywords. It could be defined as “a controlled dymanic vocabulary of semantically related terms offering comprehensive coverage of a domain of knowledge”. It is a controlled list of terms with indication of conceptually associated terms for use in information retrieval systems. In defining the term thesaurus for information retrieval use, WALL[~] specifies that a thesaurus “must list terms, exhibit relationship, and define vocabulary”. The thesaurus, an authority list or controlled vocabulary of terminology, is an enumeration of approved index terms from which the indexer and the searcher make their selections. Entries also appear for non-approved terms and the user is referred to appropriate approved terms. The conceptual structure and terminological control provided by a thesaurus contains

(1) structured system of concepts with indication of hierarchical and associative relationships between the concepts; and

(2) all the terms that designate a particular concept, i.e. synonymous terms.

1.2 Need for thesaurus

“A system that does not use thesaurus is based on the premise that words have precise meaning and that they do not derive any significance from context”[2]. Thus, a word must mean the same thing to all those attempting to communicate. The need for vocabulary control results from the occurrence of imprecisely defined words, rapidly changing terminology and numerous synonyms for a term within a particular discipline or system. Therefore, an information centre needs a vocabulary of terms which are sufficiently precise in meaning and which signify the concept of interest to the users of the centre. As a result, the design and construction of thesaurus of terms for use in information retrieval has gained much importance in recent years and several thesauri in different subject fields have been published.

1.3 Structure of thesaurus

The structure of thesaurus combines features of classification schemes as well as subject heading lists. In a thesaurus, .terms are almost always arranged alphabetically like the subject heading list; like a classification scheme, it displays the hierarchical relationships among terms by means of generic (broader) and specific (narrower) term designations, synonymous term and related term designations[3]. “Although the hierarchies are not so discretely displayed (in a

1 IPM Vol. 17. No. I-A

2 F. J. DEVADASON and V. BALASUBRAMANIAN

thesaurus), they go geyond the confines of the traditional classification array, by permitting any term to appear in as many hierarchies as may be appropriate”[4].

1.4 Computer generation of thesaurus Several experiments were conducted at the Documentation Research and Training Centre in

the computer generation of thesaurus. To begin with, a faceted scheme for classification for a specific subject field was used in the computer generation of thesaurus [IS] in which it was possible to incorporate the hierarchic relationship (HR) of terms and only one type of non-hierarchic relationship (NHR), namely, the coordinate relationship of terms. In another approach, subject headings arrived at by facet analysis based on the General Theory of Library Classification was used[6], in which it was possible to generate HR and all NHR except coordinate relationship.

In this paper, a computer-based system for generating thesaurus from modulated subject- propositions (subject headings), structured according to facet analysis, augmented with appropriate codes for relating the terms in the subject-propositions is presented. The system generates HR and all types of NHR, i.e. associative relationships and coordinate relationships.

2. SYSTEM OVERVIEW

The subject-propositions formulated using the Postulates and Principles of General Theory of Library Classification[7], augmented with codes denoting the relation between terms and the typology of their relationships (see Section 3.2.2) are input to a program “CODER”. This program generates an unique serial number for each of the unique terms in the propositions and creates a term-code dictionary and rewrites the input subject-propositions as coded expressions replacing the terms by their respective serial numbers. The coded subject expressions are manipulated by a program “GENTHES” which generates the different term-pairs forming entries in the thesaurus according to the codes supplied along with the subject-propositions. The generated entries are sorted and coordinate relations between NTs to a particular term are generated by another program “GENCORD”. The output file created by this program is retranslated into the natural language using the term-code dictionary created in the first step and are sorted and printed in double column format as thesaurus by a print program. The overall systems flowchart is given in Exhibit 1.

3. INPUT TO THE SYSTEM

3.1 Formulation of subject-propositions The input to the system is prepared following the specific steps of the Postulational

Approach to Library Classification@-101. The title of the document or the raw specification of the reader’s query is taken as the

starting point. All the specific subjects dealt within the document/reader’s query are determined and

expressed in natural language. Let one of the subjects be expressed as “Re-tanning uf chrome tanned leather using

chestnut”.

The subject is then expressed in natural language as an “expressive title” as given below:

“In leather technology, re-tanning of chrome tanned leather by vegetable tanning using chestnut”.

By analysis, each of the specific component ideas in the subject and the broader or superordinate ideas or upper links implied by each of the component ideas are determined. The apparatus words are removed and the kernal terms (key words) obtained are arranged in a helpful sequence using the principles for facet sequence of the Genera1 Theory of Library Classification[7, lo]. The subject-proposition thus arrived at is given below:

“Leather Technology, Leather > Chrome Tanned Leather, Re-tanning, Vegetable Tanning, Chestnut”.

Computer generation of thesaurus from structures subject-propositions 3

Tempo-

GEXTHES l-ax-y

~'Work File

Exhibit 1.

The hierarchic (HR) and the non-hierarchic (NI-IR) associative relations between the terms and the Typology of Relationships of terms (see Section 3.2.2) are determined. The Typology of Relationships are inserted between the terms as phrases enclosed within brackets. The subject-proposition thus arrived at is given in Exhibit 2.

Note I: In Exhibit 2 the dotted lines indicate NT/BT relations and continuous tines indicate RT

relations.

LEATHER TECHNOLOGY. LEAkHER. (Type of -) CHROME 3 ANNED LEATHER.

(process use>TAGsed-) KG&ABLE TANNING. 4 \

(agent used -) CHESTNUT.

Exhibit 2. Formulated subject proposition.

4

Note 2:

F. J. DEVADASON and V. BALASUBR,~~AN[AN

The different entries for the thesaurus are generated following the connecting lines. For every entry, a reverse entry is also generated. In other words, as and when a NT relation is generated, a reverse BT relation is generated and for every RT relation, a reverse RT relation is generated. Moreover, the position of the hyphen in the Typology of Relationship phrases is changed from prefix to s&ix and vice versa as appropriate. The different entries that would be generated for the subject proposition given in Exhibit 2 are given in Exhibit 3, where the Lead Terms are given in capitals and the Context Terms are given in lower case letters. (For computer manipulation, the connecting lines shown in the subject-proposition are replaced by suitable codes described in Section 3.2.)

Note 3: The position of the hyphen in the Typology of Relationship phrase has a special role in

reading the entries in the thesaurus. If the hyphen occurs at the end of the phrase, then the Context Term should be substituted in its place and the entry should be read bottom upwards. For example, in the first block in Exhibit 3, the first entry will be read as “Re-tanning (is a) process used (in) Leather Technology”. If the hyphen occurs at the beginning of the Typology of Relationship phrase, then the Lead Term should be substituted in its place and the entry should be read top downwards. For example, in the last block, the entry will be read as “Chestnut (is an) agent used (in) Vegetable tanning”.

3.2 Coding the subject-propositions The relationship between pairs of terms NT or RT as indicated by dotted lines and

continuous lines respectively in the example given in Exhibit 2 are replaced by appropriate codes to form the input to the system for further manipulation.

The codes used in the subject-propositions for generating entries for a thesaurus are of two types namely,

(1) those that indicate which terms are to be related and whether the relation is NT or RT or SYN; and

(2) those that indicate the Typology of Relationships.

3.2.1 Codes fur refating terms. The codes for relating terms are of three types namely,

(1) those that indicate NT relation; (2) those that indicate RT relations; and (3) that which indicates synonymous relation.

LFATHBR TWHNOLOGY

RT (process used-)

He-taming

LEA THKH

Err (type of -)

Chrome taaaad leather

BHUWE TANNGU LIUTHKR

ET i-type of)

Leather

HT (procsss used-)

W&stable tan ti.oR

fit?-tanning

RL-TANNlNG

RT t-prooess used)

Leatbar technology

Chrome tamed lsatner

(proce#s used-)

Vegetable iaenxn&

VgGZTABLK TINNING

RT (ageor used-)

Cbestaut

(-prucaas urea)

Be-tan mng

Carom taamd leatner

CHltSTNUT

RT (-agent used)

Vegetable taming

Exhibit 3.

Computer generation of thesaurus from structured subject-propositions 5

(a) The codes for generating NT relation and the associated computer manipulation are:

$2 Generate a NT relation with the immediately succeeding term using the typology of relationship code of the term being manipulated and generate a reverse BT relation changing the “-” in the typology of relationship code (Genus-species relation).

$3 Generate a NT relation with the immediately succeeding term and generate a reverse BT relation. No typology of relationship code is used (Whole-part relation).

(b) The codes for generating RT relation and the associated computer manipulation are:

$1

$0 $5 $6 $7 $8

Generate a RT relation with the immediately succeeding term using the typology of relationship code and generate a reverse RT relation changing the “-‘I in the typology of relationship code.

Generate a RT relation with the immediately preceding term with the same “code” taking the typology of relationship code of the term being manipulated and generate a reverse RT relation changing the “-” in the typology of relationship code.

(c) The code for generating synonymous relation and the associated computer manipulation is:

I Generate a synonymous relation with the immediately preceding term and generate a 1 reverse USE relation.

Note: A special code “$4” is also provided to stop generating any relation between which it is

coded.

3.2.2 Codes for typology of relationship. When subject representations are facet analysed and structured into subject-propositions, it could be found that the relationships between the ideas forming component of subjects are mostly of the NHR associative type. It has been found helpful to identify patterns of association of ideas, forming components of subjects and categorise the relationships into a few types using suitable phrases and also to group the concepts on the basis of these types of relationships.

Several studies have been made in this area[llj. Almost 39 types of relationships have been identified so far[l2]. In general genus-species relations and part-whole relations are taken as hierarchical and represented as NT (and BT) in a thesaurus. Representation of Genus-species relations could also be categorised to achieve better display format and for generating RTs out of NTs to a particular concept/term (see Section 4.3).

The following is an extract from a table of Typology of Relationships we have used:

00 Coordinate ideas 01 Source .*. . . . . . . 07 Property of 08 Process used . . . . . . . . . 12 Agent used 13 Device used . . . . . . . . . 16 Type of ..* ..* . . . 25 Used for

6

Note: It would be helpful

thesaurus based on the ponents of the subject.

F.J. DEVADASON and V. BALASLIBRAMANIAN

to develop the typology of relationship codes for the subject of a characteristics used for deriving the different isolates forming com-

3.2.3 Coded subject proposition. The subject-propositions are augmented with the codes

described above to form the input for thesaurus generation. The subject-proposition given in Exhibit 2 is augmented with codes described above to reflect the different NT and RT lines (links) as given below:

$0 LEATHER TECHNOLOGY $4 LEATHER $2 (16-) $6$5 CHROME TANNED

LEATHER $5 (08-) $0 (08-) RE-TANNING $6 (089 $0 (08-) VEGETABLE

TANNING $0 (12-) CHESTNUT #

Exhibit-4 Coded input subject-proposition

J~LJ~rmB:~7-)~cmBS~ s?BmsTe:~t )[email protected] BUYS1 tssT:l3 )CIISTOpxTEJe JotllllT LBJTABR(~9 )P15STOpPIJG:12 ,JoB?Kp fALLLO, 32 )15JOJLFJTt JOj,U COJT"G,S,LT PJCK C~RIWGf~6 ISIDET SLLt‘liJ Jas,SLLr CDRIUG:l6 ,12117 DL"IJG:b6 )JSBE J.LT‘liJ 1oSbJ6UPpBJ IKJWWJ,l2 )S~ICBIIIC PESIIS5IIJILXUIII CSLO6ID,,S6ST~EJl 8O~DIkllJ PhL(-PIWY~G:12-,s~SrI~I1 (lf )s2nAPTEIlKY ZULPcljYICJ JoJt,d'FLlJT CCK¶POL:l~,J"TBICKIIJG pIL~LYS5LBBATDJ STSTJd9 J=J>J6J7JllBB JIPELLIIC~llZ )lOl IREltL SOCCIYXC rCIDJ5SILICOJ1J69TII~JJ~CU~OSIC CdLUkILx CCll~LHnn.mBxwJ*rC PaTTr *tID :LI*oII1oll COIIPIJ,. P J5J6LBJTHEJ(l6 I J~JLDpK9D? LllTRBBl51LOB ttillBC LIA~~BS~AJILIBB DIBD LJITElB4 J"J514*,EJTtl CBBOt? TJKIIIGIl2 )J~DICEBOM~,5B"~~~ TBICSOLJITBI J~s~s6n~SrrJG:l2 ) J~I~TPI~BJ~cJLoP~DRJ~sUL~~S~~OYBAT~~ J~JaJ697?!ASKIKGil2 )Y~IC~~l~~~SOIpI?BS6CJAL~~7C~LATBJ J~J~%MSKIKG:l2-.)SOOPGAKIC JCIDJ5PB9BILJ~KJ6:lTPIC .CID‘ JoT&YKIKG~6~ ,QJLDK"~DB TIJJT"G:l7 )J%LUrAk&LLoB,,l,,I lulllIOKIC RRDLSIOV nc-,s2sanJIc LCID LCULSIOM JJ:bfICWIC J"GLSI3":~6~) 17QPlTRRUIPT ,ll,"J JIIGLSLOY, 1"1516J7LRATS1(*RIlF ,JOBO?p I?hTWJBSIABCT L~A~UGJWA?IHT L6JTJlIEI7LIGUT LSATIBI L

Exhibit 4. Input card image

4. OPERATION OF THE SYSTEM

4.1 Translation of coded subject-propositions An assorted number of subject-propositions from a specific subject field, augmented with

codes for relating terms and codes for typology of relationships are punched on to cards in free format from column l-74. Columns 75-80 are used for serial numbering of the cards. Any number of cards could be given for a subject-proposition, the end of which is denoted by a “#” mark. A sample card image of input subject-propositions is given in Exhibit 4. The punched subject-propositions are read by a program “CODER” and a table is built in the memory. Each of the terms in the subject-propositions is entered in the table with sequential serial numbering simultaneously replacing the respective terms in the subject-propositions by their respective serial numbers. As and when a term is encountered in a subject-proposition, it is matched with the existing terms in the table and its serial number is picked if the term is available, if not the term is entered as the last entry in the table with appropriate serial number and the given serial number is replaced in the subject-proposition. The codes for relating terms are also translated into alphabetic or special characters. This translation is done for easier manipulation by computer. The term-table built in the memory and the translated coded subject-propositions are written as files on tape for further processing. Exhibit 5 is a dump of the term-code dictionary and Exhibit 6 is a dump of the translated coded subject-propositions.

4.2 Manipulation of translated coded subject-propositions The general rule for manipulating the subject-propositions is that, each of the terms

(represented by their respective serial numbers) form a term-pair with the succeeding term in the subject proposition following the “two types of codes” in between them. The manipulation

8 F. J. DEVADASON and V. BALASUBRAMANIAN

starts from left and carried on to the right of the proposition. Once an entry is prepared the reverse entry is generated automatically by changing the position of the “lead term” and the “context term”. In HR entries, the relation NT is changed to BT in reversing the entry. In RT entries the relation does not change in reversing. In the case of entries having the typology of relationship codes, the position of “-” (in the typology of relationship code) is changed from prefix to suffix and vice versa as appropriate. In the case of synonyms indicated by “I” in the subject-proposition, a SYN and USE reverse entries are generated.

When the codes for relating terms fall into the group “$0, $5, $6, $7, $8” separate pointer tables of these terms are built along with their typology of relationship codes. When succeeding terms with the same relation code are encountered in a subject-proposition, then term-pairs are generated using these pointer tables. A detailed description of the logic and algorithm of the program generating the different entries are given elsewhere[l3]. These processes are done by a program named “GENTHES”. The entries for the thesaurus at this stage are in the form of serial numbers standing for the “lead” and the “context” terms with the typology of relationship code in between them. Exhibit 7 is a dump of the generated term-pairs.

4.3 Generation of coordinate term-pairs The term-pairs so for generated are the HR and non-hierarchic associative types. Terms

coordinate to a particular term are not present in them. In order to generate coordinate entries, the generated thesaurus entries are sorted in ascending sequence so that, serial numbers for the same “lead term” are brought together. The serial numbers of the “context terms” that are NTs

Computer generation of thesaurus from structured subject-propositions 9

to a particular “lead term” (having the same serial number in the lead term position and having the same typology of relationship code) are formed as a separate table and coordinate RT term-pairs are generated, merged, and passed as a data set for further processing. The generation of coordinate term-pairs is done by a program “GENCORD”.

Note: For example, two groups of NTs to the term “unhairing” are given below. The first group

has no Typology of Relationship and the second group has the Typology of Relationship “type of-“.

UNHAIRING NT

Hair Saving Hair Burning

(type of-) Lime Unhairing Enzymatic Unhairing

The program “GENCORD” generates coordinate RTs among the terms in each group separately. ‘The generated entries for the above example are:

HAIR SAVING LIME UNHAIRING

RT (Coordinate ideas) RT (Coordinate ideas) Hair Burning Enzymatic Unhairing

HAIR BURNING RT (Coordinate ideas)

Hair Saving

ENZYMATIC UNHAIRING RT (Coordinate ideas)

Lime Unhairing

In actual processing, this manipulation is done using the serial numbers of the terms concerned.

4.4 Retranslation of translated entries The file of generated entries for thesaurus is retranslated back to natural language by a

program. The term-code dictionary created as a sequential file on tape by the program “CODER” is read into the memory as a table. The typology of relationship codes and their corresponding descriptive phrases are read from cards and built in the memory as another table. The file of generated entries is read record by record. The serial number of both the “lead” and “context” terms are translated into natural language, using the table built from the term-code dictionary. The typology of relationship code is also translated into the corresponding descriptive phrase using the corresponding table built in the memory. The translated entries are written onto a tape for further sorting and printing. This retranslation is done by a program “TRANSLAT”.

4.5 Sorting and printing thesaurus The file of thesaurus entries in natural language, output of the “TRANSLAT”, is sorted

alphabetically using the SORT program available in the computer system. It is then printed out in double column format with proper indention for “lead” term, relation, typology of relationship, and “context” term. The format of display of a page from the thesaurus generated is given in Exhibit 8.

4.6 Updating thesaurus A thesaurus is never complete. It has to be updated continuously based on the experience

gathered in its practical application so as to reflect the most recent developments in the subject field[14]. In order to add new terms and new term-pairs to an already generated thesaurus, a built-in updating feature is provided in this system. For this purpose, both the term-code

IPM Vol. 11. No I-B

10 F, J, DEVADASON and V. BALASUERAMANIAN

Exhibit 8. Thesaurus sample

dictionary created by the program “CODER”, and the generated term-pairs created by the program “GENTHES”, of the generated (old) thesaurus are to be kept on tape as back-up copy. When new subject-propositions are to be processed and added to the existing thesaurus, the execution of the program “CODER” is to be invoked by a “control card” specifying the run as an “update run”. The “CODER” program will then be read the old term-code dictionary and form a table in the memory. Then it will read the new update subject-propositions and proceed with its translation function creating translated new subject-proposition and a new term-code dictionary. Then the translated new subject-propositions are to be fed to the program “GENTHES” and the newly generated term-pairs are to be “merged” with the old generated term-pairs (already backed-up on tape) and passed on for processing by the program “GENCORD” for generating coordinate RTs. The generated term-pairs are to be retranslated back to the natural language by the program “TRANSLAT” using the new term-code dictionary and then sorted and printed. For deleting an existing term-pair in the generated thesaurus, a separate prog~m “CORUPDT” is to be run after retranslation and sorting.

5. PROGRAMS DEVELOPED

The programs developed for generating thesaurus as outlined in this paper are written in ASSEMBLER and ANS-COBOL language for IBM system 360/370 series computers and require a 256 K partition, two tape drives, a disk-drive, and line printer. The programs have been tested at the IBM/370-ISS OS/VSI system available at the Computer Centre, Indian Institute of Technology, Madras. A sample test data of about 1500 subject-propositions in Leather Technology was used to gnerate a thesaurus. The number of unique terms were 1851, the total number of thesaurus entries were 13,717. The programs were kept as load modules and executed. The total CPU time taken was 10 min 26.73 sec. The generated Leather Technology Thesaurus is being used, and put to test, at the ~cumentation Centre of the Central Leather Research Institute, Madras. These programs will be used to generate a few thesauri and will be modified in the light of the experience gained to develop a software for generating thesaurus.

6. FURTHER WORK

It is possible to associate each term in the thesaurus with a frequency count calculated from the total number of assorted subject-propositions input, and the number of times a particular term has occurred in them. A study is being carried out to investigate the feasibility and usefulness

Computer generation of thesaurus from structured subject-propositions II

of this additional provision in the thesaurus. A work[5] in keeping the generated thesaurus as a machine readable “indexed file” on disk to aid in query formulation has been carried out and is under modification to develop an integrated on-line information system.

Acknowledgements-We thank Prof. A. Neelameghan for his help during the early stages of development of the system.

We are grateful to Prof. G. Bhattacharyya for his valuable guidance and encouragement. We also thank Mr. B. Nageswara Rao and Miss R. L.akshmi for their help.

REFERENCES [I] E. WALL, Quoted in V. G. Rogers Thesaurus construction: an introduction. Drex Lib. Quart. 1972, 8,

117. [2] C. F. BALTZ, The need for a thesaurus in automated information retrieval. Owego, New York, IBM,

1%2. [3] V. G. ROGERS, Thesaurus construction: An introduction Drex Lib. Quart. 1972, 8, 117. [4] M. R. HYSLOP, Sharing vocabulary control. Special Libraries 1%5, 56, 708-774. [5] R. LAKSHMI, Creation of thesaurus as a machine readable file to aid query formulation in computer-

based bibliographical information retrieval systems. Guide F. J. Devadason, Project submitted to DRTC, 1978.

[6] I. K. RAO RAVICHANDRA, Semi automatic method of construction of thesaurus. Seminar on Thesaurus in Information Systems, DRTC, Bangalore, 1975.

[7] S.R. RANGANATHAN, Prolegomena to Library Classification, 3rd Edn., Bombay (1967). [8] G. BHAITACHARYYA, POPSI: Fundamentals and procedure. Seminar on Subject Indexing, DRTC, Dec.

1975, Bangalore. [91 A. NEELAMEGHAN and M. A. GOPINATH, Postulate based permuted subject indexing (POPSI). Lib. Sci.

Slant Docum. 1975, 12, 80. [IO] S. R. RANGANATHAN, Hidden roots of classification. Inform. Stor. Retr. 1%7, 3, 399-410. [ 111 A. NEELAMEGHAN and I. K. RAO RAVICHANDRA, Non-hierarchical associative relationships: their types and

computer generation of RT links. Lib. Sci. Slant Docum. 1%7 13, Paper C. [12] A. NEELAMEGHAN and R. MAITRA, Non-hierarchical associative relationships in social science:

Identification and typology. DRTC Annual Seminar, 15, 1977, Bangalore. [13] V. BALASUBRAMANIAN, Computer-based thesaurus generation from modulated subject structures.

Guide F. J. Devadason, Project submitted to DRTC, 1978. [14] D. SOERGEL, Indexing Languages and Thesauri: Construction and Maintenance, p. 13. Melville,

California (1974). [15] M. SHEPARD and C. WAITERS, Computer generation of thesaurus. Lib. Sci. Slant Docum. 1975, 12,

Paper E.

Documents

Computer generation of thesaurus from structured subject-propositions