Document reuse: Organizing, finding and reorganizing content

International Journal of Information Management (1992), 12 (310-319)

Roy Rada is a Professor of Computer Science at the University of Liverpool, Liverpool L69 3BX, UK (phone 44-51-794-3669; email rada at liverpool.ac.uk);

Hafedh Mili is an Associate Professor in the Département de Mathématiques et d’Informatique, Université du Québec à Montréal, Boîte Postale 8888, Succ ‘A’, Montréal, Québec H3C 3P8, Canada (email mili at aicha.info.uqam.ca).

Funds for the authors to meet were provided by a NATO Travel Grant. The work of Hafedh Mili is supported by a grant from Clam and NSERC. The work of Roy Rada is supported by ESPRIT Project P-1094. The version of MUCH presented in Figure 2 was developed by Karl Strickland.

¹CAREY, T.T., HUNT, W.T. AND LOPEZ-SUAREZ, A. (1990). Roles for tables of contents as hypertext overviews. Human-Computer Interaction - Interact ‘90, pp. 581-586.
²NEUWIRTH, C. AND KAUFER, D. (1989). The role of external representations in the writing process: implications for the design of hypertext-based writing tools. Proceedings Hypertext ‘89. New York: Association of Computing Machinery, pp. 343-364.
³CAMPBELL, B. AND GOODMAN, J.M. (1988). HAM: a general-purpose hypertext abstract machine. Communications of the Association of Computing Machinery, 31 (No. 7), pp. 856-861.
⁴REUBENSTEIN, H.B. AND WATERS, R.C. (1991). The requirements apprentice: automated assistance for requirements acquisition. IEEE Transactions on Software Engineering, 17 (No. 3), pp. 226-240.
⁵RADA, R. (1990). Hypertext writing and document reuse: the role of a semantic net. Electronic Publishing, 3 (No. 3), pp. 3-13.


Document Reuse: Organizing, Finding and Reorganizing Content

R. RADA AND H. MILI

Existing structures should be exploited as fully as possible in the course of document reuse. Reuse may take multiple forms: rearranging a single document so as to provide different views of the same information, copying a portion of a single document so as to provide a portion of a new document, and combining portions of existing documents to constitute a new document. We present algorithms for this reuse, describe some computer tools which we have built to support it, and provide real examples. For significant reuse, the information in existing documents needs to be abstracted so as to highlight the conceptual patterns. Outlines are a structure which could at least partially support this function.

Introduction

In the halls of academe some people talk about document reuse as an unethical way to pad out one’s résumé, but the benefit of presenting similar claims in different forms for different audiences is often accepted. In the factories of software engineers, software reuse is a treasured goal. What is the key to good reuse? Under what conditions can one find information which is then susceptible to reuse? How can one take a single document and reorganize it for a new audience or convert it into a hypertext from which many different views of the same information can be obtained?¹

One function of a document is to provide information which can be repeatedly used by people other than the authors of the document. This use of the information becomes reuse when the information is included in another document. Students are taught to write technical reports by collecting notes from other sources and then synthesizing those notes into a new report. This is a kind of reuse which is extended as the tools for finding the relevant sources of notes and the tools for combining those notes into a new product are extended.² Within the realm of software engineering, the approach to reuse has been connected to object-oriented³ and knowledge-based methods.⁴ Furthermore, one of the growing emphases in software engineering is on the requirements document, which is more like a traditional natural language document than the design or code portions of the software lifecycle.

The approaches to document reuse depend on multiple access points into document(s) and abilities to reassemble components - these abilities are intimately linked to hypertext technologies.⁵ A hypertext model may have a runtime layer, a storage layer, and a within-component layer.⁶ The storage layer is composed of nodes and links, and each node or link may have arbitrarily many attribute-value pairs. This paper focuses on the storage layer model of hypertext as it is germane to reuse.

This paper has been carefully organized so as to demonstrate how reorganization can naturally flow from organization. The three major sections of the paper are called ‘organizing’, ‘finding and reusing’, and ‘reorganizing’. Within each of these sections are subsections entitled ‘principles’, ‘examples’, and ‘systems’. Finally, those subsections are, in turn, divided into ‘textbook’ and ‘software’ subsubsections. The reader can imagine how the paper would be reordered to emphasize one perspective over another.

0268-4012/92/04 0310-10 © 1992 Butterworth-Heinemann Ltd

⁶HALASZ, F. AND SCHWARTZ, M. (1990). The Dexter hypertext reference model. Proceedings of the Hypertext Standardization Workshop. Washington, DC: US Government Printing Office. National Institute of Standards and Technology Special Publication 500-178, pp. 95-134.
⁷GARG, P.K. AND SCACCHI, W. (1990). A hypertext system to manage software lifecycle documents. IEEE Software, May, pp. 91-98.
⁸RADA, R. (1991). Hypertext: from text to expertext. London: McGraw-Hill.

Organizing documents

Amorphous documents do not lend themselves to reuse. What kinds of structure or organization can be imposed on documents? Classification of components is only one aspect of organization.

Principles

Textbook. A textbook is usually organized for access with a table of contents and an index. These may be represented in the node-link-attribute model by viewing the headings of the textbook as nodes and the links as providing the hierarchical structure of the table of contents. The text which follows a section heading X and precedes the next section heading is attached to the node X. The terms in the index may be viewed as nodes which have links to the relevant node in the table of contents. For large document collections an additional organizing principle is the classification of documents by terms from a classification language, such as the Dewey decimal classification.
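The node-link-attribute view of a textbook described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class names (`Node`, `Link`, `Textbook`), the link kinds `contains` and `indexes`, and the index term ‘link’ are hypothetical and are not taken from the MUCH system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A heading or an index term; text holds the prose attached to a heading."""
    name: str
    text: str = ""
    attributes: dict = field(default_factory=dict)


@dataclass
class Link:
    """A directed, typed link between two nodes, identified by key."""
    source: str
    target: str
    kind: str  # "contains" (table of contents) or "indexes" (index entry)


class Textbook:
    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_heading(self, name, text="", parent=None):
        self.nodes[name] = Node(name, text, {"role": "heading"})
        if parent is not None:
            self.links.append(Link(parent, name, "contains"))

    def add_index_term(self, term, headings):
        key = "index:" + term  # keeps index terms distinct from headings
        self.nodes[key] = Node(term, attributes={"role": "index term"})
        for heading in headings:
            self.links.append(Link(key, heading, "indexes"))

    def children(self, name):
        return [l.target for l in self.links
                if l.kind == "contains" and l.source == name]

    def lookup(self, term):
        """Headings reachable from an index term."""
        return [l.target for l in self.links
                if l.kind == "indexes" and l.source == "index:" + term]

    def table_of_contents(self, root, depth=0):
        """Indented outline obtained by following 'contains' links."""
        lines = ["  " * depth + root]
        for child in self.children(root):
            lines.extend(self.table_of_contents(child, depth + 1))
        return lines


book = Textbook()
book.add_heading("Hypertext")
book.add_heading("Small-volume hypertext", parent="Hypertext")
book.add_heading("Principles", "Text on principles ...", parent="Small-volume hypertext")
book.add_heading("Large-volume hypertext", parent="Hypertext")
book.add_index_term("link", ["Principles", "Large-volume hypertext"])

print("\n".join(book.table_of_contents("Hypertext")))
print(book.lookup("link"))
```

Keeping the table of contents and the index as two kinds of link over one set of nodes means that either structure can serve as an access point into the same text.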

Software. We use the term software document to refer to all the products of software development. Referring to a simplified waterfall lifecycle, software documents include requirements documents, design documents, code, test specifications and code, and user and maintenance manuals. It is customary to divide software documents into two categories: (1) documents that need to be understood by the computer (mainly code), and (2) documents meant for human consumption. The latter, traditionally referred to as ‘software documentation’, is meant both for developers, to help them develop and maintain code, and for users, to help them use the software.

As software projects increase in size, researchers have recognized the need to manage software documentation to maintain document consistency, and to ensure retrievability, revision control, and completeness.⁷ Achieving document consistency, proper revision control, and retrievability benefits from making the various relationships between software documentation explicit. Hypertext systems offer both the conceptual abstractions and computer tools to support those associations. With the advent of computer-assisted software engineering tools, the distinction between the two categories of software documents is becoming a fuzzy one, and what is traditionally referred to as software reuse encompasses reusing both code and documentation.

Examples

Textbook. A textbook entitled Hypertext: from text to expertext (hereafter called Hypertext) has been written as hypertext and published in paper form as a traditional textbook.⁸ The chapters of Hypertext include ‘Small-volume hypertext’ and ‘Large-volume hypertext’. The chapters in Hypertext include sections on ‘Principles’ and ‘Systems’. Furthermore, links and nodes exist in the hypertext which are not apparent in the printed table of contents. For instance, nodes exist for ‘Principles’ and ‘Systems’ which are more general than those nodes within the individual chapters. These high-level ‘Principles’ and ‘Systems’ nodes are linked to the lower-level ones in the chapters (see Figure 1).

Figure 1. The ‘principles’, a descendant of ‘small-volume’, are specifically about the principles of small-volume hypertext. The principles at the bottom of the screen which are directly connected to ‘hypertext’ are about principles in general

Software. Two structures are evident when we look at software documents:

• code fragments have an inherent aggregation/function decomposition structure evident in data flow charts or hierarchical structure charts;

• software documentation has a structure that is inherent in document outlines.

We analysed one set of software documentation from the software lifecycle at the Centre de Technologie Tandem in Montreal (CTTM). CTTM has a policy on the tables of contents for each document of the software lifecycle for all distributed systems management software.

The structure of code fragments is not reflected in the requirements stage. Such a correspondence becomes evident, however, at the analysis and design stages, where particular sections of the analysis and design document can be unambiguously mapped to code fragments. Automating this mapping may prove crucial to various aspects of software document reuse.

Systems

Textbook. We have developed a tool, called the Multiple User’s Creating Hypertext (MUCH) system, which runs on graphics workstations networked to a relational database management system and supports collaborative editing of links and nodes (see Figure 2).⁹ A document is represented as a set of linked nodes, and tools exist for importing documents into the MUCH system and for exporting MUCH documents into other systems and formats.

Figure 2. The outline is in the lower right. The nodes emanating from one node are in the upper right. Paragraphs that are attached to the nodes may appear in the upper left when the user selects from the windows on the right. The user has elected to create another link between two nodes and is entering the relevant information in the small, ‘pop-up’ window in the centre of the screen

⁹RADA, R., ZEB, A., YOU, G.-N, MICHAILIDIS, A., AND MHASHI, M. (1991). Collaborative hypertext and the MUCH system. Journal of Information Science, August.

Software. A software development and reuse tool, called SoftClass, is being developed at the University of Quebec at Montreal under sponsorship of TANDEM. SoftClass adopts a software component (module)-centred approach to documentation. The documentation relevant to a software component is encapsulated in an object structure with attributes and sub-components, rather than dispersed across various sections of the same (or different) documents. A software engineer interacts with the system and enters the description of a software component based on some description template (called software description categories). Traditional development documents can be generated by browsing through software component descriptions following a combination of aggregation hierarchies and attribute value links.

A prototype of SoftClass has been implemented in Smalltalk on Sun workstations. The user is guided in his selection of attributes and attribute values. Attribute values consist of a combination of keywords and text. Keywords are used to link the textual component of attribute values (see Figure 3).

Finding and reusing documents

The reuse of documents can be facilitated by the hypertext features of a system like Intermedia.¹⁰ Furthermore, once documents have been organized, sophisticated strategies can be applied to retrieval and reuse of document components. What are these strategies and how do they materialize in practice?

Figure 3. SoftClass screen dump

¹⁰KAHN, P. (1989). Linking together books: experiments in adapting published material into Intermedia documents. Hypermedia, 1 (No. 2), pp. 111-145.
¹¹Op. cit., Ref. 2.

Principles

Given the task of creating a document based on a partial abstraction, one general strategy is to:

• find the closest match between the partial abstraction and existing abstractions for complete documents; and

• determine how much of the partial description can be completed by copying from the matched abstraction and its document.

The strategy can be embellished when the documents belong to connected sets of documents.
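The two-step strategy above can be sketched as follows. This is a minimal illustration under assumptions of ours: an abstraction is taken to be a list of headings, similarity is the fraction of wanted headings that an existing outline contains, and the library contents are hypothetical placeholders.

```python
def similarity(partial, outline):
    """Fraction of the partial abstraction's headings found in an outline."""
    wanted = set(partial)
    if not wanted:
        return 0.0
    return len(wanted & set(outline)) / len(wanted)


def closest_match(partial, library):
    """Step 1: name of the document whose outline best matches."""
    return max(library, key=lambda name: similarity(partial, library[name]["outline"]))


def complete(partial, document):
    """Step 2: what can be copied, what is still missing, what else is on offer."""
    reusable = {h: document["text"][h] for h in partial if h in document["text"]}
    missing = [h for h in partial if h not in document["text"]]
    suggested = [h for h in document["outline"] if h not in partial]
    return reusable, missing, suggested


# Hypothetical library: outlines plus the text attached to each heading.
library = {
    "Hypertext": {
        "outline": ["Principles", "Systems", "Collaborative hypertext"],
        "text": {"Principles": "...", "Systems": "...", "Collaborative hypertext": "..."},
    },
    "Thesis": {
        "outline": ["Literature review", "Model", "Experiment"],
        "text": {"Literature review": "...", "Model": "...", "Experiment": "..."},
    },
}

partial = ["Collaborative hypertext", "Principles", "Evaluation"]
best = closest_match(partial, library)
reusable, missing, suggested = complete(partial, library[best])
print(best, list(reusable), missing, suggested)
```

Other similarity measures could be substituted without changing the overall shape of the strategy.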

Textbook. The authoring process may begin with the generation of notes and the connection of those notes in a graph-like structure. From this graph an initial outline is formed. Based on the notes, the outline, and other sources of information, the author creates paragraphs which then need to be related to the outline and the outline perhaps changed.¹¹ So continues an iterative process of building paragraphs and revising the outline. (This process is similar to that used in building the classification languages for large document collections where the arrival of new documents may lead to the refinement of the classification language.) The phases of the writing process may be loosely likened to the steps of the software lifecycle. The outline may correspond to the software design document, and the entire document may correspond to the software code.

Software. If software documentation is viewed as a by-product of software development, the primary object of reuse in software is the software component. However, the description of software components is included in documentation, and software documentation retrieval and reuse is also important. Software documentation can be retrieved by software component names or software component attribute values. Pursuing the development of a given software product relies on retrieval based on that product’s name. For example, in order to design a product X (or write its design document), one needs to locate the requirements document for X. Parts of the requirements document of X may be reused for the design document. Such parts include introductory descriptions, as well as requirements of the software that are not directly resolved at the design phase (e.g. quality assurance and integration testing).

Retrieval based on component attribute values is useful for software reuse. If the developer has built a level i (where each level corresponds to a development activity) partial description (in terms of attribute values) of a product X and looks for similar products that may have been developed, then a close match (say of product Y) can be used in two ways:

• Suggest ways of completing the description of X based on the match. This is particularly useful if there are conceptual dependencies between the different attributes.

• Provided that a level i+1 description of Y has been developed, use the similarities between the level i descriptions to infer those between the level i+1 descriptions; in other words, reuse the level i+1 description of Y.

By virtue of its software component-orientation (vs. document-orientation), the documentation management of SoftClass is particularly suited to retrieval and reuse based on attribute values.

Examples

Textbook. We are writing a textbook on collaborative hypertext. Since one heading in the Hypertext textbook conveniently matches ‘collaborative hypertext’, we easily find one part of Hypertext as a candidate for reuse. In addition to reusing parts of the book on hypertext, we also want to use parts of a dissertation entitled Discussion and annotation. The Discussion and annotation document has a regular structure based on discussion vs. annotation at the top level and then ‘Literature review’, ‘Model’, ‘Experiment’, and ‘Discussion’ at the next lower level. The relationship between the headings of the dissertation and of the textbook would not be evident to a computer program that used only word match, but use of a thesaurus might improve matching. A relationship between ‘history’ and ‘literature review’ can be established in the thesaurus and be referenced by the reuse algorithm.
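The difference between plain word match and thesaurus-assisted match can be illustrated with a short sketch. The function names, the representation of the thesaurus as a dictionary of related terms, and the textbook heading list are assumptions made for the example; only the ‘history’ and ‘literature review’ relationship comes from the text.

```python
def words(heading):
    """Lower-cased set of words in a heading."""
    return set(heading.lower().split())


def related(a, b, thesaurus):
    """True if two headings share a word or are linked in the thesaurus."""
    if words(a) & words(b):
        return True
    a_key, b_key = a.lower(), b.lower()
    return b_key in thesaurus.get(a_key, ()) or a_key in thesaurus.get(b_key, ())


def match_headings(source, target, thesaurus=None):
    """All (source, target) heading pairs judged to be related."""
    thesaurus = thesaurus or {}
    return [(s, t) for s in source for t in target if related(s, t, thesaurus)]


textbook = ["History", "Principles", "Systems"]  # hypothetical heading list
dissertation = ["Literature review", "Model", "Experiment", "Discussion"]
thesaurus = {"history": {"literature review"}}

print(match_headings(textbook, dissertation))             # word match only
print(match_headings(textbook, dissertation, thesaurus))  # with thesaurus
```

With word match alone no pair is found; adding the single thesaurus entry is enough to relate ‘History’ to ‘Literature review’.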

Software. CTTM has established a set of documentation standards for all development activities, including a nomenclature for documentation file names. While this supports retrieval based on component name, as the pool of documentation grows, a more sophisticated retrieval strategy will be needed. Attribute-based document retrieval and reuse can be illustrated by the following example. Attribute values were defined for several software code modules developed at CTTM. The ‘Function’ attribute for module ‘handle-bad-server’ in one software system was matched against the ‘Function’ values of other modules in that system. The closest match was to the ‘abend-server’ module. The functional decomposition of the abend-server proved to be identical to that needed for the handle-bad-server. This was true despite the fact that handle-bad-server and abend-server have different inputs and outputs. Given the small number of modules in the exercise, we cannot reasonably argue that modules that have the same ‘Function’ have the same components. However, we expect the sub-modules to perform similar functions; i.e. we expect those sub-modules to have similar values for the ‘Function’ attribute.

Figure 4. Data flow diagrams and structure charts for systems A and A’ (panels: data flow diagram, structure chart)

¹²MHASHI, M., RADA, R., YOU, G.-N, ZEB, A., MICHAILIDIS, A. AND MILI, H. (1991). Word frequency based indexing and authoring. In: Computers and writing: the state of the art. Oxford: Intellect Books. (Patrik Holt, ed.)
¹³MILI, H. (1990). Reusing software: issues and research direction. Technical Report. Montreal: Department of Mathematics and Computer Science, University of Quebec, June.

Matches between software documents might also support reuse across levels. In Figure 4, software modules A and A’ have a number of submodules in common in the data flow diagram. The existing structure chart for A and the data flow diagram for A’ may be used in predicting the structure chart for A’.

Systems

The MUCH system supports finding and reusing textbook sections in multiple ways. The author-generated indices are accessible in a separate window and, on finding a relevant index term, the user can jump directly to the associated textbook sections. The MUCH system also generates an index based on word frequency patterns and this can be used in finding relevant sections.¹² When the user finds sections which he wants to reuse in another document, he can then cut and paste material from one document into another.

In SoftClass, software components are located based on their attribute values using classification. The user can specify an arbitrary number of attributes to be used for classification. Roughly speaking, if A is a software product with attribute-value pairs $(a_1, v_1), \ldots, (a_n, v_n)$, classification with respect to $a_1, \ldots, a_n$ returns the software products with the most specific values for attributes $a_1, \ldots, a_n$ that are identical to, or more general than, $v_1, \ldots, v_n$.¹³
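One possible reading of this classification step is sketched below. It is not the SoftClass implementation: we assume that attribute values are arranged in a tree recorded in a hypothetical `parent_of` map, that every product carries a value for each chosen attribute, and the product names and values are invented for illustration.

```python
def generalizations(value, parent_of):
    """The value itself followed by all of its ancestors in the value tree."""
    chain = [value]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def at_least_as_general(b, a, attrs, parent_of):
    """True if each chosen attribute value of b equals or generalizes a's."""
    return all(b[t] in generalizations(a[t], parent_of) for t in attrs)


def classify(query, products, attrs, parent_of):
    """Products that generalize the query and are not more general than another hit."""
    hits = {name: p for name, p in products.items()
            if at_least_as_general(p, query, attrs, parent_of)}
    result = []
    for name, p in hits.items():
        dominated = any(
            other is not p
            and at_least_as_general(p, other, attrs, parent_of)
            and any(p[t] != other[t] for t in attrs)
            for other in hits.values()
        )
        if not dominated:
            result.append(name)
    return sorted(result)


# Hypothetical value tree (child -> parent) and product descriptions.
parent_of = {
    "abort server": "handle failure",
    "restart server": "handle failure",
    "handle failure": "manage process",
    "tcp": "network",
}
products = {
    "P1": {"function": "manage process", "medium": "network"},
    "P2": {"function": "handle failure", "medium": "network"},
    "P3": {"function": "restart server", "medium": "tcp"},
}
query = {"function": "abort server", "medium": "tcp"}

print(classify(query, products, ["function", "medium"], parent_of))
print(classify(query, products, ["medium"], parent_of))
```

Restricting the attribute list changes the answer, which reflects the remark that the user chooses which attributes drive classification.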


Table 1. Values of attributes vs. objects. Rows correspond to attributes and columns to objects. object_9 is a parent of object_1 and object_2. object_10 is the original ancestor of all the other objects

Attribute      Objects
                1   2   3   4   5   6   7   8   9  10

Small-volume    +   +   +   +   -   -   -   -   +   +
Large-volume    -   -   -   -   +   +   +   +   -   +

Principles      +   +   -   -   +   +   -   -   +   +
Systems         -   -   +   +   -   -   +   +   -   +

Interface       +   -   +   -   +   -   +   -   +   +
Database        -   +   -   +   -   +   -   +   +   +

Reorganizing documents

The traditional approach to document reuse focuses on classification and retrieval. With hypertext systems, however, users may expect to discover multiple organizations of a single document,¹⁴ and these different organizations constitute a kind of document reuse.

Principles

Given a document which was characterized as objects with attributes, we can generate links between objects based on relationships among attribute values. In this way, outlines or tables of contents can be generated. Furthermore, the abstraction of one document may support the generation of multiple outlines. One algorithm for establishing such relationships follows:

• For existing objects x and y, make x the parent of y when each attribute t is such that

$\mathrm{value}(t, y) \subseteq \mathrm{value}(t, x)$

• If additional objects are to be created in order to build the outline, then create object z and make it the parent of y such that

$\mathrm{value}(t, y) \subseteq \mathrm{value}(t, z)$

A large number of different outlines for one document can be generated when the attributes are appropriately created. One method of document reorganization for the case of multiple documents is to select objects with certain attribute values, and to organize the objects according to the method for single-document reorganization, except that a preferential weighting is given to connecting parts from the same document.
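The first rule of the algorithm (linking existing objects by the subset relation on their attribute values) can be sketched as below. Two choices are our own assumptions rather than part of the paper: when several objects qualify as parent, the one with the fewest values is chosen, and the ten objects follow Table 1 as we read it, with hypothetical attribute type names `scale`, `kind` and `layer`.

```python
from itertools import product


def covers(x, y):
    """True if value(t, y) is a subset of value(t, x) for every attribute t."""
    return all(y[t] <= x[t] for t in y)


def size(obj):
    return sum(len(values) for values in obj.values())


def build_outline(objects):
    """Map each object to its most specific covering object (its parent)."""
    parent = {}
    for name, y in objects.items():
        candidates = [n for n, x in objects.items()
                      if n != name and x != y and covers(x, y)]
        if candidates:
            parent[name] = min(candidates, key=lambda n: size(objects[n]))
    return parent


def render(parent, root, depth=0):
    """Indented outline rooted at the given object."""
    lines = ["  " * depth + f"object_{root}"]
    for child in sorted(n for n, p in parent.items() if p == root):
        lines.extend(render(parent, child, depth + 1))
    return lines


def make(scale, kind, layer):
    return {"scale": frozenset(scale), "kind": frozenset(kind), "layer": frozenset(layer)}


scales = ["small-volume", "large-volume"]
kinds = ["principles", "systems"]
layers = ["interface", "database"]

# Objects 1-8 are the leaves (one value per attribute); 9 and 10 are unions.
objects = {i: make([s], [k], [l])
           for i, (s, k, l) in enumerate(product(scales, kinds, layers), start=1)}
objects[9] = make(["small-volume"], ["principles"], layers)
objects[10] = make(scales, kinds, layers)

parent = build_outline(objects)
print("\n".join(render(parent, 10)))
```

The second rule (creating a new object z) would add further union objects to `objects` before `build_outline` is called; the sketch does not attempt to choose which unions to create.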

Examples

Textbook. Part of the Hypertext textbook may be seen as a set of objects with attributes of the types:

• small-volume vs. large-volume;
• principles vs. systems; and
• interface vs. database.

The objects which have a singleton value on each attribute correspond to leaf nodes of a ‘table of contents’ (and are objects object_1 through object_8 in Table 1). From these ‘leaf’ objects with singleton values can be generated new objects whose attribute values are unions of the values of the leaves (two examples are object_9 and object_10 in Table 1). If for an object each of these three attributes can have one or both of its values, then the number of possible objects is $(2^2-1)^3 = 27$. More generally, from $n$ attributes with $k$ values each, $(2^k-1)^n$ objects can be formed. Many ‘tables of contents’ can be generated from these leaf objects. For instance, $n!$ ‘tables of contents’ are possible when:

• the root node has for each of n attributes the union of all values; and
• the immediate descendants of a node X are the same as X, except along one attribute the descendants are single-valued.

¹⁴PARUNAK, H. VAN DYKE (1989). Hypermedia topologies and user navigation. Proceedings Hypertext ‘89, New York: Association of Computing Machinery, pp. 43-50.

Figure 5. Data flow diagram for high-level query system (diagram labels: ‘User’, ‘User report’)

1. HLQS
  1.1 Function
  1.2 Inputs/Outputs
  1.3 Performance Requirements
  1.4 Structure
    1.4.1 TRANSLATE
      1.4.1.1 Function
      1.4.1.2 Inputs/Outputs
    1.4.2 QUERY
      1.4.2.1 Function

Figure 6. Aggregation-major outline for high-level query system (HLQS)

1. Function
  1.1 HLQS
    1.1.1 TRANSLATE
    1.1.2 QUERY
    1.1.3 PRINT
2. Performance requirements
  2.1 HLQS
    2.1.1 TRANSLATE

Figure 7. Attribute-major outline for HLQS

¹⁵Op. cit., Ref. 9.
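The counts stated in the textbook example of this section (27 possible objects for three two-valued attributes, and n! tables of contents) can be checked by enumeration. The sketch below assumes, as one reading of the two conditions, that all nodes at a given depth are specialized along the same attribute, so that each table of contents corresponds to an ordering of the attributes; the function names are ours.

```python
from itertools import combinations, permutations, product
from math import factorial


def nonempty_subsets(values):
    """All non-empty subsets of a list of attribute values."""
    return [frozenset(c)
            for r in range(1, len(values) + 1)
            for c in combinations(values, r)]


def possible_objects(attributes):
    """Every combination of one non-empty value set per attribute."""
    return list(product(*(nonempty_subsets(v) for v in attributes.values())))


def table_of_contents(attributes, order, depth=0):
    """Outline that specializes one attribute per level, in the given order."""
    if not order:
        return []
    first, rest = order[0], order[1:]
    lines = []
    for value in attributes[first]:
        lines.append("  " * depth + value)
        lines.extend(table_of_contents(attributes, rest, depth + 1))
    return lines


attributes = {
    "scale": ["small-volume", "large-volume"],
    "kind": ["principles", "systems"],
    "layer": ["interface", "database"],
}

n_objects = len(possible_objects(attributes))
outlines = {tuple(table_of_contents(attributes, order))
            for order in permutations(attributes)}

print(n_objects, len(outlines))
print("\n".join(table_of_contents(attributes, ("kind", "scale", "layer"))))
```

The root node (which holds the union of all values) is left implicit in each generated outline.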

Software. The existing outlines of software documentation at CTTM do not lend themselves to being viewed in multiple ways based on the principles advanced here. However, the component-orientedness of software descriptions enables us to generate documentation with alternative organizations. Roughly speaking, we generate documents by ‘traversing’ the object hierarchy of software descriptions of that level starting with the description of the product we wish to document, and printing values of a select set of attributes. The set of attributes whose values are printed during traversal determines the type of document.

As for the organization of the document, we can use two traversal strategies, which we call aggregation-major and attribute-major. With aggregation-major traversal, we start with the root software description of the software object hierarchy, and perform a depth-first traversal of the object hierarchy, where visiting a node consists of printing its attribute names and attribute values. Consider the software component HLQS (High-Level Query System), which provides a high-level, user-friendly interface to relational databases (see Figure 5). Figure 6 shows an outline generated by aggregation-major traversal. With attribute-major traversal, for each attribute, we start with the root software description of the software object hierarchy, and perform a depth-first traversal, printing the value of that attribute for each visited node. The outline in Figure 7 illustrates attribute-major traversal.
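The two traversal strategies can be sketched over a small component hierarchy. This is an illustration only and not the SoftClass code: the `Component` class is hypothetical, the attribute values are placeholder text, and only the component names (HLQS, TRANSLATE, QUERY, PRINT) and attribute names come from the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A software description: named attributes plus sub-components."""
    name: str
    attributes: dict
    parts: list = field(default_factory=list)


def aggregation_major(component, depth=0):
    """Depth-first over components; print all attributes at each node."""
    pad = "  " * depth
    lines = [pad + component.name]
    for attr, value in component.attributes.items():
        lines.append(f"{pad}  {attr}: {value}")
    if component.parts:
        lines.append(pad + "  Structure")
        for part in component.parts:
            lines.extend(aggregation_major(part, depth + 2))
    return lines


def attribute_major(root, attribute_names):
    """For each attribute in turn, depth-first over components printing its value."""
    lines = []

    def visit(component, attr, depth):
        if attr in component.attributes:
            lines.append("  " * depth + f"{component.name}: {component.attributes[attr]}")
        for part in component.parts:
            visit(part, attr, depth + 1)

    for attr in attribute_names:
        lines.append(attr)
        visit(root, attr, 1)
    return lines


hlqs = Component(
    "HLQS",
    {"Function": "query a relational database",
     "Inputs/Outputs": "user query / user report"},
    [
        Component("TRANSLATE", {"Function": "translate the query"}),
        Component("QUERY", {"Function": "run the query"}),
        Component("PRINT", {"Function": "format the report"}),
    ],
)

print("\n".join(aggregation_major(hlqs)))
print("\n".join(attribute_major(hlqs, ["Function", "Inputs/Outputs"])))
```

Both outlines are produced from the same object hierarchy; only the nesting order of components and attributes differs.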

Systems

The MUCH system has been applied to the reorganization of individual documents, particularly books. Its current implementation emphasizes the use of a graph-like hypertext structure, but this may be seen as a transformation of the object-attribute representation. From this graph, the MUCH system strategy for generating alternate outlines is to traverse the graph with a modified, depth-first traversal. This traversal produces a tree from the graph. The traversal program takes as a parameter an ordering on attribute types, and when choosing the next object to visit will consider the ordering on object types. Traversals can start from any specified node and stop at any specified node, and the document corresponding to a traversal may be automatically generated and printed.
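A depth-first traversal that turns a graph into an outline, with the visiting order steered by an ordering on types, might look like the sketch below. It is not the MUCH traversal program: we assume that the ordering applies to link types, that each node is placed in the outline once (at its first visit), and the example graph and type names are hypothetical.

```python
def outline_from_graph(links, start, type_order):
    """Return (depth, node) pairs for a tree extracted from a typed graph.

    links is a list of (source, link_type, target) triples; outgoing links
    are followed in the order given by type_order, ties keeping list order.
    """
    rank = {t: i for i, t in enumerate(type_order)}
    visited = {start}
    tree = []

    def visit(node, depth):
        tree.append((depth, node))
        outgoing = [(rank.get(t, len(rank)), target)
                    for source, t, target in links if source == node]
        for _, target in sorted(outgoing, key=lambda pair: pair[0]):
            if target not in visited:
                visited.add(target)
                visit(target, depth + 1)

    visit(start, 0)
    return tree


links = [
    ("Hypertext", "chapter", "Small-volume"),
    ("Hypertext", "chapter", "Large-volume"),
    ("Hypertext", "theme", "Principles"),
    ("Principles", "instance", "Small-volume principles"),
    ("Small-volume", "section", "Small-volume principles"),
]

by_chapter = outline_from_graph(links, "Hypertext", ["chapter", "section", "theme", "instance"])
by_theme = outline_from_graph(links, "Hypertext", ["theme", "instance", "chapter", "section"])

for depth, node in by_chapter:
    print("  " * depth + node)
for depth, node in by_theme:
    print("  " * depth + node)
```

Changing the type ordering moves ‘Small-volume principles’ from under ‘Small-volume’ to under ‘Principles’, giving two outlines of the same graph.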

Since the MUCH system is not generally available, programs have been written to translate the output of the MUCH system into four widely available hypertext systems, namely, Emacs-Info, Guide, HyperTies, and SuperBook.¹⁵ The Guide and HyperTies versions allow the user to select from two different outlines when reading the book. The Emacs-Info version allows the user to generate alternate outlines from the hypertext graph.


Discussion

This paper has been organized in a way that would lend itself to reorganization. Basically, the objects within the paper have three attributes which can be defined by their attribute values as:

• (1) ‘organizing’ vs. ‘finding and reusing’ vs. ‘reorganizing’;
• (2) ‘principles’ vs. ‘examples’ vs. ‘systems’; and
• (3) ‘textbook’ vs. ‘software’.

This paper has made the top-level headings accord with attribute (1), the second-level headings with attribute (2), and the third-level with attribute (3). Attribute (1) values address the method of dealing with documents so that they can be organized, reused and reorganized. Reuse has been a popular theme in many document-related disciplines, such as software engineering, but the prominence of dynamic reorganization methods is particularly linked to hypertext advances.

If this paper were to be reorganized so that ‘principles’, ‘examples’, and ‘systems’ were the top-level issues, then we would first stress that documents need to be identified before they can be reused. In other words, document(s) are first organized or classified. Then one may either use this organization to find a part of a document and reuse it elsewhere, or one may reorganize the document. A major question in the organizing of documents is whether to exploit existing outlines and indices or to manually create new abstractions for the documents. Examples of outlines and other abstractions for existing documents show what features of these organizations lend themselves to reuse and reorganization. The computer system SoftClass supports people in developing highly structured, object-oriented descriptions of software and then automatically locating related descriptions. The computer system MUCH helps people exploit existing outlines of documents and connects to various hypertext presentations.

The textbook vs. software dimension raises many interesting problems of audience and economics. Textbooks are only intended for human consumption, whereas software documents may be required to make sense to a computer as well as to a person. We hypothesize that the ideal document for a particular user and task is so uniquely tailored to that user and task that reuse and reorganization are not particularly valuable ways to obtain that ideal document. On the other hand, when economic constraints do not allow a document to be highly tailored, then reuse or reorganization may be valuable whether the document be from the textbook or software domain.
