Second international conference on automatic processing of art history data and documents, Pisa: II. Using computers for art history and collections management: The problems and some

56 Automatic Processing of Art History Data

Pisa Conference (contd)

II. Using Computers for Art History and Collections Management: The Problems and Some Solutions

ROB DIXON

Pisa was chosen as the venue for the Second International Conference on the ‘Automatic Processing of Art History Data and Documents’, held at the Scuola Normale Superiore on 24-27 September 1984. The joint organizers, the Scuola Normale Superiore and the J. Paul Getty Trust, hoped for about 120 delegates. Space limitations, rather than level of interest, limited the actual attendance to 350 people. The organizers must have had many problems accommodating almost three times the number of participants expected, but, if so, this was not apparent to those attending this well-organized conference.

The programme consisted of a series of nine sessions on different topics at which a leading authority gave an overall summary of a particular aspect of the topic, normally followed by a panel of three or four other experts in the field contributing their own viewpoints. When time permitted, delegates were able to add their own comments or ask questions. There were also several live demonstrations each day. Valuable documentation provided to delegates included three volumes of preprints, two containing 49 papers specially prepared for the conference, the third being a census of ‘Computerization in the History of Art’. The proceedings of the conference will be available from the Scuola Normale in due course, so rather than attempt to detail the discussions, several problem areas which were raised throughout the conference are reviewed here, as well as their suggested solutions.

The broad aim of scholars is to create a computer-based research tool which will be able to store and allow fast and easy retrieval of any information relating to the history of art. This would include catalogues of all object types, biographical details, bibliography, iconography, and relevant social, economic, political and art history. If all the data available were stored in one totally integrated central system, accessible to all, expanded and amended by those with appropriate knowledge and authority, then such a system should assist researchers to make scholarly inferences from the information, deductions which would not be arbitrarily restricted to any one collection or information source. Such a central database would require enormous research and entry of data. This would, however, be somewhat less than the work entailed in compiling a large number of separate projects, and would eventually achieve much greater accuracy, as prime sources of information would become more easily accessible and eliminate conflicts between secondary sources. Such a system must also serve collection management needs. The general consensus of the conference was that this was a valid objective, and the obstacles to its achievement were discussed at some length.

The obstacles were of two kinds: firstly the difficulty of considering the concept of a system that could be ‘all things to all men’, when some institutions who might wish to participate have genuinely different requirements from others with collections of similar objects; the second was finding a technical solution in terms of computer hardware and software. The two main obstacles to the overall concept were ‘standards’ and ‘vocabulary control’. An example of the standards problem was the idea that all museums wishing to catalogue paintings would have to agree a common standard or structure of information to be entered about paintings before a computer system could be designed to catalogue them. Vocabulary control was necessary to ensure standard terminology, but there was debate on just what terminology, particularly in iconography, was appropriate, and how control should be exercised, although subject thesauri were clearly the appropriate tools.

No attempt was made at the conference to specify exactly what was required from an integrated system, so an outline of the basic requirements is given here; the full specifications are much more complex. In an integrated system, information of every type would be held together so that operators could browse at random through all catalogues, biography, history, etc. Each catalogue would have features for storing data particularly relevant to the type of object, such as author for a book, or artist for a painting. Information about artists, such as dates and places of birth and death, would be held, not in the catalogue entry for each work by that artist, but in biographical records, thus reducing data redundancy. Links would be established between the artists’ records and each of their works, the link being the function or relationship (in this case artist) between the artists’ records and the catalogue entries. If an artist also wrote a book, and engraved prints, there would also be similar links between the biographical record and the relevant book and print records, the functions being author and engraver. Thus, objects in three separate catalogues would be connected through different relationships to the same biographical record. An inquirer to the biographical record could use the relationships to navigate to the related painting, book and print records. Alternatively, you could, say, locate a painting record for the artist, then use the relationships to find details of the artist’s life from the biographical record, and then move from that to other paintings by the same artist, or to the related book or print records.
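The linking scheme outlined above can be illustrated with a minimal sketch. It is an illustration of the principle only, not a description of any system shown at the conference: the class name, record identifiers and titles are all hypothetical. One biographical record is linked to entries in three catalogues, each link carrying the function (artist, author or engraver), and every link can be followed in either direction.

```python
from collections import defaultdict


class LinkedStore:
    """Records of any type, joined by named relationships (hypothetical sketch)."""

    def __init__(self):
        self.records = {}               # record id -> dict of fields
        self.links = defaultdict(set)   # record id -> {(function, other record id)}

    def add(self, rec_id, **fields):
        self.records[rec_id] = fields

    def relate(self, person_id, function, object_id):
        # Store the link once in each direction, so that either record
        # can be the starting point for navigation.
        self.links[person_id].add((function, object_id))
        self.links[object_id].add((function, person_id))

    def related(self, rec_id, function=None):
        # All linked record ids, optionally restricted to one function.
        return sorted(other for f, other in self.links[rec_id]
                      if function is None or f == function)


store = LinkedStore()
# The biographical details are held once, not repeated in each catalogue entry.
store.add("bio:1", name="John Smith", born="place and date held here")
store.add("painting:7", title="A hypothetical painting")
store.add("book:3", title="A hypothetical treatise")
store.add("print:9", title="A hypothetical engraving")

store.relate("bio:1", "artist", "painting:7")
store.relate("bio:1", "author", "book:3")
store.relate("bio:1", "engraver", "print:9")

# From a painting to its artist, then on to everything else by that person.
artist_id = store.related("painting:7", "artist")[0]
other_works = store.related(artist_id)
```

Because the catalogue entry points at one specific biographical record rather than at a name, two people called ‘John Smith’ would simply be two records with different identifiers.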

Searching through the text of the different catalogue entries to find all occurrences of the artist’s name would not be an adequate solution. On a large multi-user system with many records this could consume too high a proportion of computer resources, suddenly causing an unacceptable degradation in response time for other operators, whilst failing to give a timely response to the inquirer. In any event, such a system does not properly handle duplicate names. If the artist’s name is ‘John Smith’ and there is more than one ‘John Smith’, the search of the catalogue entries alone would not be able to select only the ‘John Smith’ required. Creating a relationship between the catalogue entry and the correct ‘John Smith’ selected from the biography file overcomes this problem, and ensures that the operator creating the catalogue has to decide which ‘John Smith’ is correct. This is a form of vocabulary control not generally possible in free text systems. The main reason that a totally integrated system has seemed impossible to achieve is that, using normal techniques of system design and development, it would be so very complex and difficult to construct. To provide links between catalogues and biographical records of authors, artists, engravers, etc., would require special programming for each link; and links in the reverse direction, from biography to catalogue entries, would require further programming for each link. These are only some of the links that would be required.

The conference had no overall solution to the problem of data standards nor of producing an integrated system. The demonstration systems, which varied considerably in user friendliness and the facilities provided, were all designed to solve particular cataloguing problems and did not appear readily adaptable to other types. None was integrated. Even if standards of information requirements for particular object or other information types could be agreed, the idea of achieving an overall standard that would allow the construction of a totally integrated system seemed only a dream to some. These were just a few of the many problems which needed solving when the objectives of STIPPLE (System for Tabulating and Indexing People, Possessions, Limnings and Ephemera) were first considered in June 1980. The general aims of STIPPLE were very similar to those discussed at the conference.

STIPPLE has overcome all the problems discussed there, and went live in February 1983. Before explaining how the solutions to the problems evolved, it is necessary to understand the general specifications of STIPPLE when design commenced.

1. The system had to be totally integrated to allow browsing and full name and vocabulary control. It must be possible to bypass menus when required.

2. It had to be ‘all things to all men’, however ambitious an aim this might seem. If this could not be achieved, then the justification for and value of the whole project would be much reduced. The system must be online and real-time.

3. Whilst agreement on basic items to be recorded about particular object types could probably be achieved between different users, it was unrealistic and undesirable to expect agreement on all types of data, as this in itself implied an artificial limit to the variety and amount of data that could be recorded.

4. It was unreasonable to expect possible users to define in advance all their requirements for the types of data to be stored, or the eventual uses of the system and the types of inquiry, as these would evolve and continue to change with experience of using the system. In many cases, users did not have adequate manual systems, so were not sure of their needs. The system must be totally open-ended to cater for changing and unanticipated facilities.

5. It was necessary to be able to establish an unlimited number of relationships between any records of any type. It must be possible to have an unlimited number and type of secondary access paths to any record. The use of these features should not be pre-defined in programs, so that the types and numbers of relationships could be changed by authorized operators without any programming changes being required.

6. Variable-length records and repeated fields and groups of fields were necessary, as well as unlimited free text.

7. It should be possible to operate the system without the use of code books.

8. The system had to be very efficient in its use of computer resources to allow a large number of users to operate concurrently with fast response times.

9. The system must require minimal operator training, particularly for the casual inquirer, and have a standard method of operation, whatever the type of data.

We began to realize that the way to avoid imposing a standard solution upon users was to devise a standard, highly structured method of storing data in the computer, irrespective of its type or purpose. Then, by controlling access to the various elements of data, it would be possible to provide different views of those data to suit each user. At the same time, this would make an integrated solution easier to implement. It was only when we learnt about the advanced architecture of the IBM System/38 that we were able to find a method of achieving our data structure. The System/38 is a virtual machine with 64 bit addressing. This means that it can address at any one time 2^64 pages each of 512 bytes. This is not twice the amount of the more normal 31 or 32 bit machines, but the square of it, a theoretical total of over 9 400 000 000 000 000 000 000 bytes of storage immediately accessible, all of which could be in one file. This is way in excess of the total capacity of all disc drives currently attached to all computers of any make in the world. At present 6.1 billion characters of disc storage can be attached to the System/38, although this is likely to increase, but the addressing structure allows a new approach to storing and retrieving records. This was the machine we selected.

The addressing structure allowed us to reserve in that structure, but not in disc storage, a space
for a very large number of different data elements or types relating to any catalogue entry, biographical record, subject thesaurus record, etc. Each data element could have a very large number of characters of storage, but again using only the space actually required. Thus, purely as an example of the principles involved and not as an explanation of how STIPPLE actually works, we could in the addressing structure accommodate 4700 different major categories of data, e.g. biography, catalogues for many object types, etc., allow space for a million records in each, and within each record subdivide it into a possible one million different elements of data, and within each element have the capability of storing a million bytes of data. Although at first this would appear to cope with most situations, it still does not give adequate flexibility (there could be more than one million objects in any catalogue), and the actual principles we have used are much more complex than this. Nevertheless, taking the example, it is difficult to imagine that all museums and experts in the world together could between them request a million different types of data about any one object type; and this is not the purpose of the method, which is to ensure that we will never run out of space in the addressing structure.
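The figures quoted can be checked with a few lines of arithmetic. The sketch below assumes the address space is read as 2^64 pages of 512 bytes each, which agrees with the stated total of over 9 400 000 000 000 000 000 000 bytes; the variable names are illustrative only.

```python
PAGE_BYTES = 512
ADDRESSABLE_PAGES = 2 ** 64

# Theoretical address space: 2^64 pages x 512 bytes = 2^73 bytes.
total_bytes = ADDRESSABLE_PAGES * PAGE_BYTES

# 64-bit page addressing gives the square of, not double, a 32-bit range.
is_square_of_32_bit = (2 ** 32) ** 2 == ADDRESSABLE_PAGES

# The illustrative subdivision from the text:
# 4700 categories x 1 million records x 1 million elements x 1 million bytes.
example_bytes = 4700 * 10**6 * 10**6 * 10**6

fits = example_bytes <= total_bytes
fraction_used = example_bytes / total_bytes

print(f"{total_bytes:,} bytes addressable")
print(f"example layout reserves {fraction_used:.1%} of the address space")
```

On these assumptions the example layout reserves roughly half of the address space, which is presumably why 4700 categories was chosen as the illustration.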

Such a system would be practical only if no space on the discs had to be reserved when it was not being used. A common method of using large virtual storage is to use ‘hash indexing’. In order to ensure that duplicate hash addresses are never generated, an algorithm must be developed which may leave large spaces in the numbering system. Initially this wastes a great deal of disc space, although as the database grows it becomes more efficient. We did not feel that this method of addressing was appropriate, and so chose to use a binary tree index. If records are always being accessed randomly, hash indexing does have speed advantages over a binary tree. Our researches suggested that in practice operators do not wish to access every record in a totally random sequence; they may wish to access one record randomly and then access sequentially the next few records in the database. In this case, using a binary tree index, only two movements of the disc access arm are required. This, being a mechanical movement, is much slower than reading records in sequence. With hash addressing, a head movement would be required for every record, and would therefore be slower overall. We have used an algorithm in conjunction with the binary tree index which allows us to compress the identifier of a record, for instance the identifying name, in such a way that the resulting internal identifier in the computer is always of the same length, yet maintains the records in alphabetical sequence. No disc space is wasted when it is not used, and none has to be reserved. All records of whatever type are put into one physical file, and we have designed a relational model which allows us, without having constantly to change that model, to handle situations far more complex than any real-world problem which we could imagine. Thus, we no longer need to consider our relational model.
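The access pattern described here, one random lookup followed by a short sequential read, can be sketched as follows. This is not STIPPLE's compression algorithm, which is not described in this article: a simple truncate-and-pad key stands in for it, and a sorted in-memory list stands in for the binary tree index. The key width, function names and sample names are assumptions made for the illustration.

```python
import bisect

KEY_LEN = 12  # hypothetical fixed width for the internal identifier


def fixed_key(name: str) -> bytes:
    """Fixed-length key whose byte order follows alphabetical order.

    Assumes plain ASCII names. Padding with spaces, which sort before
    letters, keeps 'SMITH' ahead of 'SMITHSON'.
    """
    return name.upper().encode("ascii")[:KEY_LEN].ljust(KEY_LEN, b" ")


class SortedIndex:
    """Sorted list standing in for the binary tree index."""

    def __init__(self):
        self._entries = []  # (fixed key, original name), kept in key order

    def add(self, name):
        bisect.insort(self._entries, (fixed_key(name), name))

    def scan_from(self, name, count):
        # One random positioning step, then a sequential read of neighbours.
        start = bisect.bisect_left(self._entries, (fixed_key(name),))
        return [n for _, n in self._entries[start:start + count]]


index = SortedIndex()
for person in ["Stubbs", "Paxton", "Smithson", "Dixon", "Smith"]:
    index.add(person)

next_three = index.scan_from("Smith", 3)
```

With a hash index the three neighbouring names would normally sit at unrelated addresses, so each would need its own lookup; in an ordered index they are adjacent.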

The method we have in fact used allows about a billion (10^12) elements of data about a billion objects in any one catalogue (or in the biographical file, etc.). Each element of data can be very large, if required, and is given a name, although different users can be given different names for the same element of data: thus, one might prefer to use the term ‘iconographical content’ for an index of visual representations in a work of art, whilst another user might prefer the term ‘subject index’. Equally, not all users will require access to all data elements, and by providing separate lists for users with special requirements they can be allowed to see only those elements that they require. STIPPLE even allows users who wish to reference a particular data element by the same name to store and retrieve data which suit only their requirements. These need not be the same data as others might use in the same context. An example of this is where different users have different requirements for indexing the iconographical content of an image. A Museum of Leather might be interested only in objects made of leather that can be identified in any work of
art, and go into considerable detail about these. It might, for instance, wish to identify blinkers on horses, different types of harness and bridles, identify all horses where a martingale was used, and perhaps distinguish between racing-, hunting- and side-saddles. Whilst a standard method of indexing subject content should be able to handle this, including such specialized detail in a standard subject index for a particular work of art would distract other users from those items which are of more general interest, and might cause considerable annoyance to them. Equally, a Museum of Leather might not be interested in any other elements in the picture, perhaps not even the type or breed of horse, unless the harness, etc., were made for a particular type of horse.

In STIPPLE, it is possible to set up the system so that the special needs of particular users are kept separately from the more general requirements. It would be up to the museum with the specialized interest to decide whether they wanted access to only those items relevant to their needs or the wider range of data. This one file not only contains user data but also menus (lists of types of data which allow users to navigate through the data and control the routes they can take), and also option tables. Most systems, unlike STIPPLE, tend to have separately compiled programs for each menu. The option tables tell the main processing program how to process the particular subset, or table, of data in the relational database which is currently being processed. This allows ‘run-time’ modification of the main processing program, even though the modifications are stored as data in a file which is processed by that program. Since the menus and the option tables are stored in exactly the same way as user data, they can also be added to, changed and deleted in real-time whilst the system is in use. Special menus and option tables can be created for any user or department or operator. This approach allows STIPPLE to be tailored very easily to the requirements of any user, and for their ‘view’ of the data to be changed when required, without affecting the facilities available to anyone else. Most of the changes can be made dynamically without any programming being necessary.
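The idea of holding menus and option tables as data, rather than compiling them into programs, can be sketched as below. The element numbers, user names and headings are hypothetical, and the sketch shows only the principle of per-user views over shared data, not how STIPPLE itself stores its option tables.

```python
# One shared record: element id -> stored value (ids and values are hypothetical).
painting = {
    101: "two mares and a foal beside a river",
    102: "oil on canvas",
    103: "martingale; side-saddle",
}

# Option tables are ordinary data: which elements each user may see,
# and the heading under which that user sees them.
option_tables = {
    "general": {"labels": {101: "subject index", 102: "medium"}},
    "leather_museum": {"labels": {101: "iconographical content",
                                  103: "leather objects"}},
}


def view(record, user):
    """Return the record as one user sees it, using that user's headings."""
    labels = option_tables[user]["labels"]
    return {labels[eid]: value for eid, value in record.items() if eid in labels}


general_view = view(painting, "general")
leather_view = view(painting, "leather_museum")

# A run-time change to one user's option table: no program is altered,
# and the other user's view is unaffected.
option_tables["general"]["labels"][103] = "harness details"
updated_general_view = view(painting, "general")
```

The single `view` function plays the part of the one main processing program: its behaviour for each user is determined entirely by the table it reads.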

The System/38 uses only one copy of an active program, even when it is in use by many operators. In STIPPLE, just one program does all file accessing and updating of the main file, irrespective of the type or purpose of the data, so that all operators, whatever their requirements, will always be using this one program although they are unaware of its existence. In fact, it was designed not particularly for art-historical data, but for any commercial application, including accounting, office administration, word processing, etc. This main program calls a few other programs which relate to particular general groups of record types. Up to 64 000 different record types can be defined as standard, and any or all of these can be modified as required for particular users. It is theoretically possible to have all 64 000 record types in one table, so that data in any one table can be copied to any other table. All these points are vital if the system is to be totally integrated.

By defining the content of each data table, and the relationships allowed between it and other tables, it is possible to define data structures to be followed by a user when both adding and accessing data. This linking facility has unlimited flexibility. In the list of design targets for STIPPLE given above, no mention was made of subject thesauri, and the full significance of these was not appreciated initially, although they had been considered. When the need for them became apparent, it took just one hour, without any programming, to use the facilities of the relational database to create an unlimited number of subject thesauri which can all be cross-referenced to each other with both hierarchical and lateral relationships, synonyms, hypernyms, antonyms and definitions of all the terms in the thesaurus. Each separate thesaurus is created as a subset of a master thesaurus, so that any term used in several separate thesauri has exactly the same meaning to the system. Any term or phrase can be included in many hierarchies in each thesaurus, if required, and there is no effective limit to the number of levels in a hierarchy. In addition, it is not necessary always to travel through hierarchical relationships to find a
lower-level term, as all the allowed terms appear together in any particular thesaurus even when there are hierarchical relationships between some of the terms. This means also that the system acquires what loosely might be described as ‘knowledge’. Thus, once the relationships have been defined in the thesaurus, such as, for instance, a mallard is a type of wild duck, which is in turn a type of waterfowl, which in turn is a type of bird (these relationships are not intended to be ornithologically correct!), then in future any reference to a mallard will be automatically connected to the above links. The particular subject thesaurus in use can be defined in the option table for a specific data table, to ensure strict vocabulary control when adding records to the data table. As users can have their own option tables for a data table, each user or department or even operator can be given different thesauri for vocabulary control to suit their needs.
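The mallard example can be written out as a small sketch of a hierarchical thesaurus used both for vocabulary control and for retrieval. The class, method names and record identifiers are hypothetical; only the chain mallard, wild duck, waterfowl, bird comes from the text.

```python
from collections import defaultdict


class Thesaurus:
    """Hypothetical sketch: terms with 'is a type of' links to broader terms."""

    def __init__(self):
        self.broader = defaultdict(set)  # term -> set of broader terms
        self.terms = set()               # flat list of all allowed terms

    def add(self, narrower, broader):
        # A term may be given several broader terms, i.e. sit in many hierarchies.
        self.broader[narrower].add(broader)
        self.terms.update((narrower, broader))

    def is_allowed(self, term):
        # Vocabulary control: terms are checked against the flat list,
        # without walking down any hierarchy.
        return term in self.terms

    def ancestors(self, term):
        seen, stack = set(), [term]
        while stack:
            for parent in self.broader.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


thesaurus = Thesaurus()
thesaurus.add("mallard", "wild duck")
thesaurus.add("wild duck", "waterfowl")
thesaurus.add("waterfowl", "bird")

# Records indexed by a cataloguer under controlled terms.
indexed = {"print:1": {"mallard"}, "print:2": {"horse"}}


def search(term):
    """Records indexed under the term itself or under any narrower term."""
    return sorted(rec for rec, terms in indexed.items()
                  if any(t == term or term in thesaurus.ancestors(t) for t in terms))


waterfowl_hits = search("waterfowl")
```

A record indexed only under ‘mallard’ is found by an inquiry for ‘waterfowl’ or ‘bird’, because the relationship was defined once in the thesaurus rather than in each catalogue entry.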

Whilst at the Pisa conference the need for vocabulary control for cataloguers and indexers was properly recognized, no real mention was made of the needs of researchers who cannot be expected to know the allowed terminology used by a cataloguer. STIPPLE allows researchers to enter their own terms and, through hierarchical and lateral relationships, find the terminology which is allowed and then to find the records which have been indexed under that terminology. Since the method of storing data in STIPPLE is the same, irrespective of its type, then the facility to create hierarchies in subject thesauri can be used elsewhere, for instance to record ‘whole’ and ‘parts’, so that a set of prints can be recorded both as a set and as individual prints which make up the set. Equally, the whole or parts do not need to be in the same data table so that, using the example of prints again, the whole might be the book from which prints which were not separately published have come. As any record in any table can be cross-referenced or related to any record in the same or any other table, terms in the various thesauri can be used to classify or index other records in the thesauri or records in any other table. Thus, when indexing buildings, the specific type of building, for instance hospital, theatre, church, etc., can be indexed as well as the more general purpose of building, such as public, religious, etc., the type of architecture, for churches the denomination, and any features of each building which might merit special mention, such as, in a church, the choir screen, the organ loft, the roof, the tower, etc. In this way, any reference to ‘tower’ can be traced via the subject thesaurus where it is part of any type of building, including even the Tower of Babel. Not only can relationships between records of any type be established, but terms in the subject thesaurus can be used to describe the relationship. 
Thus, Sir Joseph Paxton can be identified as the architect of the Crystal Palace, or Lord Boringdon, then the Prince of Wales, and Col. Dennis O’Kelly as the successive owners of an eighteenth century racehorse, Anvil. The subject thesaurus can be used to create controlled lists of the part names of any object, so that information about them may be entered where appropriate, but information cannot be entered about a part which cannot belong to the particular object or idea.

These features can be combined to provide the facilities that all users may require for cataloguing any object type or recording any other related information. All information is stored under the particular object or subject heading to which it relates, unlike many free text packages used for bibliography where the information is stored in the sequence in which it occurred in the original source document. So far, separate catalogues have been created for the following object types:

architectural fittings
articles
books
buildings or structures
ephemera
essays
exhibition, saleroom or dealer catalogues
fans
furniture
manuscripts
musical instruments
newspapers, magazines and other periodicals
paintings
photographs
picture frames
prints and engravings
site (of a building or of archaeological interest)
theses, etc.
water-colours and drawings

No programming at all has been required to create these catalogues, nor to provide the authority files (e.g. for artist, author, builder, etc.), nor for any other vocabulary control, such as special thesauri (for building types, for iconographical content, etc.), nor for creating any relationships between items in different catalogues (for instance, an engraving may have been engraved from a drawing or water-colour reduction prepared by the engraver or by another hand, and this in turn may have been taken from an original work of art, probably an oil-painting). Where there is more than one version of an oil-painting (which may or may not be by the same painter), these can all be related together, even if they have been given totally different titles.

When there is a need to create cataloguing facilities for a new object type, the basic facilities can be provided in about half an hour without programming, and the more complex requirements might take two or three hours to create, but still without any programming. Our own experience has certainly been that with use of the system we have wanted to change some of the facilities. This can be done without affecting existing data whilst the system is in use. Each catalogue entry consists of a small basic display and then a variable number of further elements of data about the object. The system is menu-driven, and special menus can be included at any point in the system to suit individual users, departments or operators. These can be used to control or limit access to just some of the data elements, and to allow different users to refer to the same data element as another user but under a different heading. As yet, we have not attempted to create a catalogue for any natural-science objects, but see no reason why this should cause any problems.

The facilities available for cataloguing and collections management of oil-paintings are typical of those which can be provided. Unlimited free text can be entered under various headings, such as attribution, history of the painting, related paintings or other versions, preparatory sketches and more general notes. Other headings can easily be added. The picture can be indexed under the painter, the designer, for a portrait the sitter, under a place depicted or an event portrayed, and under any iconographical items selected by the indexer. The painting can be cross-referenced to all collections in which it is known to have been, and all references to the painting and bibliography can be cross-referenced to the appropriate item in the book, the catalogue, article or ephemera catalogue, together with (where appropriate) volume and page numbers. The catalogue entry for the painting can be cross-referenced to any other objects of the same or of a different type. Where the painting itself is a source of information (e.g. the depiction of an event), then the items to which it relates can also be cross-referenced to it. Any photographs of a painting can be catalogued as separate objects and cross-referenced to the painting. Collections management information includes exhibitions, framing details, if necessary packing details, the normal location for the painting, the present location (it may be on loan to another institution), the complete conservation history covering inspections of the condition as well as actual conservation. All conservation work can be cross-referenced to the restorer concerned, so that it is immediately possible to see all work carried out on any object in date sequence. At the same time, a diary is automatically created in which all conservation work for all paintings is listed in date sequence so that a complete history of conservation of all oil-paintings in the museum can readily be seen.
Where any information about the painting is cross-referenced to any other object (a person, place, event, conservation record, photograph, etc.), these cross-referenced
records can be used as additional routes to the information in the painting catalogue. An inquirer can therefore immediately browse not only through the information about a particular painting but also through any related information, because of the total integration of the system.

STIPPLE has been in use seven days a week for over eighteen months. It is being used to create a catalogue of British prints, mainly of the eighteenth and early nineteenth centuries, together with related information, such as biographical records, bibliography, etc. About 200 000 records are now up on the system (not all are related to art history). These are all in one file which, so far at least, has never been destroyed since it was created in July 1982. A particular use which has thoroughly tested the system and proved its power in collating information has been in the writing of a book about George Stubbs and his son, George Townly Stubbs, including a catalogue raisonné of all known prints by and after Stubbs. Five people have been working on this book, all adding to and amending the database concurrently. All sources searched are entered so that all information found is immediately identified. If one of the researchers wishes to know whether a particular source has been checked and, if so, what information was found, he does not have to remember to ask his colleagues when he next sees them but can immediately find out the answers using STIPPLE. This helps to reduce duplication of effort and also to ensure, as far as possible, that all known sources are checked. STIPPLE has certainly allowed the information for the book to be gathered and collated much more quickly and, hopefully, more accurately than would have been possible with a manual system. In the process scholarly inference can be made from the data collated by STIPPLE. For instance, in the late 1760s and throughout the 1770s there was an active market in publishing and republishing prints after George Stubbs by other engravers. Then, in the early 1780s this virtually stopped until 1 May 1788, when a large number of prints engraved by Stubbs himself were published.
Whilst such observation could have been made without the use of a computer, it would have been very much more difficult, and a positive effort would have been needed, whereas in STIPPLE the information is automatically collated in the correct sequence, irrespective of the order in which it was entered. Looking at the publication history of prints by or after Stubbs automatically highlighted this interesting piece of information; unfortunately this is not the place to speculate on the significance of the 1780-1788 gap.

The use of STIPPLE to date has all been in-house, and has been a thorough test of the system. We have just completed a successful one month’s trial run at a major London museum. Within ten minutes of the terminal arriving, it was online back to our System/38 over a leased line which had been installed temporarily. There were no system or programming problems during the trial, and the curatorial staff who had never operated computers before soon learnt to explore the existing database and to add to it. The museum had provided no specifications of their requirements in advance, yet STIPPLE did have all the facilities that they required. If they decide to take STIPPLE as a service, or use the software on their own System/38, they will no doubt require minor amendments to suit their own particular needs.

STIPPLE has already solved all the problems raised at the Pisa conference. The solutions may not always be perfect, but they work well and they work today. Experience of using STIPPLE will no doubt stimulate new ideas, and the best of these will be implemented to improve the system further. STIPPLE is an open-ended system and because of its unique design we are unable to think of any information-handling request (ignoring, for the moment, the fields of artificial intelligence and expert systems) that we cannot implement with relative ease.