“Stuff” about “Stuff” — the differing meanings of “metadata”

Stuff about Stuff -the differing meanings of metadata

by Matthew J Dovey Oxford University Libraries Automation Service

Metadata is a complex subject and many of its complexities are extremely subtle To further complicate matters metadata has become an overloaded term and is often used in different contexts by different communities with different motivations It is very easy to overlook this diversity of motivation since these communities share many of their tools and problems This paper identifies three different such schools of metadata and discusses their history and motivations

Three schools of metadata

The term metadata is now very widely used within both information science and computer science In many ways the term metadata is yet another product of a recent tendency to create new names for old concepts (how many librarians have awoke to discover that overnight they had undershygone a Kafkaesque metamorphosis into a information scientist) Far worse the term metadata has become overloaded and means different things to different communities and in different contexts This is not that surprising since strictly speaking metadata means data about data or stuff about stuff (ie stuff) Also not surprisingly the various different meanings of metadata have enough in common for the casual observer to fail to miss the subtle differences 1 see three main schools of use of this term but I would not want to argue that these are the only three

The cataloguing school

The first school and 1 would hazard the oldest and perhaps largest school is that of the cataloguers Inevitable this is divided into two major camps -

those who do cataloguing and those who do metadata mdash with a small group who realise that actually both camps are really doing the same thing and it may be useful if they were to compare notes

The older of these two camps the cataloguers stem from the library world They have been creating metadata almost since the library was first invented although it is only recently that they have discovered that all this time they have been doing metadata (they naively thought they were cataloguing) In the last few centuries this metadata consisted of 5 by 3 catalogue cards and with the advent of computers in the last few decades the majority of this metadata has been in the form of M a r e 1 (MAchine Readable Catashylogue) records M a r e is a computer format devised in the late sixties as a means of transfershyring catalogue records (via magnetic tape) between different library systems Its origins are tightly linked with the heritage from the card catalogue and also the cataloguing rules the most common in the UK being the Anglo-American Cataloguing Rules (AACR now in its second major revision known as AACR22) Whilst strictly speaking it is a binary transfer format it can be rendered as a textual representation Cataloguers often edit records in this text format - originally using text editors - but now sometimes using more sophistishycated editors which have some knowledge of the format This is comparable to the move from text editors to the more sophisticated HTML editors for web authorship It seems quite common in metadata that a format intended for transfer sudshydenly becomes a format used for input display or even storage

A typical M a r e record consists of a number of fields with numeric labels from 0 to 999 Each field has a number of indicators (normally two) which refine the meaning of that field (comparable to the semantic operator in Dublin Core which I will mention shortly) Each field has a number of subfields indicated by an alphanumeric character Field 245 for example is used to indicate the title the first indicator is used to indicate whether there is an additional title statement and the title stateshyment is in subfield a (often shown in text as $a) with subfield b containing any additional title and subfield c the statement of responsibility and so on The 100 field subfield a contains the authors name and subfield d contains the authors dates and so on An extract from a typical record (in this case for a microfilm) is given in Figure 1

6mdashVINE 116

Stuff about Stuff - the differing meanings of metadata

100 1 $a Campe Joachim Heinrich $d 1746-1818

245 13 $a An abridgement of the new Robinson Crusoe $h microform $b an

instructive and entertaining history for the use of children of both sexes $c translated from the French embellished with thirty-two beautiful cuts

260 $a London $b Printed for John Stockdale opposite Burlington House Piccadilly $c MDCCLXXXIX [1789]

500 $a Original attributed to Joachim Heinrich Campe

Figure 1 - Atypical M a r e record

Campe Joachim Heinrich 1746-1818 An abridgement of the new Robinson Crusoe microform an instructive and entertaining history for the use of children of both sexes translated from the French embellished with thirty-two beautiful cuts

London Printed for John Stockdale opposite Burlington House Piccadilly MDCCLXXXIX [1789] Original attributed to Joachim Heinrich Campe

Figure 2 - Illustration of formatting information in US M a r e record

There are a number of items to note about the M a r e format and the AACR rules which dictate how to populate the records Firstly although this is not always recognised this is not a standard just for cataloguing books (as the above illustrates) Librarians have had to deal with a variety of objects from books to microfilm from broadsheets to videocassettes I know one librarian who was considering cataloguing her cassette recorders (used for language teaching) in M a r e so that she could make use of the librarys circulation system to keep track of them M a r e is hence a perfectly viable standard for creating metadata for digital objects Secondly the above example is in US M a r e Unfortunately the M a r e standard has diverged into a number of variants such as US M a r e UK M a r e UNI M a r e etc The differshyences are on the whole fairly insubstantial (the same information often occupying fields with different numerical tags between two variants) but also subtle enough to make conversion between the variants a complex process A major difference between US and UK M a r e which is worth remarking on at this stage is that US M a r e includes the punctuation needed for the presentashytion of the record For example removing all the field numbers indicators and $a $b etc field indicators from the example in Figure 1 above

would yield that in Figure 2 - which is more or less exactly how you would expect it to appear on a catalogue card or an online catalogue entry

UK M a r e on the other hand does not include the punctuation in the M a r e record but instead defines a set of rules indicating how punctuation should be added to the data when rendering it for presentation Thus the content is separated from how it should be displayed Unfortunately these rules are fairly complicated and also inconsistent in that it is not possible to write an algorithm which can realise the entire rule-set just an apshyproximation As we shall see there are a number of themes which we have touched upon in this discusshysion of M a r e which reoccur in metadata the question of binary versus textual formats the question of separating content from presentation an internal format for exchanging records between systems becoming a visible format for editing storing and manipulating records and the existshyence of multiple standards rather than the single standard

Returning to the school of cataloguing the second camp in this school is that of the computer scientists There is a growing recognition as the amount of electronic information increases and

VINE 116mdash7


as the World Wide Web continues to grow exponentially that some structure is needed to make this information manageable Hence many computer scientists are now looking at ways of constructing metadata to catalogue this informashytion

However many of these computer scientists have backgrounds in developing computer protocols and software and this alone is not sufficient for devisshying standards in knowledge management Possibly because of this background a typical approach to metadata is to concentrate on the format or syntax of the metadata (comparable to M a r e ) and to ignore the details of the content or semantics of the metadata ie to specify that there should be a location for author but to assume that the content is self-evident It is soon evident that this is not sufficient For example in the case of author questions such as how do you deal with variant spellings transliteration arise in the case of subject questions of controlled vocabulary (or in the cataloguing vocabulary authority control) arise It is precisely these questions of content or semantics above syntax that cataloguing rules such as AACR2 address Many computer scientist in this field seem to fail to recognise that what they are doing when creating metadata formats and rules is really cataloguing (in the same way that cataloguers fail to recognise that they are really doing metadata) and ignore the work and experishyence already down in this field For example consider the following opening extract from a draft proposal to the Internet Engineering Task Force (IETF - one of the groups defining standards on the internet operates)

Many music libraries music centers music publishers music shops and public need to share bibliographic musical records No standard format exists and exchanging musical records involves an important pre-andor post-processing of these data Searchshying sorting and cataloging music bibliographical records does not currently follow any standard3

I did pass this on to a few bemused music librarshyians and cataloguers Their main comment apart from pointing out the non-existent standards (such as M a r e ) which they used was that the proposal was too naive and simplistic to be of any real use The real problem here is not that there are no standards but that there are a number of standshyards and variants of standards Defining yet

another standard and claiming it as the standard is not going to work especially one which does not show a deep understanding of the problems inshyvolved A similar and worrying example is the latest work on the MPEG 7 format The group which is looking at metadata for searching audio and video almost entirely consists of digital audio and video processing experts with little representashytion from the music librarian or information scientist communities

Others in this camp do recognise the existence of cataloguers and cataloguing standards but either do not feel that they are applicable (being devised for non-electronic materials) or that cataloguing standards are too complicated again often concenshytrating on syntax rather than semantics Whilst I would argue that this is definitely not correct - the IETF draft standard mentioned above illustrates the danger of ignoring this wealth of information and experience in dealing with these problems -there is also an element of truth Librarians (and standards such as M a r e ) are used in dealing with a wealth of different media (the term multiple media encompassing both digital multimedia and physical media such as print cassettes film etc) beyond the medium of books and so have already encountered many of the problems involved in dealing with information in a variety of formats However they do underestimate the problems involved with electronic information which do not always fit into existing practice without modificashytions The rules which underpin cataloguing such as AACR2 are complicated - AACR2 consists of over seven hundred pages of fairly terse text However many of these rules have been justified by experience and attempts at a back to basics simplistic approach soon start finding themselves re-adding such complications in order to deal with real situations That is not however to say that AACR2 could not do with some revision Many rules are at a fairly nit-picking level such as the position of punctuation Many others owe their being to their card-catalogues origins or were needed to accommodate some quirk of a library system in the 1960s and really should not belong to a standard for todays computer systems Also the time needed to change or modify the standard (as is the case with many standards) is too slow compared to the rapid changes in todays technolshyogy and ever diminishing Internet time scales

I could not leave this camp without mentioning the terms RDF and Dublin Core and some of the related issues Dublin Core4 is an initiative which

8 mdashVINE 116


Content Title Subject Description Type Source Relation Coverage

Intellectual Property Creator Publisher Contributor Rights

Instantiation Date Format Identifier Language

Table 1 - Core attributes for Dublin Core

began a few years ago at a workshop in Dublin Ohio (hence the name) Its aim is to create a core standard for resource description It began by trying to determine the essential data needed to generically describe resources Eventually it devised the set of core attributes shown in Table 1

Deriving such a list of the core essential attributes for all resources is not a trivial exercise and possibly of debatable value (a quote attributed to Cliff Lynch is different communities look at different things in different ways) As such the list has been revised a number of times However even the Dublin Core community recognises that as it stands such a list is not sufficient for any but the very simplest applications To accommodate the complexity of the resources it tries to describe the Dublin Core has a concept of the semantic indicator (or operator) For example the Creator attribute has a semantic indicator to determine the type of creator (eg author composer arranger editor etc) The intention here is that someone with knowledge of the subject area will know which semantic indicators to apply to refine the search whilst someone without knowledge of the area (or searching across a range of areas) will still be able to do a less sophisticated search (this concept of having different layers of complexity is taken further in RDF) However implementing this is not without its difficulties for example should Publisher be a separate attribute or an operator on Creator

Just defining the attributes even with the added layers of the semantic indicator does not fully address the problems of describing resources It is important that the contents of these attributes follow standard rules and formats Considering the Creator entry questions can arise such as should the forename precede or follow the surname how do you deal with variant names (Shakespeare allegedly spelt his name eleven different ways and many old English names now have a modern

English corruption) or transliteration how do you deal with corporate names where the corporation has changed its name and so on For subjects ideally a controlled vocabulary should be used -otherwise a resource could fail to be found because the subject was defined as red blood cells rather than haemoglobin The Dublin Core community is addressing these issues but in doing so the standards are beginning to approach the levels complexity for which cataloguing standards such as AACR2 are criticised Many of the solutions are not that far removed from those adopted by years of practice by the cataloguing community

The RDF (Resource Description Framework)5 is a more ambitious initiative of the World Wide Web Consortium (a group for approving standards for the World Wide Web) RDF defines a standard for defining metadata standards In the case of Dublin Core it provides the storage syntax in very much the same way that M a r e does for AACR2 although there are differences in the nature of these relationships RDF is not limited to represhysenting Dublin Core and Dublin Core is not limited to being representing in RDF (the same can however be said about AACR2 and M a r e ) Unlike M a r e however RDF also has mechashynisms for referencing a standards relationship to other standards The goal is that an application does not necessarily need to know about a particushylar standard or version of a standard Given an unknown standard it can deduce various things from its definition in RDF and its relationship with other known standards This is a somewhat more ambitious layering than that achieved by semantic operators in Dublin Core but very similar in concept - for example in Dublin Core an applicashytion not knowing the meaning of Composer can deduce some things from the fact that this is a specific case of Creator RDF does not address the problem of proliferation of standards to do the same job (in fact it somewhat encourages it) nor does it define how to enforce the links between

VINE 116 mdash 9


similar standards In this respect it is debatable how far the goals of interoperability can be achieved as information systems like any other systems obey the fundamental rules of thermodyshynamics - namely that you do not get more out than you put in

The structuralist school

RDF is implemented as an application of yet another metadata standard XML (extensible Mark-up Language) Mark-up Languages are really the invention of the second school of metadata which I shall term the Structuralist School Like the cataloguing school this is a fairly old school and has its origins in a much older discipline dating from before the computer The Structuralist School is a product of the humanities disciplines in particular textual analysis As scholars of the humanities began looking at computer based tools to aid their researeh and teaching they soon realised that simple streams of characters or words were not sufficient to capture the complexities or techniques of grammatical linguistic and literary analysis Their work revolved around regarding the text (in the sense of a piece of literary work) on a

variety of hierarchical level and also considering links both within the text and between different texts

This lead to the invention of the mark-up language - the ability to embellish a simple ASCII text file with additional comments and hints as to its underlying and interpretive structures SGML (Standard Generalised Mark-up Language - ISO 8879)6 was invented in the 1970s as a definition of the syntax and rules for defining tags indicating these structures and how the tags should be used within a document These tags can be used for a variety of purposes from delineating grammatical structure (such as paragraphs verb etc) to indicatshying alternative readings or translations This leads to the Text Encoding Initiative (TEI)7 which aims to define an SGML application for marking up texts Figure 3 is a sample extract

It should be noted that some of the TEI tags (which form what is known as a TEI header) describe the document as a whole (for example the DOCTITLE in the example above) In this respect TEI overlaps with the cataloguing school

SGML is one of a family of standards which are used in expressing TEI and other standards for

10 mdash VINE 116


mark-up The other members of this family are HyTime8 which is used for defining different kinds of links between and within documents DSSSL9

which provides mechanisms for defining a script which determines the presentation of an SGML document and HyQ (technically part of HyTime) which defines a language for searching and navishygating a SGML structure I should also mention at this stage MPEG 710 which is currently trying to achieve something similar to TEI but for audio and video data to illustrate that the general techshynique is not just applicable to textual data

However more recently focus has been on a relative newcomer to the mark-up scene namely XML (extensible Mark-up Language)11 Its history is somewhat interesting as it highlights a number of the reoccurring themes in metadata mainly the problem of separating content from presentation and that the attempt to build a simple standard tends to result in rediscovering complexshyity The life-blood of the World Wide Web HTML (HyperText Mark-up Language) was an SGML defined language for writing documents Initially in the early versions the tags used for mark-up defined the structure of the document (tags to indicate paragraphs headings of different levels titles etc) The presentation of these tags was left to the browser and in many cases this was definable by the user As the Web grew and especially as commercial interests took hold content creators wanted a greater control over the presentation of their information It is no coincishydence therefore that most of the recent additions to HTML have been tags defining presentation rather than content Running parallel to this is the rapid growth of the Web and the need for software to help manage this information (such as search engines) However it is very difficult to manipushylate HTML documents where the bulk of the mark-up is aimed at its presentation rather than content and meaningful extraction of content has to be done using heuristic algorithms The purpose of XML is to allow the mark-up of a document according to its content rather than its presentation and there is a standard under development called XSL (extensible Style Language) which define how to specify a set of rules describing how to present the document At a recent conference on XML Brian Reid who invented an early text processing tool called SCRIBE was heard to comment that if the aim was to separate content from presentation then XML had already lost - the basis for this comment being that this had been

attempted before but human nature had defeated the effort

Reid was in fact referring to the element of reinventing the wheel as regards the development of XML XML itself is very close to SGML It is almost a subset although there are a few subtle differences - none important enough that an entire new standard is preferable than a modification to the existing SGML standard XSL although very different from DSSSL performs the same task XHL (extensible Hyperlink Language) is functionshyally a subset of HyTime (and almost a subset in terms of syntax) A proposed standard XQL (extensible Query Language) performs the same task as HyQ A justification for yet another standshyard is that the SGML family of standards are too complicated (there are very few HyTime impleshymentations for example) However just as in the cataloguing school of metadata the limitations of a simpler standard are soon discovered as it is used in earnest and the complexity of the simshypler standard slowly increases and begins approaches the complexity of the original

The data-structure school

XML brings us to the final of my divisions into three schools of metadata This is a fairly new school which sees XML as a universal language for defining data structures The origins of this school are in the main based on the rise of the Internet The problems of content versus presentashytion are most obvious when accessing a database resource through a web-based interface Normally the interface is designed with the presentation of the resulting data in mind However many users wish not only to view this data but to also to use and manipulate this data - hence the move to separating the data from the presentation using XML and XSL XML is also being viewed as a suitable exchange format between databases The comma delimited textual format is no longer suitable for the complex data structures that modern databases (such as multimedia and object databases) have to handle The Meta Data Coalishytion12 is an initiative various vendors looking to develop standards for precisely this purpose Their aims are somewhat ambitious (similar in style to the RDF initiative) in that they also hope to use an XML basis for defining and exchanging the semantic structure of the database as well as the data content

VINE 116mdash11


However XML is also now being viewed as a standard for editing manipulating and storing data in the same way that M a r e has been corrupted from a purely exchange format in the bibliographic world Whilst editing XML in a text editor does allow a large amount of flexibility but there are more appropriate mechanisms for editing data which allow a far greater amount of verification Also XML text is not a particularly efficient means of storing data either in terms of space or searchshying There is a camp within this school who religiously regard text formats as good whilst binary formats are bad This probably stems from the traditions of the Web where the main protocol is a now heavily overloaded text transfer protocol (HTTP - HyperText Transfer Protocol13) and most server scripts are written in text processshying languages such as Perl14 This school regards XML as a universal language for expressing data structures This is subtle difference to the Meta Data Coalition which sees XML as a language for describing data structures but not necessarily that those structures should be realised in XML This concept of a universal language for notating data structures is not new There is an ISO standard which is intended to do this namely ASNl15 ASNl however produces formats which are binary in nature and such are an anathema to this particular camp There is a fallacy that textual formats are less proprietary because they can be edited in a text editor However a complex XML document although it can be opened in a text editor may be unintelligible without a well-annotated document explaining the semantics and usage of the tags and the structure in the same way a binary document can be opened in a hex editor but is unintelligible with documentation of its structure Binary formats in certain situations have a number of advantages They can be more efficient in storage and processing They are less prone to abuse and can lead to more readable and easily maintained software code However this is not really an eitheror divide - both textual and binary formats have a role to play depending on the context The strong movement for text-based formats however has lead to an initiative to develop a means of expressing ASNl over XML16 mainly so that traditional non-textual and non-XML protocols such as Z395017 can be made more attractive to the XML and Internet communishyties (but not to add any real functionality or benefits) There are similar initiatives with other traditionally binary-based computer communicashytion protocols such as CORBA (Common Object Request Broker architecture)18

Bringing it together

Metadata means different things to different communities I have identified three fairly distinct (albeit interconnected) such communities Alshythough these communities have different aims and different motivations the tools and standards that are used are often the same Perhaps more surprisshyingly the problems and history of the different communities are also comparable There is a tendency to re-invent the wheel often in an atshytempt to produce something simpler but only to discover that the complexity is really needed This and other factors leads to a number of standards to do the same job and attempts to invent a new standard to be the standard just proliferate the problem We often see a standard intended as an interchange or internal format become a visible standard for data entry manipulation and storage There is a tendency to concentrate on the problem of structuring metadata and its syntax rather than the complicated problem of the semantics that should govern its content There are the interplays of textual versus binary formats and the desirabilshyity and possibility of separating the content from its presentation Finally the similarities of the problems and techniques of the different communishyties can often lead to confusion between them if they do not realise that they are trying to achieve different things

References

1 M a r e Standards Office httpwwwlocgovMare

2 Anglo-American Cataloguing Rules 2nd Edition 1998 Revision Library Association Publishing ISBN 1856043134

3 A Bibliographic Format for Music Media AVan Kerckhoven httpwwwietforg internet-draftsdraft-avk-bib-music-rec-0ltxt

4 Dublin Core Metadata Initiative http purloclcorgdc

5 Resource Description Framework http wwww3 orgRDF

6 Goldfarb Charles F The SGML Handbook Edited and with a foreword by

12mdashVINE 116


Yuri Rubinsky Oxford Oxford University Press 1990 Extent 688 pages ISBN 0-19-853737-1 (includes the text from ISO 8879) (see also httpwwwoasis-openorgcover sgml-xmlhtml)

7 Text Encoding Initiative Homepage http wwwuiceduorgstei

8 HyTime HypermediaTime-based Structuring Language (HyTime) 2nd Edition ISO 107441997 (see also http wwwhytimeorg)

9 Document Style Semantics and Specification Language ISOIEC 101791996

10 MPEG 7 httpdrogocseltitmpegstandardsmpeg-7 mpeg-7htm

11 Extensible Mark-up Language httpwwww3 orgXML

12 The Meta Data Coalition httpwwwmdcinfocom

13 HyperText Transfer Protocol httpwwww3corgProtocols

14 Perl httpwwwperlorg

15 ISO 8824 mdash Information Processing Systems - Open Systems Interconnection -Specification of Abstract Syntax Notation One (ASN1) 1990

16 XER Initiative httpasfgilsnetxer indexhtml

17 Z3950 Maintenance Agency http www locgovz3950agency

18 OMG CORBA HomePage httpwwwcorbaorg

Contact details Matthew J Dovey Libraries Automation Service Oxford University 65 St Giles Oxford OX1 3LU Tel 01865 278272 Email matthewdoveylasoxacuk

VINE 116mdash13


100 1 $a Campe Joachim Heinrich $d 1746-1818

245 13 $a An abridgement of the new Robinson Crusoe $h microform $b an

instructive and entertaining history for the use of children of both sexes $c translated from the French embellished with thirty-two beautiful cuts

260 $a London $b Printed for John Stockdale opposite Burlington House Piccadilly $c MDCCLXXXIX [1789]

500 $a Original attributed to Joachim Heinrich Campe

Figure 1 - Atypical M a r e record

Campe Joachim Heinrich 1746-1818 An abridgement of the new Robinson Crusoe microform an instructive and entertaining history for the use of children of both sexes translated from the French embellished with thirty-two beautiful cuts

London Printed for John Stockdale opposite Burlington House Piccadilly MDCCLXXXIX [1789] Original attributed to Joachim Heinrich Campe

Figure 2 - Illustration of formatting information in US M a r e record

There are a number of items to note about the M a r e format and the AACR rules which dictate how to populate the records Firstly although this is not always recognised this is not a standard just for cataloguing books (as the above illustrates) Librarians have had to deal with a variety of objects from books to microfilm from broadsheets to videocassettes I know one librarian who was considering cataloguing her cassette recorders (used for language teaching) in M a r e so that she could make use of the librarys circulation system to keep track of them M a r e is hence a perfectly viable standard for creating metadata for digital objects Secondly the above example is in US M a r e Unfortunately the M a r e standard has diverged into a number of variants such as US M a r e UK M a r e UNI M a r e etc The differshyences are on the whole fairly insubstantial (the same information often occupying fields with different numerical tags between two variants) but also subtle enough to make conversion between the variants a complex process A major difference between US and UK M a r e which is worth remarking on at this stage is that US M a r e includes the punctuation needed for the presentashytion of the record For example removing all the field numbers indicators and $a $b etc field indicators from the example in Figure 1 above

would yield that in Figure 2 - which is more or less exactly how you would expect it to appear on a catalogue card or an online catalogue entry

UK M a r e on the other hand does not include the punctuation in the M a r e record but instead defines a set of rules indicating how punctuation should be added to the data when rendering it for presentation Thus the content is separated from how it should be displayed Unfortunately these rules are fairly complicated and also inconsistent in that it is not possible to write an algorithm which can realise the entire rule-set just an apshyproximation As we shall see there are a number of themes which we have touched upon in this discusshysion of M a r e which reoccur in metadata the question of binary versus textual formats the question of separating content from presentation an internal format for exchanging records between systems becoming a visible format for editing storing and manipulating records and the existshyence of multiple standards rather than the single standard

Returning to the school of cataloguing the second camp in this school is that of the computer scientists There is a growing recognition as the amount of electronic information increases and

VINE 116mdash7









8 mdashVINE 116











VINE 116 mdash 9









10 mdash VINE 116









VINE 116mdash11





References







12mdashVINE 116
















VINE 116mdash13









8 mdashVINE 116











VINE 116 mdash 9









10 mdash VINE 116









VINE 116mdash11





References







12mdashVINE 116
















VINE 116mdash13











VINE 116 mdash 9









10 mdash VINE 116









VINE 116mdash11





References







12mdashVINE 116
















VINE 116mdash13









10 mdash VINE 116









VINE 116mdash11





References







12mdashVINE 116
















VINE 116mdash13









VINE 116mdash11





References







12mdashVINE 116
















VINE 116mdash13





References







12mdashVINE 116
















VINE 116mdash13
















VINE 116mdash13

Documents

“Stuff” about “Stuff” — the differing meanings of “metadata”