Case History: Library of Congress Audio-Visual Prototyping Project
METS Opening Day, October 27, 2003
Carl Fleischhauer, Office of Strategic Initiatives, Library of Congress, cfle@loc.gov
The AV Project
Preservation, sense one: reformatting into digital-file form
Preservation, sense two: sustaining digital objects
Participation by the Motion Picture, Broadcasting, and Recorded Sound Division (M/B/RS) and the American Folklife Center
Reformatting Documentation
About the source: the original disc or tape being reformatted
About the process: how the copy file was made, what devices/tools
About the outcome: characteristics and features of the copy file
Reference Model for an Open Archival Information System (OAIS)
[Diagram: Producers and Consumers flank the OAIS functions of Ingest, Archival Storage, Data Management, Administration, Preservation Planning, and Access]
SIPs (Submission Information Packages) will be produced by the AV preservation activity, ready to submit to LC's future digital repository.
AV Project Web Site Home Page http://lcweb.loc.gov/rr/mopic/avprot/
AV Project Extension Schema Page http://lcweb.loc.gov/rr/mopic/avprot/metsmenu2.html
AV Project Initial Data Capture System
MS-Access Database - Collation Input Screen
Top level: work
Second level: sound recordings
Third level: disc sides
Fourth level: cuts
Workflow Sidebar
Recorded Sound Processing Section: content selected for reformatting
1. Initial creation or copying-in of metadata
LC Recording Lab or offsite contractor: scanning activity
2. Creation of second layer of metadata
3. Return loop to processing: edit and possible addition of third layer of metadata
The AV METS System Today
OUTCOME ONE: A VIRTUAL DIGITAL OBJECT (SIP)
Logical storage structure based in a UNIX filesystem. Index of master/afc/afc1941001/sr05:
master -- family of logical directories where the master files are stored (there is a parallel set of service directories)
afc -- owner is the American Folklife Center
afc1941001 -- group or aggregate of items, often from an actual collection
sr05 -- item directory (at the level of the digital object, counterpart to a bib record or line in a finding aid)
sr05am.wav -- the master file for side A of this disc
sr05bm.wav -- the master file for side B of this disc
OUTCOME ONE: VIRTUAL DIGITAL OBJECT
The fileGrp segment of a METS instance binds the object.
Includes logical pathnames for files; a future switch to persistent names is possible.
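In METS terms, that binding might look like the following hedged sketch. The filenames follow the sr05 example elsewhere in this deck; the IDs and the side-B name sr05bm.wav are assumptions, not taken from an actual project instance:

```xml
<mets:fileSec xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- master family; a parallel fileGrp would list the service copies -->
  <mets:fileGrp USE="master">
    <mets:file ID="FILE001" MIMETYPE="audio/x-wav">
      <mets:FLocat LOCTYPE="URL"
                   xlink:href="master/afc/afc1941001/sr05/sr05am.wav"/>
    </mets:file>
    <mets:file ID="FILE002" MIMETYPE="audio/x-wav">
      <mets:FLocat LOCTYPE="URL"
                   xlink:href="master/afc/afc1941001/sr05/sr05bm.wav"/>
    </mets:file>
  </mets:fileGrp>
</mets:fileSec>
```

Because the FLocat pathnames are logical, a later move to persistent identifiers would only touch the xlink:href values, not the object structure.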
OUTCOME TWO: PRESENTATION OF OBJECT
Presentation in Browser
Zoom on Image in Presentation
Interim username/password access management
In the Presentation: Metadata Map for the Dedicated
sourceMD data from the Metadata Map
Extension schema content displayed as name-value pairs
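In the METS instance itself, extension-schema content of this kind is typically carried in an amdSec/sourceMD section via mdWrap. A hedged sketch; the element names inside xmlData are illustrative, not the project's actual audio-source schema:

```xml
<mets:amdSec xmlns:mets="http://www.loc.gov/METS/">
  <mets:sourceMD ID="SMD001">
    <!-- MDTYPE="OTHER" signals a local extension schema -->
    <mets:mdWrap MDTYPE="OTHER" OTHERMDTYPE="AUDIOSOURCE">
      <mets:xmlData>
        <!-- illustrative name-value pairs for an analog source item -->
        <mediaType>audio cassette</mediaType>
        <playbackSpeed>1 7/8 ips</playbackSpeed>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:sourceMD>
</mets:amdSec>
```

A stylesheet can walk the xmlData children generically, which is what makes the name-value-pair display possible without knowing the extension schema in advance.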
Generator takes data from the database and makes METS XML
Snapshot of the database back end
Selection from the database diagram: tables for METS id, agent information, and structMap data
Selection from the database diagram: tables for extension schema data for image source, video source, and audio source
Selection from the database diagram: tables for digiProv (digitization process) information
Builder: the data-entry front end to the database
Builder: template making tool
Builder: tool to shape a structMap using indent, outdent, up, and down. May be used in both template and individual object modes.
Cut wizard: a "twenty more like this one" tool
Part of MODS descriptive data for a recorded interview with a former enslaved person.
File Association Tool
Tool to append a MODS record
Two samples from the MODS entry and editing tool.
+ repeats the section; x deletes sections or subsections
Selection from the online data dictionary
Some METS objects, by title
Administration Tool Menu
Example of data entry screen
Blue terms are used to select separate data entry screens
Some Shortcomings
Cumbersome data entry: many screens, many actions
Bugs: hard to get them all fixed now that the contractor is gone
Best if users understand METS and the structMap: a barrier to entry for new team members
Does not include tools for bulk compilation from pre-existing data
Distributed Data Entry
Hoped-for future: each team enters its own data in less cumbersome local tools
A tool for descriptive data, especially copying in and out of the ILS
A tool for data about the source item and certain technical aspects, copied in and out of MAVIS
A tool for digiProv data, the engineers' form
A tool or a MAVIS extension to encode the structMap
Supporting Tools
To support the hoped-for future:
A centralized tool to gather and compile the various XML data units into a METS instance
A facility to manage the METS XML documents
Fiddling our way to the future? Listen for hints in Corey Keith's talk tomorrow . . . .
That's all for this talk today. Thank you!
Greetings. This is a two-part case history. Part one is a story about the project and its development, and part two takes a look at the METS-making tool that is in place today.

This project is strongly oriented toward preservation, meaning both reformatting older physical materials and sustaining the digital result, and that accounts for the project's particular shape. Two special collections divisions have participated: M/B/RS and AFC.

We have a high interest in documentation and wanted data about the source item, i.e., the entity that was being reformatted; about the process, i.e., how the reformatting was carried out; and about the outcome, i.e., the details of the digital file that reproduces the original item. As a result, we have tried to capture quite a bit of metadata.
Regarding preservation in the other sense--sustaining content once in digital form--we want to orient ourselves to the OAIS reference model for a digital-content repository. Our project didn't plan to build a repository, but we did want to produce a digital object that was as ready to submit to one as possible. (The OAIS term for this is SIP, or submission information package.)
We started in late 1999, and by November 2000 the project had taken enough shape for me to give a talk at the DLF forum in Chicago. I reported that our group was taken with the MOA2 (Making of America 2) metadata model, then limited to images and texts. Jerry McDonough (NYU) and Mackenzie Smith (then Harvard, now MIT) collared me at that meeting and explained that they were thinking about expanding the MOA2 structure and also finding ways to embrace audio. Would LC wish to join that effort, they asked. I said that we surely would.

So we did join--the LC group was led by Morgan Cundiff. Dick Thaxter from the Motion Picture, Broadcasting, and Recorded Sound Division and I joined in during some of the early meetings. During 2001-2002, with the help of our contractor, User Technology Associates, the AV project was able to sketch out some extension schemas.

Some of this extension-schema work has been taken and improved by others to give us the METS-community-endorsed schemas for descriptive, image, and text metadata. Meanwhile, with a great debt to David Ackerman of Harvard University's Library Digital Initiative and the Audio Engineering Society, we cooked up and continue to use our own working schema for audio and digiProv metadata. We've also got one for video but have not yet put it to work.

We use a relational database to capture the data, which is then output as XML. Our contractor cooked up some early data-capture software in MS-Access, which--how shall I say--taught us enough lessons to set the stage for the construction of the software we have today. One big lesson concerned the need for considerable recursion in the structMap. Our first stab at a relational database to capture the data only gave us three levels, what we called the work, the type, and the subtype. But we found that we wanted n levels, i.e., an indefinite number of levels.
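The structMap itself has no such limit, since div elements nest recursively to any depth. A hedged sketch of a four-level disc object; the TYPE and LABEL values here are illustrative, not the project's actual vocabulary:

```xml
<mets:structMap xmlns:mets="http://www.loc.gov/METS/" TYPE="logical">
  <mets:div TYPE="work" LABEL="Phonograph album">
    <mets:div TYPE="soundRecording">
      <mets:div TYPE="discSide" ORDER="1" LABEL="Side A">
        <mets:div TYPE="cut" ORDER="1"/>
        <mets:div TYPE="cut" ORDER="2"/>
      </mets:div>
      <mets:div TYPE="discSide" ORDER="2" LABEL="Side B">
        <mets:div TYPE="cut" ORDER="1"/>
      </mets:div>
    </mets:div>
    <mets:div TYPE="image" LABEL="Album cover"/>
  </mets:div>
</mets:structMap>
```

Supporting this in a relational database means the table that backs the div hierarchy must allow a row to point at a parent row of the same type, rather than fixing three named levels.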
For example, the reproduction of a phonograph album includes both sound recordings and entities like images, and easily gave us four levels. The work is the parent of the sound recording division (and some image elements). In turn, the sound recording division is the parent of disc sides, which are in turn the parents of cuts. So our three-level limit was frustrating to us.

Sidebar on workflow: As the preceding examples suggest, our work has been focused on recorded sound collections. For the M/B/RS Division, the items to be reformatted start their digital life in the Recorded Sound Processing Section, where they are prepared and cataloged (if not previously cataloged), and where some conservation work takes place. Some of our METS metadata is first inscribed here or, if it pre-exists, is copied into the data set.

Then physical materials go to the M/B/RS Recording Laboratory or to an outside contractor for digitizing. When images are to be made, there is often a separate imaging loop in the process. Additional METS metadata is added as a result of these activities.

Then the originals and the digital reproductions come back to (or are made accessible to) the processing section, which adds to or corrects the final METS metadata. Throughout the data-entry design process, we were considering how to serve this multi-location, multi-layer workflow.

OK, that's the history and context--now let me talk about the data system at work today. Let me start with the outcomes, and work backwards to the data capture system. We have two final outcomes.
First is our SIP, our package for preservation. At this time, we create virtual objects. This snapshot shows a UNIX filesystem directory for the American Folklife Center (actually a logical view; the real filesystems have slightly different names). "master" and "afc" tell us that this is the storage location for the master versions, in this case owned by the AFC. Then there is an aggregate directory (afc1941001) for a group of related items, and an item directory, which is where the files are. In this case, item sr05 (sound recording number 5 in the group) is a disc with an "a" and a "b" side, as reflected in the filenames. This general approach has been inherited from American Memory, with the refinement of adding separate systems for masters and service copies. Meanwhile, the XML metadata for each item is stored in another UNIX directory.

The fileGrp information in METS consists of an inventory and location list for the files, and this is what enables us to package the object in a virtual sense.
The second outcome is a presentation of the object. Here is an example, with a tip of the hat to our colleagues at Berkeley and the old MOA2 project. The viewer uses a Java Server Pages approach, and an XSLT stylesheet transforms the XML into HTML. In the left frame, we have the METS structMap, represented in a manner like the tree in the left frame of Windows Explorer.
The right frame is used to present the content proper. You can zoom in on images, listen to audio, and so on. This is the J-card from one of about 100 audiocassettes sent by songwriters to the Copyright Office during and after Desert Storm, to protect their new musical compositions.

Since much of the content is protected by copyright, or by ethics-based relationships to folk performers, we want to limit access to a very few locations. For now, we have a provisional, interim system to accomplish this. In order to see images or hear audio, a login is required, which works in conjunction with an .htaccess file in the Apache web server. Without a login, you can see only the metadata; the pointers to the actual files and the files themselves are suppressed.

The presentation includes what we call the metadata map, a way for those who are really curious to get at all of the metadata we provided in this preservation project. Here is the data for the audiocassette tape recording used as a source for a digital item. The display consists of name-value pairs.

The values in the XML instance come from a database that I'll describe in a moment. The actual XML is compiled by a tool called Generator. Generator is another Java Server Page that pulls data from the tables in the database and makes the XML. Here, three items have been selected to generate METS XML.

Now, the database. Here's a blurry snapshot of the table layout. This database replaced the old MS-Access db. It is built in Oracle and has more capabilities, including support for n-level recursion in the structMap. There is also multi-user support, and we put in fields for all of the extension schemas we had our eye on. This accounts for the complexity of the table structure. (This diagram may be one generation behind the latest version, and some details may have changed.)

At the time we developed this approach, we were drawn to the idea of a central, robust database.
We felt that it would serve as a data warehouse, a place where updating would be possible, for example. We also took into consideration the fact that different work teams enter data in an incremental way, and we believed that as each team added its chunk of data, it could see what had been entered by teams earlier in the production chain.

There are tables (or groups of tables) that map to extension schemas. For example, here are three that hold the information for items (analog or digital) that serve as the source for images, video, or audio. Here is the family of tables that holds the data for digiProv, where a digitization process may use multiple devices or tools, each with its own settings.

The data entry system--nicknamed Builder--also uses scripts and Java Server Pages, which are themselves fairly sophisticated and complex. It is full of good ideas, and is capable of constructing complex, multi-part objects. There is a template-making capability, for when you have fifty of the same type of object.
And there is a reasonably facile tool with indents and outdents to make the structMap, whether you are making a template or entering data for an individual object. A "twenty more like this one" feature covers cuts on a disc side, or pages in a booklet; for numbered items, it will increment each instance by one. It is worth saying, however, in spite of the various efficiency tools, that the system is strongly oriented toward one-at-a-time data entry.

The technical metadata for the files is automatically generated by a web service that also creates an MD5 checksum for the file. All of this data is automatically inserted in the correct slots in the database when the files are associated with the metadata, as shown here. But a data-entry person has to select each file in conjunction with the correct node in the structMap.

There is also a capability to append or attach certain kinds of data, like cataloging information from our OPAC. First a separate tool takes the MARC record and makes a MODS record out of it, which is then incorporated into...