Upload
jeffrey-wood
View
214
Download
0
Embed Size (px)
Citation preview
JH VE2
JHOVE2A Next-Generation Architecture for
Format-Aware CharacterizationBritish Library, 1 October 2008
Stephen AbramsCalifornia Digital Library
Sheila MorrisseyEvan Owens
Portico
Tom CramerKeith JohnsonStanford University
JH VE2
Thanks to…
• Koninklijke Bibliotheek our meeting sponsor
• British Library our meeting host
• Library of Congress our funding agency
JH VE2
Agenda
• Introductions
• Review of objectives and agenda
• Project goals, deliverables, and schedule
• New terminology
• Functional requirements
• Assessment
• Technology options
• Community engagement
• Next steps
JH VE2
Introductions
• Who are you?
• Where are you from?
• What is your level of involvement with JHOVE?
JH VE2
Objectives
• Getting feedback on the utility and achievability of project goals, deliverables, and schedule
• Introduce new terminology and concepts
• Refine functional requirements
• Understand what assessment means in preservation workflows
• Propose community engagement plan
• Gauge the potential for development partnerships
• Solicit interesting test data
JH VE2
Why JH VE2?
• Preservation requires management of the gap between what you were given and what you need
• That gap is only manageable if it is quantifiable
• Characterization tells you what you have, as the starting point for iterative preservation planning and action
Adopted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 2:1 (June 2007): 3-11.
Characterization
Preservation action
Preservation planning
JH VE2
JH VE2 project
• A next-generation architecture for format-aware object characterization
– Three-fold goals:
• Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement
• Provide significant new function
• Implement modules
• Collaborative project of CDL, Portico, and Stanford University
– Funded by Library of Congress/NDIIPP– Open source BSD license
JH VE2
New function
• Complex object data model
• Generic plug-in interface
• Common data structure passed between modules to enable stateful processing
• Identification de-coupled from validation
• Standardized handling of format profiles and error reporting
• Symbolic display of binary formats
• API-level support for editing
JH VE2
Format support
• Based on project partner requirements and budgetary constraints
– Image: JPEG 2000, TIFF– Audio: WAVE– Text: SGML, UTF-8, XML– Document: PDF– GIS: Shapefile– Color: ICC– And their well-known variants, e.g. TIFF/IT, TIFF/EP, GeoTIFF,
EXIF, DNG, …
• Unfortunately precluding some JHOVE-supported formats
– AIFF, GIF, HTML, JPEG
JH VE2
Schedule
• Months 1-6 Outreach, design, and prototyping
• Months 7-9 Core APIs and framework
• Months 10-24 Module implementation
JH VE2
Terminology
• What is it?– Identification Determining presumptive format through
signature matching
• What is it, really?– Validation Determining conformance to commonly-
accepted normative requirements
• What about it?– Feature extraction Reporting intrinsic properties significant
to preservation planning and action
• What should you do with it?– Assessment Determining acceptability for a given
purpose on the basis of locally-defined policies
JH VE2
Objects, not files
• JHOVE assumed 1 object = 1 file = 1 format
• But what about…
– TIFF with embedded ICC profile and XMP metadata
1 object = 1 file = 3 formats
– JPEG 2000 JPX fragmentation
1 object = n files = 1 format
– ESRI Shapefile
1 object = 3 files = 3 formats
• JHOVE2 will support 1 object = n files = m formats
JH VE2
Objects, not files
ICC
XMP
TIFF
abcd.tif
dBASE IV
1234.dbf
SHP
1234.shp
SHX
1234.shxShapefile
SHPdBASE IV SHX
Source units Reportable units
TIFF
XMPICC
JH VE2
Functional requirements
• JHOVE2 will provide a highly configurable, extensible, and scalable framework for preservation characterization
• JHOVE2 can process an arbitrary number of source units during a single invocation
• JHOVE2 function will be encapsulated into granular plug-in modules that can be installed in and invoked by the framework
• The output of characterization will be expressed in an intermediate XML form, with further presentation via XSL
• JHOVE2 can be invoked through an API or as a stand-alone application with a command line interface
JH VE2
Identification requirements
• Based on extrinsic hints and internal/external signatures
– Aggregate-level identification signatures are based on file-level properties
• Identification may be possible for formats than feature for which JHOVE2 feature extraction or validation is not supported
• Identification will be reported in terms of level of confidence
• Identification will be reported in terms of all known common names and public identifiers
JH VE2
Validation requirements
• Conformance reported in terms defined by the format itself
• The applicability of ambiguous conformance requirements will be under local configuration control
• Validation will be exhaustive, documenting all identified violations (not merely the first)
• Ability to distinguish syntactic and semantic conformance
• Error and informative messages will reference specific passages in the relevant specifications
JH VE2
Feature extraction requirements
• Control over the granularity of parsing and reporting will be local configuration option
• General reportable properties will include:
– Source unit name and modification date– Reportable unit size and offset– Format identifiers and version– Format profile
• Genre- and format-specific properties will be reported in terms of well-known data dictionaries and schemas
– ANSI/NISO Z39.87 for still image– AES X-098B for audio
JH VE2
Assessment
• Evaluation of characterization data on the basis of local policy to inform decisions and instigate actions
• Focus is on “instance risk” in an object
– JHOVE2 cannot answer, “Is JPEG 2000 a suitable archival format?”
– But it can answer, “I have a JPEG 2000 format policy; how does this object measure up to it?”
• Assessment…
– Makes technical metadata actionable
– Provides a common framework in which digital memory institutions can publish and share their policies
JH VE2
Assessment policy
• Expressed in a set of user-configured rules
• Any piece of characterization data is eligible for analysis
– Presence/absence of a particular value or informative/error message
– Particular range or combination of values
• Default policies associated with each format module, based on community best practice
– Easily modified, extended, or replaced to conform to local policy and practice
JH VE2
Assessment output
• Summary report of information not elsewhere summarized
• Ability to create custom preservation metadata
• Can be extended to initiate downstream processes in local workflows automatically
JH VE2
Assessment in workflows
• In a flow of homogenous objects, verifying specifications
– Am I receiving what I expected to receive?
• In a flow of heterogeneous objects, conducting triage
– Which risks are tolerable? Which aren’t?– What level of service can I offer the depositor?
JH VE2
Ingest workflow
Content
Metadata
Identification Validation Feature extract
Package SIP Unpackage
Content
Metadata
Identification Validation Feature extract
Metadata ′
Producer
Consistency Ingest
Archive
Policy rules
Assessment
Policy rules
Assessment
JH VE2
Migration workflow
Content
Metadata
Assessment
Policy rules
Migration
Content ′
Identification Validation Feature extract
Metadata ′
Equivalence (Re)IngestAIP Unpackage
JH VE2
Technical components
• Java
– java.nio package– 1.5 vs 1.6
• OSGi/Spring frameworks
– Component versioning and dependency management– Fine-grained control of component invocation– Inversion of control
• SourceForge
– Distribution platform– Issue tracking
JH VE2
Data abstraction
• Based on the “natural” conceptual structures of a format and their component attributes
– Each such structure maps to a class with methods for parsing, validating, reporting, and serializing
– Each such attribute maps to a field with accessor and mutator methods
• UTF-8 Character
• TIFF Image File Header and Image File Directory
• JPEG 2000 Box
• PDF boolean, number, string, name, array, dictionary, and stream
JH VE2
Community engagement
Wiki confluence.ucop.edu/display/JHOVE2Info/Home
Mailing lists JHOVE2-Announce-L
JHOVE2-Techtalk-L (Subscribe via the wiki)
JH VE2
Advisory board
• Deutsche Nationalbibliothek• Ex Libris• Fedora Commons• Florida Center for Library Automation• Harvard University• Koninklijke Bibliotheek• MIT/DSpace• National Archives (UK)• National Library of Australia• National Library of New Zealand• Planets project
JH VE2
Next steps?