28
JH VE2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library Sheila Morrissey Evan Owens Portico Tom Cramer Keith Johnson Stanford University

JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

Embed Size (px)

Citation preview

Page 1: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

JHOVE2A Next-Generation Architecture for

Format-Aware CharacterizationBritish Library, 1 October 2008

Stephen AbramsCalifornia Digital Library

Sheila MorrisseyEvan Owens

Portico

Tom CramerKeith JohnsonStanford University

Page 2: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Thanks to…

• Koninklijke Bibliotheek our meeting sponsor

• British Library our meeting host

• Library of Congress our funding agency

Page 3: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Agenda

• Introductions

• Review of objectives and agenda

• Project goals, deliverables, and schedule

• New terminology

• Functional requirements

• Assessment

• Technology options

• Community engagement

• Next steps

Page 4: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Introductions

• Who are you?

• Where are you from?

• What is your level of involvement with JHOVE?

Page 5: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Objectives

• Getting feedback on the utility and achievability of project goals, deliverables, and schedule

• Introduce new terminology and concepts

• Refine functional requirements

• Understand what assessment means in preservation workflows

• Propose community engagement plan

• Gauge the potential for development partnerships

• Solicit interesting test data

Page 6: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Why JH VE2?

• Preservation requires management of the gap between what you were given and what you need

• That gap is only manageable if it is quantifiable

• Characterization tells you what you have, as the starting point for iterative preservation planning and action

Adopted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 2:1 (June 2007): 3-11.

Characterization

Preservation action

Preservation planning

Page 7: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

JH VE2 project

• A next-generation architecture for format-aware object characterization

– Three-fold goals:

• Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement

• Provide significant new function

• Implement modules

• Collaborative project of CDL, Portico, and Stanford University

– Funded by Library of Congress/NDIIPP– Open source BSD license

Page 8: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

New function

• Complex object data model

• Generic plug-in interface

• Common data structure passed between modules to enable stateful processing

• Identification de-coupled from validation

• Standardized handling of format profiles and error reporting

• Symbolic display of binary formats

• API-level support for editing

Page 9: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Format support

• Based on project partner requirements and budgetary constraints

– Image: JPEG 2000, TIFF– Audio: WAVE– Text: SGML, UTF-8, XML– Document: PDF– GIS: Shapefile– Color: ICC– And their well-known variants, e.g. TIFF/IT, TIFF/EP, GeoTIFF,

EXIF, DNG, …

• Unfortunately precluding some JHOVE-supported formats

– AIFF, GIF, HTML, JPEG

Page 10: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Schedule

• Months 1-6 Outreach, design, and prototyping

• Months 7-9 Core APIs and framework

• Months 10-24 Module implementation

Page 11: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Terminology

• What is it?– Identification Determining presumptive format through

signature matching

• What is it, really?– Validation Determining conformance to commonly-

accepted normative requirements

• What about it?– Feature extraction Reporting intrinsic properties significant

to preservation planning and action

• What should you do with it?– Assessment Determining acceptability for a given

purpose on the basis of locally-defined policies

Page 12: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Objects, not files

• JHOVE assumed 1 object = 1 file = 1 format

• But what about…

– TIFF with embedded ICC profile and XMP metadata

1 object = 1 file = 3 formats

– JPEG 2000 JPX fragmentation

1 object = n files = 1 format

– ESRI Shapefile

1 object = 3 files = 3 formats

• JHOVE2 will support 1 object = n files = m formats

Page 13: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Objects, not files

ICC

XMP

TIFF

abcd.tif

dBASE IV

1234.dbf

SHP

1234.shp

SHX

1234.shxShapefile

SHPdBASE IV SHX

Source units Reportable units

TIFF

XMPICC

Page 14: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Functional requirements

• JHOVE2 will provide a highly configurable, extensible, and scalable framework for preservation characterization

• JHOVE2 can process an arbitrary number of source units during a single invocation

• JHOVE2 function will be encapsulated into granular plug-in modules that can be installed in and invoked by the framework

• The output of characterization will be expressed in an intermediate XML form, with further presentation via XSL

• JHOVE2 can be invoked through an API or as a stand-alone application with a command line interface

Page 15: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Identification requirements

• Based on extrinsic hints and internal/external signatures

– Aggregate-level identification signatures are based on file-level properties

• Identification may be possible for formats than feature for which JHOVE2 feature extraction or validation is not supported

• Identification will be reported in terms of level of confidence

• Identification will be reported in terms of all known common names and public identifiers

Page 16: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Validation requirements

• Conformance reported in terms defined by the format itself

• The applicability of ambiguous conformance requirements will be under local configuration control

• Validation will be exhaustive, documenting all identified violations (not merely the first)

• Ability to distinguish syntactic and semantic conformance

• Error and informative messages will reference specific passages in the relevant specifications

Page 17: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Feature extraction requirements

• Control over the granularity of parsing and reporting will be local configuration option

• General reportable properties will include:

– Source unit name and modification date– Reportable unit size and offset– Format identifiers and version– Format profile

• Genre- and format-specific properties will be reported in terms of well-known data dictionaries and schemas

– ANSI/NISO Z39.87 for still image– AES X-098B for audio

Page 18: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Assessment

• Evaluation of characterization data on the basis of local policy to inform decisions and instigate actions

• Focus is on “instance risk” in an object

– JHOVE2 cannot answer, “Is JPEG 2000 a suitable archival format?”

– But it can answer, “I have a JPEG 2000 format policy; how does this object measure up to it?”

• Assessment…

– Makes technical metadata actionable

– Provides a common framework in which digital memory institutions can publish and share their policies

Page 19: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Assessment policy

• Expressed in a set of user-configured rules

• Any piece of characterization data is eligible for analysis

– Presence/absence of a particular value or informative/error message

– Particular range or combination of values

• Default policies associated with each format module, based on community best practice

– Easily modified, extended, or replaced to conform to local policy and practice

Page 20: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Assessment output

• Summary report of information not elsewhere summarized

• Ability to create custom preservation metadata

• Can be extended to initiate downstream processes in local workflows automatically

Page 21: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Assessment in workflows

• In a flow of homogenous objects, verifying specifications

– Am I receiving what I expected to receive?

• In a flow of heterogeneous objects, conducting triage

– Which risks are tolerable? Which aren’t?– What level of service can I offer the depositor?

Page 22: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Ingest workflow

Content

Metadata

Identification Validation Feature extract

Package SIP Unpackage

Content

Metadata

Identification Validation Feature extract

Metadata ′

Producer

Consistency Ingest

Archive

Policy rules

Assessment

Policy rules

Assessment

Page 23: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Migration workflow

Content

Metadata

Assessment

Policy rules

Migration

Content ′

Identification Validation Feature extract

Metadata ′

Equivalence (Re)IngestAIP Unpackage

Page 24: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Technical components

• Java

– java.nio package– 1.5 vs 1.6

• OSGi/Spring frameworks

– Component versioning and dependency management– Fine-grained control of component invocation– Inversion of control

• SourceForge

– Distribution platform– Issue tracking

Page 25: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Data abstraction

• Based on the “natural” conceptual structures of a format and their component attributes

– Each such structure maps to a class with methods for parsing, validating, reporting, and serializing

– Each such attribute maps to a field with accessor and mutator methods

• UTF-8 Character

• TIFF Image File Header and Image File Directory

• JPEG 2000 Box

• PDF boolean, number, string, name, array, dictionary, and stream

Page 26: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Community engagement

Wiki confluence.ucop.edu/display/JHOVE2Info/Home

Mailing lists JHOVE2-Announce-L

JHOVE2-Techtalk-L (Subscribe via the wiki)

Page 27: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Advisory board

• Deutsche Nationalbibliothek• Ex Libris• Fedora Commons• Florida Center for Library Automation• Harvard University• Koninklijke Bibliotheek• MIT/DSpace• National Archives (UK)• National Library of Australia• National Library of New Zealand• Planets project

Page 28: JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library

JH VE2

Next steps?