Unified Digital Format Registry
a semantic registry for digital preservation

UDFR: A Semantic Registry for Format Representation Information

Lisa Dawn Colvin, Abhishek Salve, Stephen Abrams

UC Curation Center, California Digital Library

Digital Library Federation Forum, Baltimore, October 31-November 2, 2011

Outline

What
Why
How
When

Why formats?

“Format” is the dividing line between bits and information

ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d802280001000000640000000100030...

SOI, APP0 JFIF 1.2, APP13 IPTC, APP2 ICC, DQT, SOF0 183x512, DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...

Syntax Semantics
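
The hex dump on this slide is the start of a JPEG file, and its first bytes can be decoded by hand into the marker names listed beside it. The following is a minimal Python sketch of that step, assuming only the standard JPEG segment layout (a two-byte marker followed by a two-byte big-endian length that counts itself). The names `HEX_PREFIX`, `MARKER_NAMES`, and `leading_segments` are illustrative and do not come from UDFR.

```python
import struct

# First bytes of the hex dump shown on the slide. The slide truncates the
# dump, so only the leading segments can be decoded here.
HEX_PREFIX = (
    "ffd8"                                  # SOI
    "ffe000104a46494600010201008300830000"  # APP0 segment (complete)
    "ffed0fb050686f746f73686f7020332e3000"  # start of the APP13 segment
)

# Only the marker codes needed for this prefix.
MARKER_NAMES = {0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13"}


def leading_segments(data: bytes):
    """Yield (marker_name, declared_length, available_payload) tuples."""
    pos = 0
    while pos + 2 <= len(data) and data[pos] == 0xFF:
        code = data[pos + 1]
        name = MARKER_NAMES.get(code, f"0x{code:02X}")
        pos += 2
        if code == 0xD8:  # SOI carries no length field and no payload
            yield name, 0, b""
            continue
        if pos + 2 > len(data):
            break
        # The length field is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        yield name, length, data[pos + 2:pos + length]
        pos += length


segments = list(leading_segments(bytes.fromhex(HEX_PREFIX)))
for name, length, payload in segments:
    print(name, length, payload)
```

The APP0 payload begins with `JFIF` followed by the version bytes 1 and 2, which is the "APP0 JFIF 1.2" entry in the marker list. The APP13 payload begins with `Photoshop 3.0`, the wrapper in which IPTC metadata is commonly stored.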

Why formats?

There are many necessary preservation activities that can be usefully performed on bits qua bits

But to preserve information you must act on formatted bits and know what those formats mean
• Preservation of syntax and semantics

Unified Digital Format Registry

“A reliable, publicly accessible, and sustainable knowledge base of file format representation information for use by the digital preservation community”

• “Unification” of the function and holdings of PRONOM and GDFR
  http://www.nationalarchives.gov.uk/PRONOM
  http://gdfr.info/
• Open source platform / GPL
• Semantic wiki
• Funded by the Library of Congress

Timeline

PRONOM – National Archives [UK], 2002
http://www.nationalarchives.gov.uk/PRONOM

“ready access to reliable technical information about the nature of electronic records”

JHOVE – Harvard, 2003
http://hul.harvard.edu/jhove

“digital object validation and characterization”

GDFR – Harvard/OCLC, 2006
http://gdfr.info/

“a distributed and replicated registry of format information populated and vetted by experts and enthusiasts world-wide”

Timeline

UDFR – Ad hoc stakeholder community, 2009

• Resolve PRONOM IPR issues and develop a community-supported open source solution

• Advance beyond legacy RDBMS and XML database technology

UDFR – CDL, January 2011
http://udfr.org/

“a semantic registry for digital preservation”

• Stakeholder meeting, April 2011
• Beta release, November 2011
• Production release, January 2012

Representation information

What you need to know about something in order to exploit that thing meaningfully [OAIS/ISO 14721]

Information that lets you answer important preservation questions

• What format is it?
• What are its significant properties?
• Is it valid?
• Is it at risk?
• How can I render/play/read it?
• What can it be transformed into?
• And how?
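
The first of these questions, "What format is it?", is commonly answered by matching a file against signatures: internal ones (byte patterns inside the file) and external ones (such as the file name extension). The following is a minimal Python sketch under stated assumptions. The `SIGNATURES` table and the `identify` function are hypothetical illustrations, not UDFR or PRONOM data or APIs. The magic numbers are the widely documented ones for JPEG, PNG, and PDF.

```python
from pathlib import PurePath

# Hypothetical mini-registry: one internal signature (leading magic bytes)
# and a set of external signatures (extensions) per format.
SIGNATURES = [
    {"name": "JPEG", "magic": b"\xff\xd8\xff", "extensions": {".jpg", ".jpeg"}},
    {"name": "PNG", "magic": b"\x89PNG\r\n\x1a\n", "extensions": {".png"}},
    {"name": "PDF", "magic": b"%PDF-", "extensions": {".pdf"}},
]


def identify(filename: str, head: bytes) -> dict:
    """Identify a format from the first bytes of a file and its name."""
    ext = PurePath(filename).suffix.lower()
    # Prefer internal evidence: the bytes are harder to get wrong than a name.
    for sig in SIGNATURES:
        if head.startswith(sig["magic"]):
            return {
                "format": sig["name"],
                "basis": "internal",
                "extension_agrees": ext in sig["extensions"],
            }
    # Fall back to the external signature when no magic bytes match.
    for sig in SIGNATURES:
        if ext in sig["extensions"]:
            return {"format": sig["name"], "basis": "external", "extension_agrees": True}
    return {"format": None, "basis": None, "extension_agrees": False}


# A JPEG byte stream that has been given a misleading .png name.
result = identify("photo.png", bytes.fromhex("ffd8ffe000104a464946"))
print(result)
```

Reporting whether the extension agrees with the internal match is a design choice in this sketch: a disagreement is itself useful preservation information.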

Why semantic?

Everyone wants to say something about everything
• The semantic web lets anyone say anything about anything
• Understandable to both people and machines

Data modeling

[Diagram: UDFR class model]

Classes: Abstract Base, Abstract Product, Abstract Format, File Format, Character Encoding, Compression Algorithm, Media, Hardware, Software, Document, File, Agent, IPR, Holding, Process, Abstract Signature, External Signature, Internal Signature, Digest, Assessment, Grammar, Controlled Vocabulary …

Relationships: specification, reference, file, holder, owner, creator, maintainer, ipr, embodies, product, input / output, dependency, signature, digest, grammar, assessment

Provenance

“Trust, but verify”

• Complete change history at the assertion level, including
  – Who made the assertion, and when?
  – Confidence based on personal and institutional reputation
• Imprimatur by technically knowledgeable reviewers
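
One way to picture assertion-level change history is an append-only log in which every statement records who made it and when, and reviewer approval is stored alongside rather than overwriting anything. The Python sketch below is an illustration under those assumptions only. The `Assertion` and `AssertionLog` classes, the contributor names, and the `udfr:example-format` identifier are hypothetical and do not describe how UDFR or OntoWiki store provenance.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    value: str
    asserted_by: str       # who made the assertion
    asserted_at: datetime  # and when


class AssertionLog:
    """Append-only log: earlier assertions are never modified or removed."""

    def __init__(self) -> None:
        self._entries: List[Assertion] = []
        self._reviews: Dict[int, str] = {}  # entry id -> approving reviewer

    def record(self, subject: str, predicate: str, value: str, who: str,
               when: Optional[datetime] = None) -> int:
        stamp = when or datetime.now(timezone.utc)
        self._entries.append(Assertion(subject, predicate, value, who, stamp))
        return len(self._entries) - 1  # entry id

    def approve(self, entry_id: int, reviewer: str) -> None:
        self._reviews[entry_id] = reviewer

    def history(self, subject: str, predicate: str) -> List[Tuple[Assertion, Optional[str]]]:
        """All assertions about one property, oldest first, with reviewer if any."""
        return [(entry, self._reviews.get(i))
                for i, entry in enumerate(self._entries)
                if entry.subject == subject and entry.predicate == predicate]

    def current(self, subject: str, predicate: str) -> Optional[str]:
        past = self.history(subject, predicate)
        return past[-1][0].value if past else None


log = AssertionLog()
log.record("udfr:example-format", "rdfs:label", "Exmaple Format", "alice")
fixed = log.record("udfr:example-format", "rdfs:label", "Example Format", "bob")
log.approve(fixed, "carol")
```

Here the misspelled label stays in the history with its author, the corrected label becomes the current value, and the reviewer's approval attaches to one specific assertion rather than to the record as a whole.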

Ontologies

Prefix    Namespace
udfrs     http://udfr.org/onto#
udfr      http://udfr.org/udfr/
dc        http://purl.org/dc/elements/1.1/
dcterms   http://purl.org/dc/terms/
foaf      http://xmlns.com/foaf/0.1/
owl       http://www.w3.org/2002/07/owl#
pronom    http://reference.data.gov.uk/technical-registry/
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
xsd       http://www.w3.org/2001/XMLSchema#
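
A prefix table like this one is what lets a short name such as `rdfs:label` stand for a full IRI. The following Python sketch shows the expansion and its inverse. The `expand` and `compact` helpers are illustrative names, not part of any UDFR or OntoWiki API. The `foaf` and `xsd` entries use the namespace IRIs and prefix spelling published for those vocabularies.

```python
# Prefix-to-namespace map from the Ontologies slide. The foaf and xsd
# entries follow the published namespaces of those vocabularies.
PREFIXES = {
    "udfrs": "http://udfr.org/onto#",
    "udfr": "http://udfr.org/udfr/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "pronom": "http://reference.data.gov.uk/technical-registry/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand(name: str) -> str:
    """Turn a prefixed name such as 'rdfs:label' into a full IRI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {name!r}")
    return PREFIXES[prefix] + local


def compact(iri: str) -> str:
    """Turn a full IRI back into a prefixed name; return it unchanged if no prefix fits."""
    matches = [p for p, ns in PREFIXES.items() if iri.startswith(ns)]
    if not matches:
        return iri
    best = max(matches, key=lambda p: len(PREFIXES[p]))  # longest namespace wins
    return f"{best}:{iri[len(PREFIXES[best]):]}"


print(expand("rdfs:label"))
print(compact("http://purl.org/dc/terms/title"))
```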

Technology stack

OntoWiki
http://ontowiki.net/

Virtuoso 4store
http://virtuoso.openlinksw.com/

Zend Framework
http://www.zend.com/

PHP
http://www.php.net/

Apache httpd
http://httpd.apache.org/

RDF
http://www.w3.org/RDF

JavaScript / CSS

HTTP / SPARQL

Erfurt / RDFAuthor
http://aksw.org/Projects/Erfurt
https://github.com/AKSW/RDFauthor

Initial population

Export from PRONOM
• Working with TNA to identify appropriate subset
• Transform to cross-walk modeling differences

Licensing

Code is available under GPLv3
http://www.gnu.org/copyleft/gpl.html
• Hosted on BitBucket
  http://www.bitbucket.org/udfr

Data is contributed and available under CC-BY
http://creativecommons.org/licenses/by/3.0/
• Consistent with UK open government license applicable to PRONOM data
  http://www.nationalarchives.gov.uk/doc/open-government-licence

Demo

Lessons learned

People with semantic experience are scarce
Too much time evaluating/prototyping potential technology choices
More difficulty than anticipated integrating disparate open source products
0.x software is often numbered that for a reason
Feature lists aren’t (always)

Lessons learned

Availability of a worldwide selection of products is a good thing (except when you don’t read German)
• Excellent support from AKSW/Universität Leipzig

Modeling differences
• RDF (non-)standards

VM deployment
• Disparate IT organizations supporting dev/prod instances

Next steps

Long-term governance and operational support
Technical maintenance and enhancement
Replication/synchronization
Building contributor and reviewer communities

For more information

UDFR
http://udfr.org/
http://bitbucket.org/udfr

PRONOM
http://www.nationalarchives.gov.uk/PRONOM

GDFR
http://gdfr.info/

OntoWiki
http://ontowiki.net/Projects/OntoWiki

Virtuoso
http://www.openlinksw.com/dataspace/dav/wiki/Main/VOSRDFWP

Agile Knowledge and Semantic Web (AKSW), Universität Leipzig
http://aksw.org/

UC3
http://www.cdlib.org/uc3

Stephen Abrams, Lisa Colvin, Patricia Cruse, Scott Fisher, Erik Hetzner, Greg Janée, John Kunze, Margaret Low, David Loy, Mark Reyes, Abhishek Salve, Tracy Seneca, Joan Starr, Carly Strasser, Marisa Strong, Adrian Turner, Perry Willett