Tackling concrete digital preservation challenges with SPRUCE
Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
http://openplanetsfoundation.org/blogs/paul
Summary
• Some digital preservation challenges and solutions
–Not exhaustive
–Illustrate with some real examples
–Summarise with some practical steps for digital preservation
• Taking a community approach to digital preservation
–SPRUCE Project
–How to get involved
–Where to get help
Keeping the bits
Digital data is fragile
Courtesy of State and University Library, Denmark
• Media decay
• Media becomes partially or completely unreadable
• Media obsolescence
• Without the respective hardware to read the (hand held) media, it becomes inaccessible
• Practical issues
• Inserting lots of discs into a drive is costly
Images courtesy of The British Library
Digital preservation storage: keeping the bits
Bit storage recommendations
• Don’t fall for media longevity claims from vendors! They are missing the point!
• Accept that media decays, media formats will change, and any media will become inaccessible in the medium term
• Rather than putting your data in a dark archive and trusting it will survive for a long period...
• Manage it closely, refresh to new media frequently, choose media that is easy to manage
–Choose media that is easy to access (server storage, cloud, external hard drives)
–Make at least 3 copies of all data, keep copies in different geographical locations
–Frequently check the condition of your data
Are any of your digital files missing?
Are any of your digital files damaged?
Verifiable Manifests (Checksums!)
• Single most useful digital preservation activity
• Generate manifests as early as possible
• Frequently re-check them over time
• Mend content when necessary
• LoC BagIt specification and Bagger tool
• Allow you to easily check the condition of your digital stuff
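A manifest of this kind can be sketched in a few lines of Python. This is only an illustration, not an implementation of the BagIt specification: the function names, the choice of SHA-256 and the "checksum, two spaces, relative path" line layout are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(collection_dir: Path) -> dict[str, str]:
    """Map each file's relative path (POSIX style) to its checksum."""
    manifest = {}
    for path in sorted(collection_dir.rglob("*")):
        if path.is_file():
            name = path.relative_to(collection_dir).as_posix()
            manifest[name] = sha256_of(path)
    return manifest


def write_manifest(manifest: dict[str, str], manifest_path: Path) -> None:
    """Write one '<checksum>  <relative path>' line per file (assumed layout)."""
    lines = [f"{checksum}  {name}" for name, checksum in sorted(manifest.items())]
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Write the manifest file outside the collection directory, or it will be listed as a collection file the next time the manifest is built.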
SOI
APP0 JFIF 1.2
APP13 IPTC
APP2 ICC
DQT
SOF0 200x392
DRI
DHT
SOS
ECS0
RST0
ECS1
RST1
ECS2…
1010101010111101000101010100010010100101110100101010101001001001010101001000001010101010111110100010101010001001010010111010010101010100100100101010100100000101010101011111010001010101000100101001011101001010101010010010010101010010000010101010101101010101111010001010101000100101001011101001010101010010010010101010010000010101010101111101000101010100010010100101110100101010101001001001010101001000001010101010111110100010101010001001010010111010010101010100100…
Dependence on software
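To make the dependence on software concrete, here is a rough sketch of how a program turns a raw byte stream into the marker listing shown on this slide. It assumes a baseline JPEG stream, walks only the header segments up to the SOS marker, ignores optional padding bytes between segments, and does not decode the entropy-coded data (the ECS/RST parts). The function name and the small marker table are made up for the example.

```python
import struct

# Names for a few common JPEG markers, keyed by the byte that follows 0xFF.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}


def list_header_segments(data: bytes) -> list[str]:
    """List marker segments from SOI up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    names = ["SOI"]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        # Segment length is a big-endian 16-bit value after the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        names.append(MARKER_NAMES.get(code, f"0x{code:02X}"))
        if code == 0xDA:  # entropy-coded data follows; stop here
            break
        pos += 2 + length  # length counts its own two bytes, not the marker
    return names
```

Without code like this (and the far larger decoder behind it), the same file is just the undifferentiated run of bits shown above.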
When it goes wrong…
Migration, Emulation and all that...
• Migrate content from an obsolete format to a more modern usable format
• Emulate the original computing environment and run the obsolete software originally used
• Words of caution:
–Is software obsolescence a really critical risk for our digital data?
–The debate continues... International Council on Archives Congress 2012:
• Michael Carden, National Archives Australia: NAA migrates all content
• Oliver Morley, UK National Archives: “digital formats have standardized”
• Blogged by Inge Angevarre: http://www.ncdd.nl/blog/?p=2786
–The hard part is the quality assurance of the results. Was anything lost or damaged in the process?
Stuff happens!
• Whenever a digital collection is moved, processed, curated or altered in any way.... things can go wrong!
• Network dropouts at critical times
• Disks get full, subsequent data copied there is lost
• Software bugs lead to unexpected results
• Human error leads to all sorts of issues
• Stuff happens a lot more at scale!
Digitisation post processing corruption
Images courtesy of The British Library
TIFF to JPEG2000 migration corruption
Images courtesy of The British Library
JPEG2000
Format specification ambiguity and corresponding tool bugs
JPEG2000s can be missing vital source resolution
Technology can be imperfect!
• For more on JPEG2000 format and tool risks see: http://wiki.opf-labs.org/display/TR/JP2
Images courtesy of The British Library
• Only process or alter digital content when it is absolutely necessary
• Double check everything
• Make no assumptions
Assume nothing, validate everything
Prompt check in – have you got what you thought you would receive?
– Check expected files are present, open a random selection to verify expected quality
– Request replacements from supplier promptly
Create a verifiable manifest
– Create a top down manifest file that lists each digital object in your collection as a relative filename and a checksum
– Library of Congress BagIt specification and tools will also do a good job here
Make at least 3 copies. Protect the bits
– Keep a copy on easily accessible media
– Back up to tape or more disk. Keep copies in different geographical locations to avoid catastrophic disaster. Cloud storage is also an option.
Frequently inspect the condition of your data
– Revisit the collection, recalculate your manifests and verify content has not been lost
– Do a test recovery of your backups to ensure they are working effectively!
Record the existence of each of your collections in a digital items register
– Record: What it is, who is the responsible owner, where it is, who owns it, and who can access it.
Assume nothing, validate everything!
– Double check any processes in the lifecycle that move or alter your digital content
– Built in checks can be flawed, a second opinion is much more trustworthy
First steps in practical digital preservation
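The "frequently inspect" step can be sketched as a comparison between the files on disk and a previously recorded manifest. This is a minimal illustration under stated assumptions: the manifest is taken to be a dictionary of relative path to SHA-256 checksum, and the function names and the three report categories are invented for the example.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_collection(collection_dir: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare files on disk against a {relative path: checksum} manifest."""
    on_disk = {
        p.relative_to(collection_dir).as_posix()
        for p in collection_dir.rglob("*")
        if p.is_file()
    }
    recorded = set(manifest)
    # Listed in the manifest but no longer present on disk.
    missing = sorted(recorded - on_disk)
    # Present on disk but never recorded in the manifest.
    unexpected = sorted(on_disk - recorded)
    # Present and recorded, but the content no longer matches.
    damaged = sorted(
        name for name in recorded & on_disk
        if file_checksum(collection_dir / name) != manifest[name]
    )
    return {"missing": missing, "damaged": damaged, "unexpected": unexpected}
```

Running the same check against each of the three copies shows which copy can be used to mend the others.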
• JISC funded
• 2 years in length (until Nov 2013)
• £250k funding
http://wiki.opf-labs.org/display/SPR
SPRUCE Project
Sustainable Preservation Using Community Engagement
Some observations
• Lack of focus on the real needs of digital preservation practitioners
• Insufficient collaboration + coordination
• Duplication of effort
• 3 day workshop for ~30 people
• Practitioners bring along digital collections
• We identify preservation challenges
• Pair up practitioners with technical experts
• Apply existing open source tools to solve the problems
• In doing so, we exchange knowledge about digital preservation
• Develop a supportive community
The SPRUCE Mashup: Identify and Solve concrete problems
Glasgow Mashup
April 2012
–What is this digital collection?
–What risks are associated with this digital collection?
–Separate collection content from temporary/other files.
–Identify and weed duplicate or similar files.
–Is the metadata consistent with the content?
–Are all the pages present in each issue?
–Are all digitised pages in focus?
–Are any files damaged?
–Are the files compliant with a particular profile?
• See the results here: http://bit.ly/spruce-results
What questions do practitioners want answered?
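One of these questions, identifying duplicate files, lends itself to a short sketch: group files by checksum and report any group with more than one member. This is an illustration only, not one of the tools used at the mashups. It finds byte-for-byte duplicates, not "similar" files, and the function name is made up for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(collection_dir: Path) -> list[list[str]]:
    """Return groups of relative paths whose file contents are identical."""
    groups = defaultdict(list)
    for path in sorted(collection_dir.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch,
            # but large files would be better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.relative_to(collection_dir).as_posix())
    return [names for names in groups.values() if len(names) > 1]
```

Review each group by hand before weeding anything: identical content under different names can be intentional.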
• Work with practitioners to develop a business case for their work
• Make small funding awards to further develop and embed the work begun in the mashups
Make it sustainable
York Mashup
September 2011
• Sharing requirements
• Sharing experiences: what tools worked well, what approaches should be avoided
• Building on existing tools, rather than re-inventing the wheel
• Libraries + Information Science question and answer site:
–http://libraries.stackexchange.com/
• More recommended collaborative activities:
–http://bit.ly/spruce-collaborate
Online collaboration
Thanks for listening! Any questions?
Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
Email: [email protected]
http://openplanetsfoundation.org/blogs/paul