Tackling concrete digital preservation challenges with SPRUCE Paul Wheatley SPRUCE Project Manager University of Leeds Twitter: @prwheatley http://openplanetsfoundation.org/blogs/paul

Tackling concrete digital preservation challenges with SPRUCE

Paul Wheatley

SPRUCE Project Manager

University of Leeds

Twitter: @prwheatley

http://openplanetsfoundation.org/blogs/paul

Summary

• Some digital preservation challenges and solutions
–Not exhaustive
–Illustrate with some real examples
–Summarise with some practical steps for digital preservation

• Taking a community approach to digital preservation
–SPRUCE Project
–How to get involved
–Where to get help

Keeping the bits

001011001011101010101010101010111001010101010010010010110101010101111010100100010111101010101010110101110010101001010100101010100101010101001010111010111

Digital data is fragile

Courtesy of State and University Library, Denmark

Digital preservation storage: keeping the bits

• Media decay
–Media becomes partially or completely unreadable
• Media obsolescence
–Without the respective hardware to read the (hand held) media, it becomes inaccessible
• Practical issues
–Inserting lots of discs into a drive is costly

Images courtesy of The British Library

Bit storage recommendations

• Don’t fall for media longevity claims from vendors! They are missing the point!

• Accept that media decays, media formats will change, and any media will become inaccessible in the medium term

• Rather than putting your data in a dark archive and trusting it will survive for a long period...

• Manage it closely, refresh to new media frequently, choose media that is easy to manage

–Choose media that is easy to access (server storage, cloud, external hard drives)

–Make at least 3 copies of all data, keep copies in different geographical locations

–Frequently check the condition of your data

Are any of your digital files missing?

Are any of your digital files damaged?

Verifiable Manifests (Checksums!)

• Single most useful digital preservation activity
• Generate manifests as early as possible
• Frequently re-check them over time
• Mend content when necessary
• LoC BagIt specification and Bagger tool

• Allow you to easily check the condition of your digital stuff
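
A minimal sketch of a verifiable manifest, to make the two questions on this slide concrete. It is not the BagIt format or the Bagger tool. The function names (`sha256_of`, `build_manifest`, `check_manifest`) and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def check_manifest(root, manifest: dict[str, str]):
    """Return (missing, damaged) file names compared with an earlier manifest."""
    current = build_manifest(root)
    missing = sorted(name for name in manifest if name not in current)
    damaged = sorted(
        name
        for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return missing, damaged
```

Generate the manifest once, as early as possible, and store it outside the collection folder. Re-running `check_manifest` later answers "are any files missing?" and "are any files damaged?" separately.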

Dependence on software

[Figure: a JPEG file shown as a sequence of marker segments (SOI, APP0 JFIF 1.2, APP13 IPTC, APP2 ICC, DQT, SOF0 200x392, DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2…) alongside the raw bit stream]
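
The segment names on this slide (SOI, APP0, DQT and so on) only appear once software interprets the bits. As a rough sketch of that dependence, the code below walks the marker segments of a JPEG byte string up to the start of scan. The function name `list_jpeg_segments` and the small name table are illustrative assumptions, and this is a simplified reader, not a validator.

```python
import struct

# Markers with no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))

# A few marker names, for readable output only (not a complete table).
NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13", 0xDB: "DQT",
    0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS", 0xD9: "EOI",
}


def list_jpeg_segments(data: bytes):
    """Return (name, offset, payload_length) for each segment up to SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = [("SOI", 0, 0)]
    pos = 2
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker: skip it
            pos += 1
            continue
        name = NAMES.get(marker, f"0x{marker:02X}")
        if marker in STANDALONE:
            segments.append((name, pos, 0))
            pos += 2
            continue
        if pos + 4 > len(data):
            raise ValueError("truncated segment header")
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2 or pos + 2 + length > len(data):
            raise ValueError(f"truncated {name} segment at offset {pos}")
        segments.append((name, pos, length - 2))
        if marker == 0xDA:  # entropy-coded data follows; stop here
            break
        pos += 2 + length
    return segments
```

Without a reader like this, the same file is just the undifferentiated bit stream shown on the slide.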

When it goes wrong…

Migration, Emulation and all that...

• Migrate content from an obsolete format to a more modern usable format

• Emulate the original computing environment and run the obsolete software originally used

• Words of caution:
–Is software obsolescence really a critical risk for our digital data?
–The debate continues... International Council on Archives Congress 2012:
  • Michael Carden, National Archives Australia: NAA migrates all content
  • Oliver Morley, UK National Archives: “digital formats have standardized”
  • Blogged by Inge Angevaare: http://www.ncdd.nl/blog/?p=2786
–The hard part is the quality assurance of the results. Was anything lost or damaged in the process?

Stuff happens!

• Whenever a digital collection is moved, processed, curated or altered in any way... things can go wrong!

• Network dropouts at critical times
• Disks get full, and subsequent data copied there is lost
• Software bugs lead to unexpected results
• Human error leads to all sorts of issues

• Stuff happens a lot more at scale!
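
One way to catch the failures listed on this slide is to verify every copy rather than trusting that it succeeded. The sketch below copies a file and compares checksums before and after. The names `file_digest` and `copy_and_verify` are hypothetical, and SHA-256 is an assumed choice. Reading the copy back immediately may be served from the operating system's cache, so this does not replace later re-checks.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination) -> str:
    """Copy a file, then confirm the copy matches the source checksum."""
    expected = file_digest(source)
    shutil.copy2(source, destination)  # also copies timestamps where possible
    actual = file_digest(destination)
    if actual != expected:
        raise OSError(
            f"copy of {source} to {destination} is corrupt: "
            f"expected {expected}, got {actual}"
        )
    return actual
```

A full disk or a dropped connection then surfaces as an exception at copy time instead of as silent data loss discovered years later.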

Digitisation post processing corruption

Images courtesy of The British Library

TIFF to JPEG2000 migration corruption

Images courtesy of The British Library

JPEG2000

Format specification ambiguity and corresponding tool bugs

JPEG2000s can be missing vital source resolution

Technology can be imperfect!

• For more on JPEG2000 format and tool risks see: http://wiki.opf-labs.org/display/TR/JP2

Images courtesy of The British Library

Assume nothing, validate everything

• Only process or alter digital content when it is absolutely necessary
• Double check everything
• Make no assumptions

First steps in practical digital preservation

Prompt check in – have you got what you thought you would receive?
– Check expected files are present, open a random selection to verify expected quality
– Request replacements from supplier promptly

Create a verifiable manifest
– Create a top down manifest file that lists each digital object in your collection as a relative filename and a checksum
– Library of Congress BagIt specification and tools will also do a good job here

Make at least 3 copies. Protect the bits
– Keep a copy on easily accessible media
– Back up to tape or more disk. Keep copies in different geographical locations to avoid catastrophic disaster. Cloud storage is also an option.

Frequently inspect the condition of your data
– Revisit the collection, recalculate your manifests and verify content has not been lost
– Do a test recovery of your backups to ensure they are working effectively!

Record the existence of each of your collections in a digital items register
– Record: What it is, who is the responsible owner, where it is, who owns it, and who can access it.

Assume nothing, validate everything!
– Double check any processes in the lifecycle that move or alter your digital content
– Built-in checks can be flawed; a second opinion is much more trustworthy
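
The "prompt check in" step can be sketched as follows, assuming the supplier provided a list of expected file names. The function name `check_in`, its parameters and the default sample size of 5 are illustrative assumptions. The sampled files still need to be opened and inspected by a person.

```python
import random
from pathlib import Path


def check_in(delivery_root, expected_names, sample_size=5, seed=None):
    """Compare a delivery against the expected file list.

    Returns (missing, unexpected, sample): expected files that did not
    arrive, files that arrived but were not expected, and a random
    selection of delivered files to open and inspect by eye.
    """
    root = Path(delivery_root)
    delivered = {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    }
    expected = set(expected_names)
    missing = sorted(expected - delivered)
    unexpected = sorted(delivered - expected)
    present = sorted(expected & delivered)
    rng = random.Random(seed)  # pass a seed to make the selection repeatable
    sample = rng.sample(present, min(sample_size, len(present)))
    return missing, unexpected, sample
```

Anything in `missing` is the list to send back to the supplier when requesting replacements.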

SPRUCE Project

Sustainable Preservation Using Community Engagement

• JISC funded
• 2 years in length (until Nov 2013)
• £250k funding

http://wiki.opf-labs.org/display/SPR

Some observations

• Lack of focus on the real needs of digital preservation practitioners

• Insufficient collaboration + coordination
• Duplication of effort

The SPRUCE Mashup: Identify and Solve concrete problems

• 3 day workshop for ~30 people
• Practitioners bring along digital collections
• We identify preservation challenges
• Pair up practitioners with technical experts
• Apply existing open source tools to solve the problems
• In doing so, we exchange knowledge about digital preservation
• Develop a supportive community

Glasgow Mashup, April 2012

What questions do practitioners want answered?

–What is this digital collection?
–What risks are associated with this digital collection?
–Separate collection content from temporary/other files.
–Identify and weed duplicate or similar files.
–Is the metadata consistent with the content?
–Are all the pages present in each issue?
–Are all digitised pages in focus?
–Are any files damaged?
–Are the files compliant with a particular profile?

• See the results here: http://bit.ly/spruce-results
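
Of these questions, "identify duplicate files" is the easiest to sketch in code. The example below groups files with identical content by checksum. It finds exact duplicates only, not "similar" files, and it is not one of the tools used at the mashups. The name `find_duplicates` and the use of SHA-256 are assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root) -> list[list[str]]:
    """Group files under root whose contents are byte-for-byte identical."""
    root = Path(root)
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        by_digest[digest.hexdigest()].append(path.relative_to(root).as_posix())
    # Keep only checksums shared by more than one file.
    return [names for names in by_digest.values() if len(names) > 1]
```

Each returned group is a set of candidates for weeding. Deciding which copy to keep remains a curatorial decision.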

Make it sustainable

• Work with practitioners to develop a business case for their work
• Make small funding awards to further develop and embed the work begun in the mashups

York Mashup, September 2011

Online collaboration

• Sharing requirements
• Sharing experiences: what tools worked well, what approaches should be avoided
• Building on existing tools, rather than re-inventing the wheel
• Libraries + Information Science question and answer site:
–http://libraries.stackexchange.com/
• More recommended collaborative activities:
–http://bit.ly/spruce-collaborate

Thanks for listening! Any questions?

Paul Wheatley

SPRUCE Project Manager

University of Leeds

Twitter: @prwheatley

http://openplanetsfoundation.org/blogs/paul