Challenges of Digital Preservation MA / CS 109 April 22, 2011 Andrea Goethals Manager of Digital Preservation & Repository Services Harvard Library


Page 1: Challenges of Digital Preservation

Challenges of Digital Preservation
MA / CS 109
April 22, 2011

Andrea Goethals
Manager of Digital Preservation & Repository Services
Harvard Library

Page 2: Challenges of Digital Preservation
Page 3: Challenges of Digital Preservation

“Digital Content”?
Digitized (born-analog)
Born-digital
◦ Tweets
◦ Web sites
◦ Email
◦ Documents
    PDF
    Word, OpenOffice …
    Spreadsheets
◦ Data sets

Page 4: Challenges of Digital Preservation

Digital content is not new
1957: 1st digital image
1969: ARPAnet
1971: 1st email sent
1972: 1st consumer-level video game
1975: 1st digital camera

Image: Russell Kirsch’s son (source: NIST)

Page 5: Challenges of Digital Preservation

But has only recently exploded

1998: 1st Google index
◦ 26 million pages
2000: Google index
◦ 1 billion pages
2008: Google link processors
◦ 1 trillion unique URIs
◦ “… and the number of individual Web pages out there is growing by several billion pages per day” (from the official Google blog)

Page 6: Challenges of Digital Preservation

The coming tsunami
2010: estimated at 1.2 ZB (1 ZB is 1 billion TB)
◦ DVDs stacked from Earth to the Moon and back
2020: expected to grow by a factor of 44 to 35 ZB
◦ DVDs stacked halfway to Mars
Source: 2010 IDC Digital Universe Study sponsored by EMC
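A quick arithmetic check of these figures, as a minimal sketch. It assumes decimal (SI) units; the constant and function names are illustrative only and do not come from the IDC study.

```python
# Decimal (SI) storage units are assumed; binary units (TiB, ZiB) differ slightly.
TB = 10**12  # bytes in one terabyte
PB = 10**15  # bytes in one petabyte
ZB = 10**21  # bytes in one zettabyte


def zb_to(unit_bytes: int, zettabytes: float) -> float:
    """Convert a size in zettabytes to the unit whose byte count is unit_bytes."""
    return zettabytes * ZB / unit_bytes


print(f"1 ZB   = {zb_to(TB, 1):,.0f} TB")
print(f"1.2 ZB = {zb_to(PB, 1.2):,.0f} PB")
# Baseline implied by "a factor of 44 to 35 ZB"
print(f"35 ZB / 44 = {35 / 44:.2f} ZB")
```

The last line gives roughly 0.8 ZB, which suggests the factor of 44 is measured from a baseline somewhat below the 2010 estimate of 1.2 ZB.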

Page 7: Challenges of Digital Preservation

Outpacing storage

Source: 2009 IDC Digital Universe Study sponsored by EMC

Page 8: Challenges of Digital Preservation

Why do we care?

Page 9: Challenges of Digital Preservation

May be historically significant

Captured March 19, 2011 for a Japan Earthquake collection created by Virginia Tech, Internet Archive (http://www.archive-it.org/public/collection.html?id=2438)

Page 10: Challenges of Digital Preservation

May be a work of art

YouTube Play. A Biennial of Creative Video (Oct. 2010 -)

Page 11: Challenges of Digital Preservation

May be an important reference

Only available in digital form

Page 12: Challenges of Digital Preservation

May only be possible digitally

Page 13: Challenges of Digital Preservation

Who cares?
Cultural heritage institutions
◦ Libraries, archives
◦ Museums, historical societies
◦ Academic institutions
Governments
Entertainment, news and media industry
Scientific community
Funding bodies (NSF, NIH)
You?

Page 14: Challenges of Digital Preservation

Preservation historically
Archives and libraries have been preserving all kinds of analog material for centuries using:
◦ Environmental control
◦ Conservation treatments
Can store away until resources allow processing
◦ Benign neglect approach works well

Page 15: Challenges of Digital Preservation

Analog content is fairly durable
Even damaged, may still be identifiable, readable, usable

Image: Anatolian Cuneiform Tablet, circa 1850 BCE

Page 16: Challenges of Digital Preservation

In contrast, digital content is
Easily destroyed
Transient
Hidden
In need of more active attention; the benign neglect approach doesn’t work

Page 17: Challenges of Digital Preservation

Digital content is easily destroyed
Bad people
Hardware or software failures
Human mistakes

◦ The slip of a finger can lead to catastrophic results

◦ “Help! Accidental deletion. I accidentally deleted 62 images… can you please recover them from backups?”

Page 18: Challenges of Digital Preservation

Digital content is transient
Average lifespan of a Web site is between 44 and 100 days

Images: captured April 8, 2009; visited October 13, 2010

Page 19: Challenges of Digital Preservation

Digital content is hidden
Which is corrupt?

Page 20: Challenges of Digital Preservation

Digital content is hidden
Both. Use helps, but it’s not enough to detect corruption.

Page 21: Challenges of Digital Preservation

But is it usable???
It’s not enough to preserve the digital bits
◦ AppleWorks?
◦ WordStar?
◦ Excel 1.0?

To use digital content we need software that can read the format

Page 22: Challenges of Digital Preservation

Reading formats

ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...

Page 23: Challenges of Digital Preservation

Reading formats

ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...

SOI
APP0 JFIF 1.2
APP13 IPTC
APP2 ICC
DQT
SOF0 183x512
DRI
DHT
SOS
ECS0
RST0
ECS1
RST1
ECS2
...
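How the first bytes of the dump map onto the first markers can be sketched as follows. This is a minimal illustration, not a full JPEG parser: it decodes only the first 24 bytes copied from the slide, and the names `HEX_PREFIX`, `MARKER_NAMES` and `parse_jfif_prefix` are made up for this example.

```python
import struct

# First 24 bytes of the hex dump on the slide; the slide truncates the rest with "...".
HEX_PREFIX = "ffd8ffe000104a46494600010201008300830000ffed0fb0"

# A few JPEG marker codes (the byte following 0xFF), enough for this prefix.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}


def parse_jfif_prefix(data: bytes) -> dict:
    """Decode SOI plus the APP0/JFIF segment and name the marker that follows."""
    if data[0:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    if data[2:4] != b"\xff\xe0":
        raise ValueError("expected an APP0 segment after SOI")
    # APP0 payload: length, identifier, version, density units,
    # X/Y density, thumbnail width/height (all big-endian).
    length, ident, major, minor, units, x_density, y_density, _tw, _th = (
        struct.unpack(">H5sBBBHHBB", data[4:20])
    )
    if ident != b"JFIF\x00":
        raise ValueError("APP0 segment is not JFIF")
    offset = 4 + length  # the next marker starts right after the APP0 segment
    if data[offset] != 0xFF:
        raise ValueError("expected a marker after APP0")
    return {
        "identifier": ident.rstrip(b"\x00").decode("ascii"),
        "version": f"{major}.{minor}",
        "density_units": units,  # 1 means dots per inch in JFIF
        "x_density": x_density,
        "y_density": y_density,
        "next_marker": MARKER_NAMES.get(data[offset + 1], "unknown"),
    }


info = parse_jfif_prefix(bytes.fromhex(HEX_PREFIX))
print(info)
```

Without knowledge of the format, the same bytes are just the opaque hex string; with it, they read as SOI, an APP0 segment identifying JFIF 1.2, and the start of an APP13 segment.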

Page 25: Challenges of Digital Preservation

Access to information

Analog book (unmediated access): information, content, language, symbols, HW (paper)
Digital book (technology-mediated access): information, content, formats, bits, SW, HW

Page 26: Challenges of Digital Preservation

Formats are key to digital preservation

[Diagram: the layers information, content, formats, bits, SW, HW, labeled as digital content and supporting technologies]

If the format of our content is unsupported by technology, we can’t access the content’s information!

Page 27: Challenges of Digital Preservation

Dependent on fleeting technology
We are dependent on technology to interpret (render, play, etc.) digital content

No technology sticks around – it all ages and disappears

Eventually all digital content in its original format becomes unusable!

Page 28: Challenges of Digital Preservation

Format obsolescence
Kodak PhotoCD

◦ Used by libraries in the 1990s and into the 2000s as a preservation format

◦Best decoders were from Kodak and are no longer supported

◦Very few software decoders remaining – soon images in this format will be unusable

◦Harvard’s Digital Repository Service has 7,243 of these

Page 29: Challenges of Digital Preservation

Two sub-problems
Keep the bits safe
Keep the information usable as technology changes

Page 30: Challenges of Digital Preservation

Safe bits
Infrastructure, policies, practices and professional staff to counter risks
◦ High quality storage
◦ Redundancy (multiple copies, multiple locations)
◦ Media refreshing (replacing)
◦ Security and access restrictions
◦ Content recovery
◦ Integrity monitoring (check for corruption)
…

Page 31: Challenges of Digital Preservation

Integrity monitoring
Message digests: effectively unique signatures for digital content
◦ Fixed-size bit strings
    6326ec82b3200df4a87fc54356d2cb73
◦ Calculated by cryptographic hash functions, e.g. MD5, SHA1, …
Any change to a file results in a changed message digest (with overwhelming probability)
Useful for detecting corruption
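A minimal sketch of digest-based integrity checking, using Python's standard `hashlib`. The sample content and the helper names `digest` and `is_intact` are made up for illustration; a real repository would read files from storage and keep the recorded digests in its metadata.

```python
import hashlib


def digest(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex message digest of data for the named hash algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def is_intact(data: bytes, recorded_digest: str, algorithm: str = "md5") -> bool:
    """Compare the current digest with the one recorded when the content was ingested."""
    return digest(data, algorithm) == recorded_digest


original = b"Challenges of Digital Preservation"
recorded = digest(original)   # stored at ingest time

damaged = bytearray(original)
damaged[0] ^= 0x01            # simulate corruption: flip a single bit

print(recorded, len(recorded))             # 32 hex characters = 128 bits for MD5
print(is_intact(original, recorded))       # unchanged content still matches
print(is_intact(bytes(damaged), recorded)) # one flipped bit changes the digest
```

MD5 and SHA-1 are adequate for spotting accidental corruption, but neither is collision resistant against deliberate tampering; passing `"sha256"` as the algorithm is a common alternative.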

Page 32: Challenges of Digital Preservation

Usable information
People have to be able to find it
People must be able to manage it
Document what’s important (description, context, ownership, processing history)
Know what you are preserving (formats)
…

Page 33: Challenges of Digital Preservation

A TIFF is a TIFF?
TIFF 4.0
TIFF 5.0
TIFF 6.0
TIFF 6.0 extension YCbCr (Class Y)
TIFF/IT (ISO 12639:2003)
TIFF/EP (ISO 12234-2:2001)
RichTIFF
EXIF 2.0
EXIF 2.1 (JEIDA-49-1998)
EXIF 2.2 (JEITA CP-3451)
GeoTIFF 1.0
TIFF-FX (RFC 2301)
Class F (RFC 2306)
RFC 1314
Canon RAW (.crw, .cr2, .tif)
Nikon RAW (.nef)
DNG (Adobe Digital Negative)

Page 34: Challenges of Digital Preservation

Identifying formats
Techniques: “magic numbers”, full parse
Few tools
◦ Support limited number of formats
◦ Accuracy varies
Some improvements
◦ File Information Tool Set (FITS) fits.google.code
◦ NARA-sponsored research
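The “magic numbers” technique can be sketched in a few lines. This is an illustrative lookup over a handful of well-known signatures, not how FITS or any particular tool works; the table `MAGIC_NUMBERS` and the function `identify_format` are hypothetical names.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify_format(data: bytes) -> str:
    """Guess a format from the first bytes of a file; 'unknown' if nothing matches."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return "unknown"


print(identify_format(bytes.fromhex("ffd8ffe000104a464946")))
print(identify_format(b"%PDF-1.4\n"))
print(identify_format(b"plain text"))
```

A signature match is cheap but coarse: several of the TIFF-based variants listed on the previous slide (for example DNG and Canon .cr2) start with the same TIFF header, so telling them apart requires parsing further into the file.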

Page 35: Challenges of Digital Preservation

Usable information
Make sure there’s technology to support the formats! (technology watch)
Preservation strategies
◦ Technology preservation
◦ Creation of viewing software
◦ Emulation & variations:
    Universal Virtual Machine
    Universal Virtual Computer
◦ Format normalization
◦ Format migrations
…

Page 36: Challenges of Digital Preservation

Key format migration considerations
What can’t be lost in the transformation? “Significant properties”
◦ E.g. color, embedded metadata, resolution, ICC profiles, interaction, attachments, fonts, links
◦ How important is each of these properties? (weighted criteria)
To what format? “Preservable” formats
What else must be changed? Ex: Links
How many versions to keep?
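One way to read “weighted criteria” is as a weighted score per candidate target format. The sketch below assumes that reading; every weight, score and format name in it is hypothetical and chosen only to show the arithmetic.

```python
# Hypothetical importance weights for a few significant properties.
WEIGHTS = {
    "color": 0.4,
    "resolution": 0.3,
    "embedded metadata": 0.2,
    "ICC profiles": 0.1,
}

# Hypothetical scores in [0, 1]: how well each candidate format keeps each property.
CANDIDATES = {
    "format A": {"color": 1.0, "resolution": 1.0, "embedded metadata": 0.5, "ICC profiles": 0.0},
    "format B": {"color": 0.5, "resolution": 0.5, "embedded metadata": 1.0, "ICC profiles": 1.0},
}


def weighted_score(weights: dict, scores: dict) -> float:
    """Weighted average of property scores; a missing property counts as 0."""
    total_weight = sum(weights.values())
    return sum(w * scores.get(prop, 0.0) for prop, w in weights.items()) / total_weight


def rank(weights: dict, candidates: dict) -> list:
    """Return (format name, score) pairs, best score first."""
    scored = [(name, weighted_score(weights, s)) for name, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank(WEIGHTS, CANDIDATES):
    print(f"{name}: {score:.2f}")
```

Changing the weights changes the ranking, which is the point of making the importance of each property explicit before choosing a migration target.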

Page 37: Challenges of Digital Preservation

Preservation lifecycle: a series of hand-offs
Create or acquire digital content
Ingest into a preservation repository

◦ Continuous cycle of: Monitoring, Planning, Intervention

◦Subject to collection management decisions

Transfer to next generation of the repository or to a different repository

Page 38: Challenges of Digital Preservation

Ongoing commitment
Requires a continual, proactive program
◦ You can’t just start and stop
◦ Time frames are MUCH shorter than for preservation of analog material
Requires ongoing investment in infrastructure and staff

Page 39: Challenges of Digital Preservation

Can’t do it alone
Digital preservation activities must be shared across institutions

Even collectively we don’t have adequate resources or understanding

Page 40: Challenges of Digital Preservation

Preservation community
Collaborative organizations (NDSA, IIPC, OPF)
Collaborative projects
Standards and best practices
Shared infrastructure and tools
◦ Formats registry
◦ Repository software
◦ Preservation planning tools
◦ Format tools

Page 41: Challenges of Digital Preservation