CS 502: Computing Methods for Digital Libraries

Lecture 27

Preservation

Administration

Online survey

http://create.hci.cornell.edu/cssurvey.cfm

Course evaluations

at end of class today

Long-term preservation

Objective

Retain digital library materials over centuries

Longer than ...

• computer architectures (Wintel, Linux, 390, ...)

• magnetic storage (disks, tapes, ...)

• formats, protocols, applications (Unicode, Java, XML, ...)

• Internet or the web

for purposes that we have not yet considered

Levels of preservation

• Preserve full look and feel of digital material in its context

e.g., A video game with its hardware

• Preserve content with an access system but migrate the look and feel to new environments

e.g., successive versions of MS Windows

• Preserve raw content but no software system

e.g., UTF-8 text with XML/XSL mark-up, but no XML/XSL software

The complexity of preservation varies greatly with the level.

Challenges: user needs

Digital information differs from print

May be useless without its environment.

Creator and subscriber may not have copies.

Numerous versions.

Example: A scientific journal on-line

If the author does not subscribe - no access to own article.

If the library does not renew subscription - no access to anything.

Challenges: technical problems

Technical issues

Storage media have short life-span.

Formats and specifications change continually.

Computing environments are very complex.

Example: personal files

I have retained all my personal computer files since 1984, but have great difficulty in reading some of them.

Challenges: economic and legal

Legal

Archives require permission to save information.

Institutions:

Library of Congress, National Archives, etc. do not provide the same services for electronic information that they provide for physical artifacts.

Example: discontinued serials

What happens if a journal publisher goes bankrupt, or a scientific archive does not get its grant renewed?

Technical approaches: 1. Persistent storage

Material                  Approximate life (years)
Acid-free paper           500+
Microfilm                 300
Optical disks             100?
Color film                25-50
CDs                       20?
Magnetic disk and tape    5

• Persistent storage preserves raw content only

• Research in high-volume, long-term digital media is lacking
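
The lifetimes in the table imply how many generations of a medium are needed to cover a preservation period. The following is a minimal sketch of that arithmetic; the function name `generations_needed` is hypothetical, and the lifetimes passed to it are the approximate figures from the table, not measured values.

```python
def generations_needed(target_years: int, media_life_years: int) -> int:
    """Number of successive media generations needed to span target_years.

    Assumes each generation lasts exactly media_life_years and is replaced
    just as it wears out (an idealization; real media fail unpredictably).
    """
    if target_years <= 0 or media_life_years <= 0:
        raise ValueError("both arguments must be positive")
    # Ceiling division using integers only.
    return -(-target_years // media_life_years)
```

For example, `generations_needed(500, 5)` is 100: matching the 500 years of acid-free paper with magnetic media would take about a hundred generations of disks or tapes.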

Technical approaches: 2. Copying bits (refreshing)

Refreshing bits

Repeatedly copy bits from one storage medium to the next.

• A standard technique in data processing.

• Benefits from the rapid fall in prices of storage devices.

• Preserves raw content only.

Requires active management

Mirrors

Have many copies of the same information with independent management.
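
Refreshing can be sketched as a copy followed by a check that the new copy holds the same bits. This is only an illustration, assuming both media are mounted as ordinary file systems; the function names `sha256_of` and `refresh` are hypothetical, and the choice of SHA-256 is an assumption rather than anything prescribed in the lecture.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(src: Path, dst: Path) -> str:
    """Copy src to dst (the newer medium) and verify that the bits match.

    Returns the digest of the verified copy; raises OSError on a mismatch.
    """
    shutil.copyfile(src, dst)
    before, after = sha256_of(src), sha256_of(dst)
    if before != after:
        raise OSError(f"refresh of {src} failed verification")
    return after
```

The same digest comparison could be run across independently managed mirrors to detect a copy that has drifted from the others.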

Technical approaches: 3. Migration of content

Migration

• Retain content but change formats and representations to keep current with technology

• Used by journal publishers

• Preserves content and an access system

Example. Pension funds

The Social Security Administration has records of every FICA payment, which migrate between systems over many years.
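
A very small instance of migration is re-encoding text from a legacy character encoding into a current one: the content is kept while the representation changes. The sketch below assumes the old material is Latin-1 text and the target is UTF-8; the function name `migrate_text` is hypothetical, and real migrations (databases, document formats) are far more involved.

```python
def migrate_text(raw: bytes, old_encoding: str = "latin-1") -> bytes:
    """Re-encode text from a legacy encoding to UTF-8.

    The characters (the content) are unchanged; only the byte
    representation is brought up to date.
    """
    text = raw.decode(old_encoding)  # interpret the bytes in the old format
    return text.encode("utf-8")      # write them out in the new format
```

Note that the migrated bytes differ from the original bytes even though the text is the same, which is why a migration should be recorded in the object's history.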

Technical approaches: 4. Emulation

Concept

• Record a full specification of the computing environment in which the digital information was created

• At a time in the future, emulate the original computing environment

• Would preserve full look and feel

Clearly not practical for complex computing systems

• Emulation is never perfect

• Computing environments are remarkably complex

But may be useful for parts of systems

e.g., Java virtual machine

Technical approaches: 5. Digital archeology

After periods of neglect, archeologists are needed

• Recover data from old media

• Reverse engineer lost formats and specifications

• Experts in digital paleography (reading archaic scripts and formats)

Example. East Germany

German archivists are reconstructing the records of the East German state from worn-out tapes, broken computer systems, undocumented databases, and the recollections of staff.

Preservation at publication

This is a period of experimentation and change in formats, protocols, object models, etc.

Some information is easier to preserve than others.

Longevity is more likely if:

Formats are widely used, in important applications.

Methods are simple, without using obscure options.

Coding schemes are easy to interpret.

Example. Internet RFC Series

The Internet RFC series uses plain ASCII text. The RFCs go back to 1969 and have no preservation problems. A few RFCs are in PostScript and are already hard to decipher.

Metadata

Digital information needs interpretation

• Self-documentation is always good

• Persistent identification is vital

• Simple, standard metadata has a chance of a long life

• Authentication of material need not be complex (e.g., hash)

• History of changes (e.g., migration to different format)
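
The points above can be sketched as a small self-describing record that carries an identifier, a format name, a hash for authentication, and a history of changes. The field names, the helper names (`describe`, `record_migration`, `is_authentic`), and the use of SHA-256 and JSON are all assumptions made for illustration, not a standard metadata scheme.

```python
import hashlib
import json


def describe(identifier: str, content: bytes, fmt: str) -> dict:
    """Build a minimal metadata record for one digital object."""
    return {
        "identifier": identifier,  # persistent identifier (hypothetical form)
        "format": fmt,
        "sha256": hashlib.sha256(content).hexdigest(),
        "history": [],
    }


def is_authentic(record: dict, content: bytes) -> bool:
    """Check the content against the hash stored in the record."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]


def record_migration(record: dict, date: str, new_fmt: str,
                     new_content: bytes) -> None:
    """Log a migration, keeping the old format and hash in the history."""
    record["history"].append({
        "date": date,
        "from_format": record["format"],
        "from_sha256": record["sha256"],
    })
    record["format"] = new_fmt
    record["sha256"] = hashlib.sha256(new_content).hexdigest()


def to_text(record: dict) -> str:
    """Serialize the record as plain, human-readable JSON text."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Storing the record as plain text keeps the metadata itself easy to interpret, in the same spirit as the simple coding schemes recommended earlier.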

Preservation of specifications

Digital information needs a context

Therefore store the specifications of:

• Formats

• Database designs

• Technical documentation

• User manuals

...on high-quality archival materials, e.g., paper.

Final word

Long-term preservation needs people

and organizations who want it!