Minerva

The Web Preservation Project


Team Members

Library of Congress

Roger Adkins, Cassy Ammen, Allene Hayes, Melissa Levine, Diane Kresh, Jane Mandelbaum, Barbara Tillett

Cornell University

William Arms

Internet Archive

Brewster Kahle, Scott Kirkpatrick

Main Reading Room


1. Open Access Materials on the Web


Approaches to Collecting and Preservation of the Web

CLOSED ACCESS

• Partnership with publishers: publishers and libraries as partners

OPEN ACCESS

• Selective collection of open access web: librarianship in a new domain

• Bulk collection of open access web: automated processes


Web Preservation Project Pilot

• Small number of web sites nominated by selection officers. Three chosen for close study.

http://www.whitehouse.gov/
http://www.algore2000.com/
http://www.georgewbush.com/

• Copies downloaded using HTTrack mirroring program. Inspected for errors, anomalies, etc.

• Catalog records created using OCLC's CORC software. Loaded into the Library of Congress's ILS.

• Trial web site developed to evaluate user access.

• Discussions with Copyright Office on legal issues.


Example: The Internet Archive


Example: National Library of Australia


Example: National Library of Sweden


2. Selection and Collection


Collecting: Making a Snapshot

[Diagram: Web site → Download → Snapshot → Archive]

A web site is downloaded, using a mirroring program. A snapshot is stored in an archive.


Collecting: Periodic Snapshots

At selected time intervals additional snapshots are made.

[Diagram: Web site → Archive holding Snapshot 1, Snapshot 2, Snapshot 3]
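The periodic snapshot scheme can be sketched in a few lines of Python. This is only an illustration, not the pilot's code: the helper names, the `archive/<site>/<timestamp>/` layout, the timestamp format, and the demo site and dates are all assumptions. The mirrored files are assumed to exist already in a local directory, for example as written by a mirroring program.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def store_snapshot(mirror_dir: Path, archive_root: Path, site: str,
                   taken_at: datetime) -> Path:
    """Copy a mirrored site into archive_root/<site>/<UTC timestamp>/."""
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = archive_root / site / stamp
    shutil.copytree(mirror_dir, target)  # raises if this snapshot already exists
    return target


def list_snapshots(archive_root: Path, site: str) -> list[str]:
    """Return the snapshot timestamps held for a site, oldest first."""
    site_dir = archive_root / site
    return sorted(p.name for p in site_dir.iterdir() if p.is_dir())


# Hypothetical demo: two snapshots of the same tiny "site".
work = Path(tempfile.mkdtemp())
mirror = work / "mirror"
mirror.mkdir()
(mirror / "index.html").write_text("<html>version 1</html>")
archive = work / "archive"

store_snapshot(mirror, archive, "example.org",
               datetime(2000, 9, 1, tzinfo=timezone.utc))
(mirror / "index.html").write_text("<html>version 2</html>")
store_snapshot(mirror, archive, "example.org",
               datetime(2000, 10, 1, tzinfo=timezone.utc))
snapshots = list_snapshots(archive, "example.org")
```

Each snapshot is kept whole and never overwritten, so earlier versions of a page remain available after the live site changes.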


Very Rough Estimates

There are no good estimates of how many Web sites the Library of Congress would wish to collect and preserve.

OCLC's Web Characterization Project (February 2000)

Public web sites: 2,900,000
Annual increase: 700,000

If the Library of Congress collects 1%

Total number of sites: 30,000
Annual number new and changed: 15,000

But these numbers are very rough estimates (guesses)!
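The 1% figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted on this slide; the variable names are illustrative. One per cent of the annual increase gives about 7,000 new sites, so the 15,000 figure must also cover existing sites that change. The slide does not give that split, so the remainder computed here is an inference, not a sourced number.

```python
public_sites = 2_900_000     # OCLC Web Characterization Project, February 2000
annual_increase = 700_000
collected_percent = 1        # "If the Library of Congress collects 1%"

sites_collected = public_sites * collected_percent // 100        # 29,000, quoted as 30,000
new_sites_per_year = annual_increase * collected_percent // 100  # 7,000

# The slide's estimate of new and changed sites per year. What is left after
# subtracting new sites is an inferred count of changed sites (an assumption).
new_and_changed = 15_000
implied_changed = new_and_changed - new_sites_per_year

print(sites_collected, new_sites_per_year, implied_changed)
```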


Selection Decisions

Which sites to collect?

• Bulk -- collect all within a certain category
• Selective -- collect sites selected by a librarian

How often to make snapshots?

• Monthly, weekly, or depending on circumstances

Which content to collect?

• HTML pages only
• Text and images only
• Everything


Examples of Selection Decisions

Project            Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Minerva            selective   irregular   all


Selection Decisions: Recommendations

The Library needs a mixed strategy:

1. Selective collection, for known important sites

2. Bulk selection for selected categories (e.g., .gov sites)

3. Bulk collection without selection for other materials
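The mixed strategy amounts to a three-way decision for each site. A minimal sketch follows, assuming a hand-maintained set of selected hosts and a list of bulk categories expressed as domain suffixes. Both collections, the function name, and the mode labels are hypothetical; the slides name only ".gov" as an example category.

```python
from urllib.parse import urlparse

# Hypothetical inputs: the slides give no actual lists.
SELECTED_SITES = {"www.whitehouse.gov"}  # chosen individually by a librarian
BULK_SUFFIXES = (".gov",)                # whole categories collected in bulk


def collection_mode(url: str) -> str:
    """Return which of the three recommended strategies applies to a URL."""
    host = (urlparse(url).hostname or "").lower()
    if host in SELECTED_SITES:
        return "selective"
    if host.endswith(BULK_SUFFIXES):
        return "bulk-category"
    return "bulk-unselected"
```

Checking the selected list first lets a librarian's choice take precedence when a site also falls inside a bulk category.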


3. Use of the Collections for Scholarship and Research


Analysis by Computer

Computer programs can be used to analyze the snapshot files.

[Diagram: Archive (Snapshot 1, Snapshot 2, Snapshot 3) → Analysis by computer]
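One simple kind of computer analysis is to compare two snapshots of the same site and report what changed. The sketch below is an illustration under assumptions, not a tool used by the project: it treats each snapshot as a directory of files and compares SHA-256 digests. The function names and demo files are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digests(snapshot_dir: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    return {
        p.relative_to(snapshot_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(snapshot_dir.rglob("*")) if p.is_file()
    }


def compare_snapshots(old: Path, new: Path) -> dict[str, list[str]]:
    """List files added, removed, or changed between two snapshots."""
    before, after = file_digests(old), file_digests(new)
    common = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in common if before[k] != after[k]),
    }


# Hypothetical demo snapshots.
work = Path(tempfile.mkdtemp())
snap1, snap2 = work / "snapshot1", work / "snapshot2"
snap1.mkdir()
snap2.mkdir()
(snap1 / "index.html").write_text("old home page")
(snap1 / "press.html").write_text("press release")
(snap2 / "index.html").write_text("new home page")
(snap2 / "speech.html").write_text("speech")
report = compare_snapshots(snap1, snap2)
```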


Analysis by Patron

People can study an access version of a site.

[Diagram: Web site → Archive (Snapshot 1, Snapshot 2, Snapshot 3) → Access 1, Access 2, Access 3 → Analysis by patron]


Access Decisions

Style of access

• Analysis of snapshot files by computer
• Analysis of access version by patron

Editing

• No editing (use snapshot files)
• Minimal editing to make access version
• Fuller editing to maintain experience
• Automatic or by hand

Policy

• Who has access to the collections?


Examples of Access Decisions

Project            Style        Editing
Internet Archive   computer     no
Pandora            researcher   yes
Minerva            researcher   yes


Recommendations about the Use of the Collections for Scholarship and Research

The Library should support the use of the collection in a variety of ways.

1. Computer analysis of snapshot files

2. Automated editing to create access versions of all selected sites, without human checking.

3. Human editing of a few, very important sites.


4. Information Discovery


Options for Information Discovery

Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.

Options

• List of sites (e.g., Internet Archive)

Access by URL + date

• Automatic index (e.g., Web search engines)

• Catalog (e.g., MARC or Dublin Core)

Catalog record for individual site or group of sites
Access through Library catalog
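The simplest of these options, access by URL + date, can be sketched as a lookup table that maps each URL to a sorted list of snapshot dates. This is an assumed data structure for illustration only, not a description of how the Internet Archive or the Library implements it; the function names and the snapshot dates are hypothetical.

```python
from bisect import bisect_right, insort
from datetime import date
from typing import Optional

# URL -> snapshot dates, kept in ascending order.
snapshot_index: dict[str, list[date]] = {}


def add_snapshot(url: str, taken: date) -> None:
    """Record that a snapshot of url was taken on the given date."""
    insort(snapshot_index.setdefault(url, []), taken)


def snapshot_as_of(url: str, wanted: date) -> Optional[date]:
    """Return the latest snapshot date on or before wanted, if any."""
    dates = snapshot_index.get(url, [])
    pos = bisect_right(dates, wanted)
    return dates[pos - 1] if pos else None


# Hypothetical snapshot dates for one of the pilot sites.
for d in (date(2000, 9, 1), date(2000, 11, 8), date(2000, 10, 1)):
    add_snapshot("http://www.whitehouse.gov/", d)
```

A patron asking for the site as of a given day gets the most recent snapshot taken on or before that day. A list like this needs no cataloguing effort, but it helps only users who already know the URL.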


Information Discovery: Web Preservation Project

Procedure

• MARC catalog records created using OCLC's CORC system.
• Loaded into the Library of Congress's ILS.

Observations about procedure

• Cataloguing effort similar to other electronic files.
• Some similarities to serials.
• No significant workflow difficulties.


Cataloguing Observations

• Detailed information is continually changing.

• Difficulty in selecting title (HTML <title> is often poor).

• Problems with identifiers (multiple, changing URLs).

• Collection level records suitable for special events.

It is difficult to evaluate cataloguing strategy because of lack of knowledge of user needs.


Recommendations about Information Discovery

1. The Library should experiment with various approaches to indexing and cataloguing Web sites, including automated indexing, Dublin Core and MARC cataloguing.

2. The Library will probably not be able to afford individual catalog records for all Web sites that are collected.


5. Storage and Preservation


Workflow

[Diagram labels: Web site, Web Crawler, snapshot, Accession, Process, Control, Archive, Catalog, External Access, Analysis by patron, Analysis by computer]


Preservation Objective

The objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.

What is preserved?

• Preservation of bits

• Preservation of content

• Preservation of experience

How is it used?

• Analysis by computer program

• Viewed by human researcher


Process of Preservation

[Diagram: Version 1 (Time 0) → Version 2 (Time 1) → Version 3 (Time 2)]

This process may be applied to either the snapshot or the access version.


Storage Decisions: Identification

Identification of Web site

• URL, but Web sites may change their URL
• URN (e.g., Handle or PURL)

Identification and provenance of versions

• Web site identifier
• Collection information (date, time, etc.)
• History of changes

Recommendations

1. Assign URN (e.g., Handle) to each Web site.

2. Store provenance metadata with every file.
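One way to keep provenance metadata with every file is a small sidecar record written next to it. The sketch below is an assumption about form, not the project's format: the field names, the JSON sidecar convention, and the identifier value are hypothetical, and the identifier shown is a placeholder rather than a real Handle.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Provenance:
    site_id: str       # persistent identifier for the site (placeholder below)
    source_url: str    # URL the file was collected from
    collected_at: str  # collection date and time, ISO 8601
    history: list[str] = field(default_factory=list)  # later changes to the file


def write_provenance(file_path: Path, record: Provenance) -> Path:
    """Write the record as <file name>.provenance.json beside the file."""
    sidecar = file_path.with_name(file_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(asdict(record), indent=2))
    return sidecar


# Hypothetical demo values.
work = Path(tempfile.mkdtemp())
page = work / "index.html"
page.write_text("<html></html>")
record = Provenance("hdl:0000/example-site", "http://www.whitehouse.gov/",
                    "2000-09-01T00:00:00Z")
record.history.append("2000-09-01 collected, unedited")
sidecar = write_provenance(page, record)
```

Because the record carries the persistent identifier as well as the URL, the file can still be traced to its site after the site's URL changes.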


Preservation Recommendations

1. Keep the unedited snapshot files by repeated refreshing.

2. Use automated migration of individual files as the basic technique for keeping Web sites (more or less) functional at moderate cost.

3. Use manual editing for a small number of particularly important sites.

In general, it is not possible to maintain the experience of using Web sites as technology changes, even with expensive editing.
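Refreshing, in the sense of recommendation 1, means copying the unedited bits to new storage and confirming that nothing changed on the way. A minimal sketch, assuming ordinary files and a SHA-256 checksum as the fixity check; the function names and paths are hypothetical and the slides do not specify a mechanism.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the bits are identical."""
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    if sha256_of(destination) != expected:
        raise IOError(f"refresh of {source} produced a corrupted copy")
    return expected


# Hypothetical demo: refresh one snapshot file onto "new media".
work = Path(tempfile.mkdtemp())
old_copy = work / "old_media" / "index.html"
old_copy.parent.mkdir()
old_copy.write_bytes(b"<html>snapshot</html>")
new_copy = work / "new_media" / "index.html"
digest = refresh(old_copy, new_copy)
```

Refreshing preserves bits only. Keeping the content usable as formats age is the job of migration, covered by recommendation 2.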


6. General Recommendations


General Recommendations

1. Collection and preservation of Web materials should be seen as a single program.

2. The program needs a full-time team of librarians and technical staff.

3. Some aspects can be subcontracted to specialists (e.g., the Web crawler), but the leadership must come from the Library.

4. The Library should seek partnerships with other libraries and archives.

5. Most processes will be automatic, with skilled attention given to a small number of particularly important sites.


Demonstration of Pilot System