Problems and Issues in Selecting, Harvesting, and Cataloging Web Resources

Joanne Archer and John Schalow

University of Maryland Libraries

Jargon

Crawler

Web Harvesting

Seed

Harvest

Crawl

Wayback Machine

Options for Web Harvesting

In-House Program

• e.g. Pandora, Web Curator Tool

• Pro: flexibility

• Con: $$$

Off-the-Shelf Software

• e.g. HTTrack, Adobe Web Capture

• Pro: inexpensive

• Con: not scalable

Third-Party Subscription

• e.g. Web Archiving Service, Archive-It

• Pro: ease of use

• Con: $

Key Questions for Harvesting Projects

• uniqueness

• ephemerality

• research value

• harvest frequency

• scope

Maryland’s Pilot Harvests (2008-2010)

• Historic Preservation

• Maryland State Documents

Why harvest these areas?

• Collections are unique

• Builds on existing strengths in print collections

• Large amount of material migrating to the web

Harvesting

Harvesting Challenges:

• JavaScript

• Streaming media

• Form- and database-driven content

• Password-protected sites

• robots.txt files

• Multiple hosts/subdomains
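
To illustrate the robots.txt challenge: a crawler that honors robots.txt will skip any path the site owner has disallowed, so that content never reaches the archive. The sketch below uses Python's standard urllib.robotparser. The robots.txt content, the crawler name, and the example.org URLs are hypothetical and stand in for a real site; this is not the tooling used in the pilot harvests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "example-crawler" and the URLs below are placeholder names.
    for url in (
        "http://www.example.org/index.html",
        "http://www.example.org/private/report.pdf",
    ):
        verdict = allowed_by_robots(SAMPLE_ROBOTS_TXT, "example-crawler", url)
        print(url, "->", "harvest" if verdict else "skipped")
```

Against a live site, the parser's set_url() and read() methods would fetch the real robots.txt instead of the sample text.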

Single host = www.preservemd.org

Multiple hosts = www.umd.edu, www.lib.umd.edu
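
The single-host versus multiple-host distinction can be sketched as a scope check that a crawler might apply to every link it discovers. This is a minimal illustration under stated assumptions: the function name in_scope, its include_subdomains flag, and the URL paths are hypothetical, and real harvesting tools express scope rules in their own configuration.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed_host: str, include_subdomains: bool = False) -> bool:
    """Decide whether url belongs to the harvest defined by seed_host.

    With include_subdomains=False only an exact host match is in scope.
    With include_subdomains=True any host ending in "." + seed_host also counts.
    """
    host = (urlsplit(url).hostname or "").lower()
    seed = seed_host.lower()
    if host == seed:
        return True
    return include_subdomains and host.endswith("." + seed)


if __name__ == "__main__":
    # Single host: everything lives under one hostname.
    print(in_scope("http://www.preservemd.org/about", "www.preservemd.org"))

    # Multiple hosts: a crawl scoped to www.umd.edu alone misses the library site.
    print(in_scope("http://www.lib.umd.edu/special", "www.umd.edu"))

    # Scoping to the parent domain with subdomains enabled captures both hosts.
    print(in_scope("http://www.lib.umd.edu/special", "umd.edu", include_subdomains=True))
```

Widening the scope to a parent domain captures more related content but can also pull in far more material than the collection intends, which is why scope appears among the key questions.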

End-User Access

collection note

subject heading

general material designation

URLs

uniform title

Conclusions

Challenges

• Start-up costs

• What to collect

• Metadata creation

But we are well prepared to meet the challenges.

Questions?

• Joanne Archer: jarcher@umd.edu

• John Schalow: schalow@umd.edu
