Allan.arvidson@kb.se krister.persson@kb.se 1999-06-04 Kulturarw³ The Swedish WWW Archive Eller, att...


allan.arvidson@kb.se, krister.persson@kb.se

1999-06-04

Kulturarw³

The Swedish WWW Archive

Or, capturing the World Wide Web

http://kulturarw3.kb.se


Goals

• All WWW and Gopher pages in Sweden

– pictures, video, etc.

– .se and generic TLDs

– Suecana

• All articles in electronic journals

• All Swedish newsgroups / mailing lists

Limitations versus RA and ALB


Organisation

• Project group: two persons

• Steering group: four persons

• Reference group: representatives from ALB, RA, Lund University, SUNET, etc.

• International cooperation

– NWA (Nordic Web Archive)

– National libraries


Strategy

Selection?

• How to know what is important?

• Labour-intensive

Collect everything using automatic software

• Gets everything

• Less labour-intensive

• Computer memory is cheap


Strategy

• Take snapshots of the Swedish web a few times a year.

• In the future, collect newspapers every day, other sites every month, etc.

What is Sweden?

• .se

• .com, .org and .net with Swedish address/telephone number

• Swedish .nu (Niue)

• Suecana
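The scoping rules above can be read as a simple decision procedure. The following is a minimal sketch in Python, not the project's actual selection code. The function name `in_scope` and the two flags are hypothetical: `has_swedish_contact` stands in for the address/telephone check (assumed here to apply to .nu as well), and `is_suecana` stands in for however Suecana is identified, neither of which the slides describe.

```python
from urllib.parse import urlsplit

GENERIC_TLDS = {"com", "org", "net"}  # generic TLDs named on the slide


def in_scope(url, has_swedish_contact=False, is_suecana=False):
    """Return True if a URL belongs to the 'Swedish web' as scoped above.

    has_swedish_contact and is_suecana are hypothetical inputs: the slides
    do not say how a Swedish address/telephone number or Suecana is detected.
    """
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld == "se":
        return True  # everything under .se
    if (tld in GENERIC_TLDS or tld == "nu") and has_swedish_contact:
        return True  # generic TLD or .nu with a Swedish registrant
    return is_suecana  # Suecana, identified by other means


print(in_scope("http://kulturarw3.kb.se"))  # True
```

Keeping the registrant check as an input separates the easy part (reading the TLD) from the hard part (deciding whether a .com site is Swedish).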


Robot, Software

• Modified version of Nordic Web Index’s robot software (NetLab, Lund University)

• Important: indexing is not archiving!

• Save data in MIME format

• Temporary storage, media: DLT tape

– Data rate: 5 MB/s

– Capacity: 20 GB

– Durability: 1,000,000 passes

– Data integrity: error detection and correction

– Low cost per GB stored

– Access time too long (?)
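To illustrate what saving a fetched page "in MIME format" can look like, here is a minimal sketch using Python's standard `email` package. The function name `to_mime_record` and the headers `X-Archive-URL` and `X-Archive-Date` are assumptions made for illustration; the slides do not describe the robot's actual record layout.

```python
from email.message import EmailMessage
from email.utils import formatdate


def to_mime_record(url, content_type, body, fetched_at):
    """Wrap one fetched document (bytes) in a MIME message.

    The X-Archive-* headers are hypothetical, chosen only for this sketch.
    fetched_at is a POSIX timestamp in seconds.
    """
    maintype, subtype = content_type.split("/", 1)
    msg = EmailMessage()
    msg["X-Archive-URL"] = url
    msg["X-Archive-Date"] = formatdate(fetched_at, usegmt=True)
    # Bytes payloads are base64-encoded, so binary data survives intact.
    msg.set_content(body, maintype=maintype, subtype=subtype)
    return msg.as_bytes()


record = to_mime_record(
    "http://kulturarw3.kb.se/", "text/html", b"<html>hej</html>", 928454400
)
print(record.decode("ascii"))
```

Storing the original content type and body together is what distinguishes an archive record from an index entry: the document can later be served back as it was received.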


Statistics

• 15 million URLs (including duplicates)

• 240 GB

• 54,000 sites

– 32,300 .se

– 14,800 .com, .org, .net and .edu

– ~100 Suecana

– 6,800 .nu (Niue)

• Compare with legal deposit

– Printed material: 1.7 km/year

– The web: approx. 50 km for the Swedish web
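Combining these figures with the DLT tape figures quoted earlier (20 GB per tape, 5 MB/s) gives a rough sense of scale. This is a back-of-the-envelope sketch that assumes decimal units (1 GB = 1000 MB), no compression, and a sustained data rate, none of which the slides state.

```python
import math

ARCHIVE_GB = 240        # size of the collected material (from the slide)
TAPE_CAPACITY_GB = 20   # DLT capacity (from the slide)
TAPE_RATE_MB_S = 5      # DLT data rate (from the slide)
URLS = 15_000_000       # 15 million URLs, including duplicates

tapes_needed = math.ceil(ARCHIVE_GB / TAPE_CAPACITY_GB)
hours_to_stream = ARCHIVE_GB * 1000 / TAPE_RATE_MB_S / 3600
avg_kb_per_url = ARCHIVE_GB * 1_000_000 / URLS

print(tapes_needed)               # 12
print(round(hours_to_stream, 1))  # 13.3
print(round(avg_kb_per_url))      # 16
```

Under these assumptions one sweep fits on about a dozen tapes, and reading it back end to end takes over half a day on a single drive, which is consistent with the concern about access time.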


Statistics

• 363 different MIME types found; many are variants of the same type, some are garbage.

• 7.0 M text/html

• 4.2 M image/gif

• 3.0 M image/jpeg

• 0.3 M text/plain

• 0.5 M others

• text/html, image/gif, image/jpeg and text/plain together make up 97% of the documents.
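The 97% figure follows directly from the counts listed above, as this short check shows. The variable names are illustrative; the numbers are the ones on the slide.

```python
# Document counts in millions, as listed on the slide.
counts_millions = {
    "text/html": 7.0,
    "image/gif": 4.2,
    "image/jpeg": 3.0,
    "text/plain": 0.3,
    "others": 0.5,
}

total = sum(counts_millions.values())
top_four = total - counts_millions["others"]
share = top_four / total

print(round(total, 1))  # 15.0
print(f"{share:.0%}")   # 97%
```

The total of 15 million documents also matches the 15 million URLs reported on the previous slide.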


The Archive

• Goals

– Create copies of the Swedish web at several points in time (compare index services)

– Surf the web in space and time

– Search

– Accessible in the future (migration)
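"Surfing the web in space and time" implies that, for a given URL and date, the archive serves the snapshot that was current on that date. Below is a minimal sketch of such a lookup over an in-memory index. The names `SnapshotIndex`, `add` and `lookup`, the example URL, and the dates are all hypothetical; the slides do not describe the archive's actual access mechanism.

```python
from bisect import bisect_right, insort
from collections import defaultdict
from datetime import date


class SnapshotIndex:
    """Maps each URL to the sorted dates on which it was captured."""

    def __init__(self):
        self._captures = defaultdict(list)

    def add(self, url, captured_on):
        insort(self._captures[url], captured_on)  # keep dates sorted

    def lookup(self, url, as_of):
        """Return the latest capture date on or before as_of, or None."""
        dates = self._captures.get(url, [])
        i = bisect_right(dates, as_of)
        return dates[i - 1] if i else None


index = SnapshotIndex()
index.add("http://example.se/", date(1998, 3, 1))   # illustrative dates
index.add("http://example.se/", date(1998, 11, 1))
print(index.lookup("http://example.se/", date(1998, 6, 15)))  # 1998-03-01
```

Resolving every link on a page against the same date is what would let a reader browse one snapshot consistently rather than jumping between sweeps.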


The Archive

Disk

(Optical disk)

Magnetic tape

HSM (hierarchical storage management): most data is kept on magnetic tape and staged to disk when needed
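The staging idea can be illustrated with a small simulation: everything lives on tape, and a file is copied to a limited disk area the first time it is read. This is an illustration only. The class `HsmStore`, its least-recently-used eviction policy, and the dictionary standing in for tape are assumptions, not a description of the system the project used.

```python
from collections import OrderedDict


class HsmStore:
    """Toy hierarchical storage: tape holds everything, disk is a small cache."""

    def __init__(self, tape, disk_slots):
        self.tape = tape            # dict: name -> bytes (stands in for tape)
        self.disk = OrderedDict()   # staged files, least recently used first
        self.disk_slots = disk_slots
        self.stage_count = 0        # number of (slow) tape reads

    def read(self, name):
        if name in self.disk:
            self.disk.move_to_end(name)    # mark as recently used
            return self.disk[name]
        data = self.tape[name]             # slow path: stage from tape
        self.stage_count += 1
        self.disk[name] = data
        if len(self.disk) > self.disk_slots:
            self.disk.popitem(last=False)  # evict least recently used file
        return data


store = HsmStore({"a": b"1", "b": b"2", "c": b"3"}, disk_slots=2)
for name in ["a", "a", "b", "c", "a"]:
    store.read(name)
print(store.stage_count)  # 4
```

Repeated reads of a staged file cost nothing extra, while a file that has been evicted must be fetched from tape again, which is where long tape access times become visible to the user.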


The Archive

What are we archiving?

– Magnetic tapes?

– Bits and bytes?

– Intellectual content?
