Preserving access: Making more informed “guesses” about what works. Prepared by: Maxine Davis, Collaboration Research Officer. Presented by: David Pearson, Acting Director, Web Archiving & Digital Preservation, National Library of Australia. IIPC Open Day, San Francisco, 7 October 2009.


Page 1: Preserving access

Preserving access: Making more informed “guesses” about what works

Prepared by: Maxine Davis, Collaboration Research Officer
Presented by: David Pearson, Acting Director
Web Archiving & Digital Preservation, National Library of Australia

IIPC Open Day, San Francisco, 7 October 2009

Page 2: Preserving access

Presentation Outline

• The problem
• Case study: PANDORA Web Archive
• Some approaches & options
– Approach 1: Unified Digital Format Registry (UDFR)
– Approach 2: Wikipedia
– Approach 3: Another way: documenting what web archives actually use/d

Page 3: Preserving access

The problem

• The World Wide Web is constantly evolving
– Requires combinations of software/hardware to render web content
– But what is used for creation and access changes
• Web archives
– Contain snapshots of websites taken at different times (different sites or same sites multiple times)
– Lots of files, many file formats, various versions
– Aim for ongoing access

Page 4: Preserving access

Process of version “creep” in the archive

• Mixed accessibility resulting from:
– Different browsers, plug-ins, operating systems in use (then and now)
– Backwards compatibility not guaranteed
– Changes in standards and coding practices (deprecated, dead & non-standard tags)
– Obsolescence of file formats & renderers
• Changes to access paths
– Incremental loss of access not directly obvious
– Alternative access paths not specified

Page 5: Preserving access

Case study: PANDORA Australia’s Web Archive (1)

• Selective archive began collecting 1996
– Sites individually selected by NLA & partners
– As at July 2009 over 70.6 million files
– Accessible over the web using standard web browser
• .au whole domain harvests
– 4 annual harvests 2005-2008 completed, 2009 underway with Internet Archive
– Combined harvests 05-08 ~ 2.3 billion files
– Not currently publicly available

Page 6: Preserving access

Case study: PANDORA Australia’s Web Archive (2)

Page 7: Preserving access

IIPC Preservation Working Group discussions

• Need for documenting the technical environment
• Support required for alternative preservation action strategies
– Emulation of past environments
– Migration to standard formats
– Risk notification
– Recording conversion and alternate access paths
• Exploring different approaches
• Sharing information sensible

Page 8: Preserving access

Technical information of interest

• Browsers + plug-ins/helper applications: versions & dependencies
• Used approximately when?
• Appropriate for which individual/type of file format or whole archive?
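
The three questions above suggest a simple record shape: what software, when it was in use, and what material it applies to. The following is a minimal sketch only; the class and field names (`Component`, `EnvironmentRecord`, `depends_on`, `applies_to`) are hypothetical and are not part of any registry or tool named in this presentation. The example values come from the NLA environment of July 2009, with the day of the month and the plug-in dependency assumed for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    """One piece of access software, e.g. a browser or plug-in."""
    name: str
    version: Optional[str] = None   # None when the version was not recorded
    kind: str = "plug-in"           # "browser", "plug-in" or "helper application"
    depends_on: tuple = ()          # names of components this one needs


@dataclass
class EnvironmentRecord:
    """A dated software suite and the material it is appropriate for."""
    label: str
    used_from: date
    used_until: Optional[date] = None   # None means still in use
    components: List[Component] = field(default_factory=list)
    applies_to: List[str] = field(default_factory=lambda: ["whole archive"])

    def in_use_on(self, day: date) -> bool:
        """Answer "used approximately when?" for a given day."""
        if day < self.used_from:
            return False
        return self.used_until is None or day <= self.used_until


# Illustrative values only: the source says "July 2009", the day is assumed,
# and the browser dependency of the plug-in is an assumption for the example.
nla_2009 = EnvironmentRecord(
    label="NLA web archiving environment",
    used_from=date(2009, 7, 1),
    components=[
        Component("Internet Explorer", "7", kind="browser"),
        Component("Adobe Flash Player", "10", depends_on=("Internet Explorer",)),
    ],
)
```

Keeping `version` and `used_until` optional reflects how much of the older information is uncertain or unrecorded.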

Page 9: Preserving access

Already documented?

• Manufacturer/vendor’s websites
• Developer’s networks, forums, blogs, etc.
• File format registries
• File extension resources
• Software archives/download sites
• Internet history websites
• Internet statistics websites
• Wikipedia

Page 10: Preserving access

Possible Approach 1: UDFR

• Digital format registry will result from proposed merger of PRONOM and GDFR
• Pros
– Considerable intellectual investment already
– Could be used for general digital preservation and potential interaction with other tools
• Cons
– Under development
– Web archive requirements need to be specified, use cases developed, changes to data model, population with relevant data and regular updating
– Temporal aspect not currently catered for
– Entry point: individual file format or software type [could be a pro?]

Page 11: Preserving access

Possible Approach 2: Wikipedia (1)

• Pros
– Existing free, web-based collaborative multilingual project
– Draws together a rich set of information
  • browsers, layout engines, plug-ins & software, statistics, creators, standards, etc.
  • lists, history, comparisons, timelines, links to internal & external references
– Updated by many voluntary contributors

Page 12: Preserving access

Possible Approach 2: Wikipedia (2)

• Cons
– General audience, not specific to web archive requirements or specific web archive
– Amount of detail varies (between different language versions, articles)
– Can be edited by multiple users (+ & -)
– Not designed to interact with other digital preservation tools as UDFR has potential to do

Page 13: Preserving access

Extract example

Page 14: Preserving access

Possible Approach 3: Documenting what web archives are using/used

• Pros
– Time-based software suite approach
– Starting point for
  • Potential UDFR seed list
  • Identifying commonly used software
  • Inferring additional software requirements
  • Identifying alternate access paths
• Cons
– Easier to document current versions
– Obscure/obsolete material in our collections may be unknown
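
One way to picture "inferring additional software requirements" is a lookup from file extensions found in an archive to the software that renders them, with anything unmatched flagged as possibly obscure or obsolete. This is a rough sketch under assumptions: the `RENDERERS` table and the function name `software_needed` are hypothetical, and the table is a small illustrative sample, not a complete or authoritative mapping.

```python
# Illustrative sample only: a real list would come from the archive's own
# documentation or a format registry, not from this hard-coded table.
RENDERERS = {
    ".pdf": "Adobe Reader",
    ".swf": "Adobe Flash Player",
    ".dcr": "Adobe Shockwave Player",
    ".mov": "Apple QuickTime",
    ".rm": "Real Player",
    ".wmv": "Windows Media Player",
    ".doc": "Word",
    ".xls": "Excel",
    ".ppt": "PowerPoint",
    ".zip": "WinZip",
}


def software_needed(extensions):
    """Split extensions into software we can name and formats we cannot."""
    needed = set()
    unknown = set()
    for ext in extensions:
        key = ext.lower()
        if key in RENDERERS:
            needed.add(RENDERERS[key])
        else:
            unknown.add(key)   # candidate obscure/obsolete material
    return needed, unknown


needed, unknown = software_needed([".PDF", ".swf", ".xyz"])
```

The `unknown` set corresponds to the last "con" above: material nobody has documented shows up only as a gap.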

Page 15: Preserving access

Individual web archives as sources of information

• Analysis of archive contents & harvesting statistics
• Web archivists’ observations & records
– UK Web Archive Technology Watch blog
• Website usage statistics
– Browser versions & operating systems
– Indicative of popularity
• Archived sites
– Plug-in requirements, file type information
– May include useful information websites
– Internet Archive complementary collection
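
As a hedged illustration of "analysis of archive contents", the sketch below tallies file extensions across a list of archived URLs. The function name `extension_counts` and the example URLs are made up for this sketch; a real analysis would more likely read the harvester's own logs or recorded MIME types, which are not shown here.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlparse


def extension_counts(urls):
    """Count archived files by lower-cased extension of the URL path."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path           # drops query string and fragment
        ext = splitext(path)[1].lower()     # "" when the path has no extension
        counts[ext or "(none)"] += 1
    return counts


# Hypothetical URLs, for illustration only.
sample = [
    "http://example.org/report.PDF",
    "http://example.org/report2.pdf?download=1",
    "http://example.org/section/",
    "http://example.org/intro.swf",
]
counts = extension_counts(sample)
```

Extensions are only a rough proxy for format, so counts like these are a starting point for the informed "guesses" rather than a definitive inventory.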

Page 16: Preserving access

Example: NLA Web archiving software environment July 2009

• Operating system: Windows XP
• Computer: Windows PC, Intel Pentium 4
• Browser: Internet Explorer 7 (main browser), IE8, Firefox 3.0
• Additional software:
– Adobe Reader 8
– Adobe Shockwave Player
– Adobe Flash Player 10
– Real Player 10
– Apple QuickTime 7
– Windows Media Player 11
– Java 6 Update 11
– JavaScript enabled
– Word, Excel, PowerPoint 2003
– WinZip

Page 17: Preserving access

Example: Earlier NLA Software Environment

1996 | 2000 | 2005
Windows 3.1/Windows for Workgroups | Windows 95 | Windows 2000
Windows PC | Windows PC | Windows PC
Netscape Navigator 1, 2 or 3? | Netscape Navigator 4.08 | IE6 (since June 2002)
Acrobat Reader | Acrobat Reader | Adobe Acrobat Reader
Macromedia Shockwave | Macromedia Shockwave | Macromedia Shockwave
? | Macromedia Flash player | Macromedia Flash player
Real Audio player | Real Player | Real Player
QuickTime | Apple QuickTime | Apple QuickTime
Netscape Media Player? | Windows Media Player 6.4? | Windows Media Player 9?
Java? | Java? | Java?
JavaScript enabled | JavaScript enabled | JavaScript enabled
Word, Excel, PowerPoint | Word, Excel, PowerPoint | Word, Excel, PowerPoint
PKUnzip? | WinZip | WinZip
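
The dated environments on this slide, together with the July 2009 one, lend themselves to a time-based lookup: given the year a site was captured, which documented environment is the best guess for viewing it? The sketch below is an assumption-laden illustration. The function name `environment_for` is hypothetical, and the rule it applies (take the most recent documented year not after the capture year) is a simplification: the slide itself notes IE6 was in use from June 2002, so the documented years are observation points, not exact change dates.

```python
from bisect import bisect_right

# Operating system and browser per documented year, taken from the slides.
# Treating each year as the start of a period is an assumption of this sketch.
ENVIRONMENTS = [
    (1996, {"os": "Windows 3.1/Windows for Workgroups",
            "browser": "Netscape Navigator 1, 2 or 3?"}),
    (2000, {"os": "Windows 95", "browser": "Netscape Navigator 4.08"}),
    (2005, {"os": "Windows 2000", "browser": "IE6"}),
    (2009, {"os": "Windows XP", "browser": "Internet Explorer 7"}),
]


def environment_for(capture_year):
    """Return (year, environment) for the latest documented year that is
    not after capture_year, or None if the capture predates all records."""
    years = [year for year, _ in ENVIRONMENTS]
    index = bisect_right(years, capture_year)
    if index == 0:
        return None
    return ENVIRONMENTS[index - 1]


guess = environment_for(2003)
```

A site captured in 2003 maps to the 2000 column here even though IE6 was already in use, which shows why such a lookup can only ever produce an informed guess.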

Page 18: Preserving access

Example: Comparison NLA and BnF software environments

BnF public in-house access software 2008
• Internet Explorer
• Adobe Reader
• Adobe Flash player
• Adobe Shockwave player
• VLC Media player
• Real player
• Word, Excel & PowerPoint Viewers
• Java Virtual Machine

BnF Librarian’s software since 2005
• Internet Explorer
• Acrobat Reader*
• Macromedia Flash player*
• Windows Media Player*
• QuickTime*
• Java Virtual Machine (Microsoft)*
• Later additions: Firefox, RealOne Player 10
* Software versions progressively updated to latest compatible with Windows XP

NLA web archivist’s software 2009
• Internet Explorer 7 and 8
• Firefox 3.0
• Adobe Reader 8
• Adobe Shockwave Player
• Adobe Flash Player 10
• Real Player 10
• Apple QuickTime 7
• Windows Media Player 11
• Java 6 Update 11
• JavaScript enabled
• Word, Excel, PowerPoint 2003
• WinZip
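
A comparison like this one can be sketched with plain set operations once product names are normalised. The sketch assumes that the first software list on the slide belongs to the BnF public in-house environment (2008) and the last to the NLA (2009), and it drops version numbers and harmonises spellings by hand; both choices, and the function name `compare`, are assumptions made for illustration.

```python
# Product names normalised by hand (versions dropped); an assumption of
# this sketch, not how either institution records its software.
BNF_PUBLIC_2008 = {
    "Internet Explorer", "Adobe Reader", "Adobe Flash Player",
    "Adobe Shockwave Player", "VLC media player", "Real Player",
    "Word/Excel/PowerPoint Viewers", "Java",
}
NLA_2009 = {
    "Internet Explorer", "Firefox", "Adobe Reader", "Adobe Shockwave Player",
    "Adobe Flash Player", "Real Player", "Apple QuickTime",
    "Windows Media Player", "Java", "Word/Excel/PowerPoint", "WinZip",
}


def compare(first, second):
    """Return what two environments share and what is unique to each."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


result = compare(BNF_PUBLIC_2008, NLA_2009)
```

The shared set hints at commonly used software across archives, while the two difference sets point to possible alternate access paths (for example, VLC at BnF versus Windows Media Player and QuickTime at NLA).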

Page 19: Preserving access

Going forward

• Is it worth pursuing approach 3?
• If so where would we record (IIPC PWG wiki?, other suggestions)?
• Interested in contributing?

Page 20: Preserving access

Questions?

Contact
• David Pearson [email protected]
• Maxine Davis [email protected]

Report to IIPC PWG by end October 2009

Everything, for Everyone Forever