1
Preserving access: Making more informed “guesses” about what works
Prepared by: Maxine Davis, Collaboration Research Officer
Presented by: David Pearson, Acting Director
Web Archiving & Digital Preservation, National Library of Australia
IIPC Open Day, San Francisco, 7 October 2009
2
Presentation Outline
• The problem
• Case study: PANDORA Web Archive
• Some approaches & options
  – Approach 1: Unified Digital Format Registry (UDFR)
  – Approach 2: Wikipedia
  – Approach 3: Another way: documenting what web archives actually use/used
3
The problem
• The World Wide Web is constantly evolving
  – Requires combinations of software/hardware to render web content
  – But what is used for creation and access changes
• Web archives
  – Contain snapshots of websites taken at different times (different sites, or the same sites multiple times)
  – Lots of files, many file formats, various versions
  – Aim for ongoing access
4
Process of version “creep” in the archive
• Mixed accessibility resulting from:
  – Different browsers, plug-ins, operating systems in use (then and now)
  – Backwards compatibility not guaranteed
  – Changes in standards and coding practices (deprecated, dead & non-standard tags)
  – Obsolescence of file formats & renderers
• Changes to access paths
  – Incremental loss of access not directly obvious
  – Alternative access paths not specified
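As an illustration of how "deprecated, dead & non-standard tags" might be spotted in archived pages, the sketch below counts a handful of legacy HTML elements using Python's standard `html.parser`. The tag list is a small illustrative selection rather than a complete inventory, and the function name `count_legacy_tags` is an assumption made for this example, not part of any archive tool.

```python
from collections import Counter
from html.parser import HTMLParser

# Illustrative selection only: elements deprecated in HTML 4.01, obsolete in
# HTML5, or never standardised. A real survey would need a fuller list.
LEGACY_TAGS = {"font", "center", "applet", "frameset", "frame", "blink", "marquee"}


class LegacyTagCounter(HTMLParser):
    """Counts start tags that belong to LEGACY_TAGS."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in LEGACY_TAGS:
            self.counts[tag] += 1


def count_legacy_tags(html):
    """Return a Counter of legacy tags found in one HTML document."""
    parser = LegacyTagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Run across snapshots of the same site taken in different years, counts like these could hint at when coding practices changed, although they say nothing about whether a given browser still renders the page acceptably.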
5
Case study: PANDORA Australia’s Web Archive (1)
• Selective archive began collecting 1996
  – Sites individually selected by NLA & partners
  – As at July 2009 over 70.6 million files
  – Accessible over the web using standard web browser
• .au whole domain harvests
  – 4 annual harvests 2005-2008 completed, 2009 underway with Internet Archive
  – Combined harvests 05-08 ~ 2.3 billion files
  – Not currently publicly available
6
Case study: PANDORA Australia’s Web Archive (2)
7
IIPC Preservation Working Group discussions
• Need for documenting the technical environment
• Support required for alternative preservation action strategies
  – Emulation of past environments
  – Migration to standard formats
  – Risk notification
  – Recording conversion and alternate access paths
• Exploring different approaches
• Sharing information sensible
8
Technical information of interest
• Browsers + plug-ins/helper applications versions & dependencies
• Used approximately when?
• Appropriate for which individual file format, type of file format, or whole archive?
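The three questions above (which software and dependencies, used when, for which formats) could be captured in a simple record. The sketch below is a hypothetical layout, not a schema from the IIPC, the NLA or any registry: the class `ToolRecord`, its field names and the helper `tools_for` are all assumptions for illustration. The sample entries use software named in the NLA environments in this presentation. The IE6 start year comes from the slides; the 2009 years only mark when a tool was observed in use, not when it was first installed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToolRecord:
    """Hypothetical record for one browser, plug-in or helper application."""
    name: str
    version: str
    kind: str                       # e.g. "browser", "plug-in"
    formats: Tuple[str, ...]        # file extensions the tool renders
    used_from: int                  # approximate first year of use
    used_to: Optional[int] = None   # None means still in use or unknown
    depends_on: Tuple[str, ...] = ()

    def in_use(self, year):
        """True if the tool is recorded as in use during the given year."""
        return self.used_from <= year and (self.used_to is None or year <= self.used_to)


# Sample entries; years are approximate (see the note above the code).
RECORDS = [
    ToolRecord("Internet Explorer", "6", "browser", ("html",), 2002),
    ToolRecord("Adobe Reader", "8", "plug-in", ("pdf",), 2009),
    ToolRecord("Adobe Flash Player", "10", "plug-in", ("swf",), 2009,
               depends_on=("Internet Explorer",)),
]


def tools_for(records, file_format, year):
    """Return the records that render file_format and were in use in year."""
    return [r for r in records if file_format in r.formats and r.in_use(year)]
```

For example, `tools_for(RECORDS, "swf", 2009)` returns the Flash Player entry, while the same query for 2001 returns an empty list because nothing is recorded for that year.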
9
Already documented?
• Manufacturer/vendor’s websites
• Developer’s networks, forums, blogs, etc.
• File format registries
• File extension resources
• Software archives/download sites
• Internet history websites
• Internet statistics websites
• Wikipedia
10
Possible Approach 1: UDFR
• Digital format registry will result from proposed merger of PRONOM and GDFR
• Pros
  – Considerable intellectual investment already
  – Could be used for general digital preservation and potential interaction with other tools
• Cons
  – Under development
  – Web archive requirements need to be specified, use cases developed, changes to data model, population with relevant data and regular updating
  – Temporal aspect not currently catered for
  – Entry point individual file format or software type [could be a pro?]
11
Possible Approach 2: Wikipedia (1)
• Pros
  – Existing free, web-based collaborative multilingual project
  – Draws together a rich set of information
    • browsers, layout engines, plug-ins & software, statistics, creators, standards, etc.
    • lists, history, comparisons, timelines, links to internal & external references
  – Updated by many voluntary contributors
12
Possible Approach 2: Wikipedia (2)
• Cons
  – General audience, not specific to web archive requirements or specific web archive
  – Amount of detail varies (between different language versions, articles)
  – Can be edited by multiple users (+ & -)
  – Not designed to interact with other digital preservation tools as UDFR has potential to do
13
Extract example
14
Possible Approach 3: Documenting what web archives are using/used
• Pros
  – Time based software suite approach
  – Starting point for
    • Potential UDFR seed list
    • Identifying commonly used software
    • Inferring additional software requirements
    • Identifying alternate access paths
• Cons
  – Easier to document current versions
  – Obscure/obsolete material in our collections may be unknown
15
Individual web archives as sources of information
• Analysis of archive contents & harvesting statistics
• Web archivists’ observations & records
  – UK Web Archive Technology Watch blog
• Website usage statistics
  – Browser versions & operating systems
  – Indicative of popularity
• Archived sites
  – Plug-in requirements, file type information
  – May include useful information websites
  – Internet Archive complementary collection
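A minimal sketch of the first source, analysis of archive contents, is given below. It assumes the harvest logs or index have already been reduced to (harvest year, MIME type) pairs. That input shape and the function name `formats_by_year` are assumptions for illustration and do not describe how PANDORA or any other archive stores its statistics.

```python
from collections import Counter, defaultdict


def formats_by_year(records):
    """Tally MIME types per harvest year.

    records: iterable of (harvest_year, mime_type) pairs. Parameters such as
    "; charset=utf-8" are dropped and the type is lower-cased before counting.
    """
    tally = defaultdict(Counter)
    for year, mime_type in records:
        base_type = mime_type.split(";")[0].strip().lower()
        tally[year][base_type] += 1
    return tally


# Made-up sample input, for illustration only.
SAMPLE = [
    (2005, "text/html; charset=utf-8"),
    (2005, "TEXT/HTML"),
    (2005, "application/x-shockwave-flash"),
    (2008, "application/pdf"),
]

if __name__ == "__main__":
    for year, counts in sorted(formats_by_year(SAMPLE).items()):
        print(year, counts.most_common())
```

A per-year tally of this kind would show which formats dominate each harvest and when a format first appears, which in turn suggests which plug-ins or helper applications need documenting for that period.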
16
Example: NLA Web archiving software environment July 2009
• Operating system: Windows XP
• Computer: Windows PC, Intel Pentium 4
• Browser: Internet Explorer 7 (main browser), IE8, Firefox 3.0
• Additional software:
  – Adobe Reader 8
  – Adobe Shockwave Player
  – Adobe Flash Player 10
  – Real Player 10
  – Apple QuickTime 7
  – Windows Media Player 11
  – Java 6 Update 11
  – JavaScript enabled
  – Word, Excel, PowerPoint 2003
  – WinZip
17
Example: Earlier NLA Software Environment
1996 | 2000 | 2005
Windows 3.1 / Windows for Workgroups | Windows 95 | Windows 2000
Windows PC | Windows PC | Windows PC
Netscape Navigator 1, 2 or 3? | Netscape Navigator 4.08 | IE6 (since June 2002)
Acrobat Reader | Acrobat Reader | Adobe Acrobat Reader
Macromedia Shockwave | Macromedia Shockwave | Macromedia Shockwave
? | Macromedia Flash player | Macromedia Flash player
Real Audio player | Real Player | Real Player
QuickTime | Apple QuickTime | Apple QuickTime
Netscape Media Player? | Windows Media Player 6.4? | Windows Media Player 9?
Java? | Java? | Java?
JavaScript enabled | JavaScript enabled | JavaScript enabled
Word, Excel, PowerPoint | Word, Excel, PowerPoint | Word, Excel, PowerPoint
PKUnzip? | WinZip | WinZip
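The three dated environments above lend themselves to a time-based lookup: given the year a snapshot was harvested, find the most recent documented environment at or before that year. The sketch below shows one way to do this with the standard `bisect` module. The dictionary keys (`os`, `browser`) and the function name `environment_for` are assumptions for illustration, and only the operating system and browser rows are included to keep the example short.

```python
from bisect import bisect_right

# Operating system and browser taken from the 1996, 2000 and 2005 columns;
# the question mark preserves the uncertainty recorded in the slides.
ENVIRONMENTS = {
    1996: {"os": "Windows 3.1 / Windows for Workgroups",
           "browser": "Netscape Navigator 1, 2 or 3?"},
    2000: {"os": "Windows 95", "browser": "Netscape Navigator 4.08"},
    2005: {"os": "Windows 2000", "browser": "IE6 (since June 2002)"},
}


def environment_for(year, environments=ENVIRONMENTS):
    """Return (documented_year, environment) for the latest entry at or
    before `year`, or None if the year predates every documented entry."""
    years = sorted(environments)
    index = bisect_right(years, year)
    if index == 0:
        return None
    documented_year = years[index - 1]
    return documented_year, environments[documented_year]
```

A snapshot from 2003 would map to the 2000 entry under this rule. That is a simplification: the slides note IE6 was in use from June 2002, so a fuller model would track each tool's own date range rather than whole-suite snapshots.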
18
Example: Comparison NLA and BnF software environments
BnF public in-house access software 2008
• Internet Explorer
• Adobe Reader
• Adobe Flash player
• Adobe Shockwave player
• VLC Media player
• Real player
• Word, Excel & PowerPoint Viewers
• Java Virtual Machine

BnF Librarian’s software since 2005
• Internet Explorer
• Acrobat Reader*
• Macromedia Flash player*
• Windows Media Player*
• QuickTime*
• Java Virtual Machine (Microsoft)*
• Later additions:
  – Firefox
  – RealOne Player 10
* Software versions progressively updated to latest compatible with Windows XP

NLA web archivist’s software 2009
• Internet Explorer 7 and 8
• Firefox 3.0
• Adobe Reader 8
• Adobe Shockwave Player
• Adobe Flash Player 10
• Real Player 10
• Apple QuickTime 7
• Windows Media Player 11
• Java 6 Update 11
• JavaScript enabled
• Word, Excel, PowerPoint 2003
• WinZip
19
Going forward
• Is it worth pursuing approach 3?
• If so, where would we record (IIPC PWG wiki?, other suggestions)?
• Interested in contributing?
20
Questions?
Contact
• David Pearson
  [email protected]
• Maxine Davis
Report to IIPC PWG by end October 2009
Everything, for Everyone Forever