
Memento: Big Leaps Towards Seamless Navigation of the Web of the Past

DESCRIPTION

These slides explain the Memento Framework (time travel for the Web) from the perspective of resource versioning. They also detail the progress made in deploying the framework since it was first introduced in November 2009, including standardization, tool development, and advocacy. In addition, they touch upon new challenges (discovery, branding) and announce plans to make transactional Web archiving software available in the course of 2011.

Memento Update CNI Task Force Meeting, Spring 2011

Memento http://mementoweb.org/

Herbert Van de Sompel Robert Sanderson Michael L. Nelson

Big Leaps Towards Seamless Navigation of the Web of the Past

Overview of Memento Framework

Deployment Progress

Memento and Data

Memento and Discovery

Memento and Branding

Alternative Web Archiving Strategies

Overview of Memento Framework

Memento wants to make it easy to access the Web of the Past.

Tate Online Today

Select Date March 16 2008

Tate Online March 16 2008

From National Archives

Memento achieves this by introducing a uniform version access capability to integrate the present and past Web.

Content Management Systems:

•  Designed to be aware of all versions of a resource;

•  Self-contained;

•  Variety of proprietary version mechanisms;

•  Versions interlinked using proprietary mechanisms.

World Wide Web:

•  Designed to forget about prior versions of a resource;

•  Distributed.

There are resource versions on the Web:

•  Content Management Systems;

•  Web Archives;

•  Transactional archives;

•  Search engine caches.

But the Web architecture has a hard time dealing with them:

•  Cannot talk about a resource as it used to exist;

•  Cannot access a prior version knowing the current one;

•  Cannot access the current version knowing a prior one;

Current approaches are ad hoc and localized.

Memento:

•  Regards the Web as a big Content Management System;

•  Introduces a uniform capability to access versions on the Web;

•  Does not build new archives but leverages all systems that host versions: Web archives, Content Management Systems, Software Version Systems, etc.

Memento’s version access approach:

•  Is distributed: versions may exist on several servers;

•  Uses time as a global version indicator;

•  Is based on the primitives of the Web: resource, resource state, representation, content negotiation, link.
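
The time-based content negotiation named in the last bullet can be sketched as follows. This is a minimal illustration, not the authors' client code: it assumes the `Accept-Datetime` request header from the Memento Internet-Draft carrying an HTTP-date in GMT, and the function name and example date are chosen here for illustration only.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build request headers asking for a resource as it existed at `when`."""
    # HTTP-dates are expressed in GMT, so normalize the datetime first.
    when_gmt = when.astimezone(timezone.utc)
    # Assumption: the header name follows the Memento Internet-Draft.
    return {"Accept-Datetime": format_datetime(when_gmt, usegmt=True)}


# Example: ask for the state of a resource on March 16 2008.
headers = accept_datetime_header(datetime(2008, 3, 16, tzinfo=timezone.utc))
print(headers)
```

A client would send these headers to a TimeGate for the Original Resource; time acts as the only version indicator, so no archive-specific URI syntax is needed.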

Original Resource and Versions

Bridge from Present to Past

Bridge from Past to Present

Memento Framework

Multiple Archives

Memento Client-Server Interaction

Deployment Progress

Significant progress has been made towards seamless navigation of the Web of the Past.

Standardization:

•  Standardization process started via the IETF;

•  Interest from IETF and W3C;

•  Encouraged by major Web architects, including: Tim Berners-Lee, Mark Nottingham, Michael Hausenblas.

https://datatracker.ietf.org/doc/draft-vandesompel-memento/

Memento Clients:

•  Several client tools developed by us and others;

•  Add-ons for Firefox (operational) and Internet Explorer (experimental);

•  Applications for Android (operational) and iPhone/iPad (in development);

•  Paper in next issue of Code4Lib Journal.

http://www.mementoweb.org/tools/

Memento server support (1):

•  Memento-compliant Wayback software:

•  Used by Internet Archive.

•  Available to Web archives, worldwide.

•  Please have your favorite Web Archive install this new version 1.6!

http://www.mementoweb.org/tools/

Memento server support (2):

•  Plug-in for MediaWiki (operational);

•  Used on W3C’s main wiki.

•  Please install it for your MediaWiki!

http://www.mementoweb.org/tools/

Memento Server Validator

•  Server side client:

•  Attempts to perform all Memento actions against a given URI

•  Reports success/failure of the interactions and warnings for optional aspects

•  Kept up to date with IETF Internet Draft

http://www.mementoweb.org/tools/

Memento Proxy Support

•  Several systems that host Mementos made Memento-compliant “by proxy”:

•  All major Web Archives that do not yet run Memento-compliant Wayback software

•  3,000+ MediaWiki systems, including Wikipedia

•  We want all of these to become natively Memento compliant!

Memento Website:

•  Ongoing effort to add materials that support understanding and adoption:

•  Introduction to Memento

•  How to recognize Mementos, TimeGates, Original Resources?

•  Guidelines for servers that host Mementos (Web Archives, CMS, snapshot archives, etc.)

http://www.mementoweb.org/guide/

Funding:

•  2007-2010: US $250K grant from Library of Congress;

•  Approx. 50K on Memento.

•  2010-2011: US $1 Million follow-up grant from Library of Congress.

•  For: Specification, outreach, tool development, further research.

Memento and Data

Memento Time Travel is really powerful.

Time-Series Data via HTTP follow-your-nose.

Memento Framework

Original Resource: http://dbpedia.org/resource/France

Memento Framework & Time Series

Time Travel across DBpedia Versions

Data collected through HTTP Navigation

paper at http://arxiv.org/abs/1003.3661

Memento and Discovery

Very few Web sites provide a “timegate” link.

Need additional mechanisms to support Discovery.

Batch discovery of Mementos: TimeMaps

A TimeMap minimally lists:

•  URI and datetime of Mementos known to an archive

•  URI of Original Resource

TimeMaps can be aggregated across systems that host Mementos
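
To make the minimal TimeMap contents and the aggregation step concrete, here is a small in-memory sketch. It is not the TimeMap serialization used by the framework; the class names, the `aggregate` helper, and the archive URIs are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class MementoEntry:
    uri: str            # URI of the Memento
    captured: datetime  # datetime of the Memento


@dataclass
class TimeMap:
    original: str                                 # URI of the Original Resource
    mementos: list = field(default_factory=list)  # MementoEntry items


def aggregate(timemaps):
    """Merge TimeMaps for one Original Resource from several systems."""
    originals = {tm.original for tm in timemaps}
    if len(originals) != 1:
        raise ValueError("TimeMaps must describe the same Original Resource")
    # A set drops entries that two systems report identically.
    merged = {m for tm in timemaps for m in tm.mementos}
    ordered = sorted(merged, key=lambda m: (m.captured, m.uri))
    return TimeMap(originals.pop(), ordered)


# Hypothetical TimeMaps from two archives for the same Original Resource.
archive_a = TimeMap("http://example.org/", [
    MementoEntry("http://archive-a.example/2009/http://example.org/",
                 datetime(2009, 6, 1, tzinfo=timezone.utc)),
])
archive_b = TimeMap("http://example.org/", [
    MementoEntry("http://archive-b.example/2008/http://example.org/",
                 datetime(2008, 3, 16, tzinfo=timezone.utc)),
])
combined = aggregate([archive_a, archive_b])
print([m.uri for m in combined.mementos])
```

Sorting by datetime gives a single chronological view across archives, which is what a client needs to pick the Memento closest to a requested time.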

Batch discovery of Mementos: Feed of TimeMaps

•  A system that hosts Mementos exposes a Feed (e.g. Atom) of TimeMaps to allow applications to remain in sync with its evolving Memento collection:

•  One Atom entry per Original Resource for which the system hosts Mementos;

•  The entry provides a “timemap” link to a TimeMap for the Original Resource;

•  The datetime value of the updated field of the entry changes when an additional Memento for the Original Resource becomes available (i.e. the TimeMap changes);

•  The ID of the entry is a tag URI based on the URI of the Original Resource.

Will be proposed to IIPC
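
A consuming application could use such a feed roughly as sketched below. The feed layout follows the bullets above, but the proposal was not finalized at the time of these slides, so the sample feed (trimmed to the relevant elements), the `archive.example` URIs, and the `stale_timemaps` helper are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed, trimmed to the elements discussed on the slide.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:archive.example,2011:http://example.org/a</id>
    <updated>2011-03-01T00:00:00Z</updated>
    <link rel="timemap" href="http://archive.example/timemap/http://example.org/a"/>
  </entry>
  <entry>
    <id>tag:archive.example,2011:http://example.org/b</id>
    <updated>2011-04-02T12:00:00Z</updated>
    <link rel="timemap" href="http://archive.example/timemap/http://example.org/b"/>
  </entry>
</feed>"""


def stale_timemaps(feed_xml, last_seen):
    """Return TimeMap URIs whose entry is new or has a changed `updated` value.

    `last_seen` maps entry IDs to the `updated` value seen on the previous poll.
    """
    stale = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if last_seen.get(entry_id) == updated:
            continue  # TimeMap unchanged since the last poll
        for link in entry.iter(ATOM + "link"):
            if link.get("rel") == "timemap":
                stale.append(link.get("href"))
    return stale


seen = {
    "tag:archive.example,2011:http://example.org/a": "2011-03-01T00:00:00Z",
    "tag:archive.example,2011:http://example.org/b": "2011-03-15T00:00:00Z",
}
print(stale_timemaps(FEED, seen))
```

Only the TimeMap for the second resource is refetched here, because only its `updated` value differs from what the application recorded earlier.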

Batch discovery of Mementos: robots.txt

•  robots.txt file is used by Web servers to convey crawling policies;

•  Add a directive to support discovery of Mementos known to the server:

•  Pointer to a single Memento can suffice as the robot can crawl on from there

•  Mementos allow for discovery of TimeMaps via HTTP links.

•  e.g. jcdl.org hosts snapshot archives of prior JCDL conferences and adds the following to its robots.txt:

Memento: jcdl.org/archive/2002/index.html

Will be promoted via Internet Draft
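
A crawler could pick up the proposed directive with a few lines of parsing. The sketch below assumes the `Memento:` line format shown in the example above, which was a proposal at the time and not a standardized robots.txt field; the helper name and the sample file are illustrative.

```python
def memento_pointers(robots_txt):
    """Collect the values of every `Memento:` line in a robots.txt body."""
    pointers = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "memento":
            pointers.append(value.strip())
    return pointers


# Hypothetical robots.txt body using the directive from the slide.
SAMPLE = """User-agent: *
Disallow: /private/
# snapshot archive of a prior conference
Memento: jcdl.org/archive/2002/index.html
"""
print(memento_pointers(SAMPLE))
```

Because each Memento links to its TimeMap, one pointer per collection is enough for the robot to discover the rest by following links.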

Batch discovery of TimeGates: robots.txt

•  robots.txt file is used by Web servers to convey crawling policies;

•  Add a directive to support discovery of TimeGates known to the server:

•  TimeGates can be on the server itself or on an external server

•  Value for the directive is typically a regular expression

•  e.g. example.org could point at TimeGates in its associated transactional archive ta.org via robots.txt:

TimeGate: ta.org/timegate/http://example.org/*

Will be promoted via Internet Draft
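
One possible reading of the example directive is that the trailing `*` stands for the rest of the Original Resource URI, so the TimeGate URI is the TimeGate base followed by the full original URI. The sketch below implements only that reading; the slide does not define the pattern syntax, and the helper name is hypothetical.

```python
import re

_SCHEME = re.compile(r"https?://")


def timegate_for(directive_value, original_uri):
    """Map an Original Resource URI to a TimeGate URI, or return None.

    Assumption: the value looks like "<timegate base><original prefix>*",
    with exactly one wildcard at the very end.
    """
    prefix, star, rest = directive_value.partition("*")
    if not star or rest:
        return None  # this sketch handles a single trailing wildcard only
    # Find where the embedded original-URI prefix starts (skip position 0,
    # in case the TimeGate base itself begins with a scheme).
    match = _SCHEME.search(prefix, 1)
    if match is None:
        return None
    base, covered = prefix[:match.start()], prefix[match.start():]
    if not original_uri.startswith(covered):
        return None  # directive does not cover this resource
    return base + original_uri


print(timegate_for("ta.org/timegate/http://example.org/*",
                   "http://example.org/news/today.html"))
```

A resource on a different host yields `None`, which a client could treat as "fall back to other discovery mechanisms".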

Discovery of Systems that Host Mementos: Registry/Feed

•  Registry of collections of Mementos, e.g. of Web Archives, Transactional Archives, etc.

•  Feed of registry records.

•  A registry record details essential characteristics of a Memento collection.

•  cf VOiD collection description for Linked Data.

Will be researched

Memento and Branding

Memento can recreate pages using resources from different archives.

This poses a branding challenge for archives.

Current Branding Practice for Web Archives

Page and embedded resources from same Web Archive

Branding for page and embedded resources

Branding for Web Archives in Memento Mode

Page and embedded resources from various Web Archives

Figure labels: Page branding; No branding; No branding

Will be researched

Alternative Web Archiving Strategies

Crawl-based Archives host distinct observations.

Transactional Archives never miss an update.

Crawl-Based Web Archives

Observations

For example: Heritrix crawler for Internet Archive

Crawl-Based Web Archives

•  Collect discrete observations of resources, not their entire evolution.

•  Can be rejected (robots.txt, by user-agent, by host IP).

•  Can be deceived (cloaking, by geo-location, by user-agent).

•  Coverage of particular Web server dependent on crawl-strategy.

Server-Side Transactional Web Archives

Change History

For example: TTApache, PageVault, Vignette Web Capture

Server-Side Transactional Web Archives

•  Collect all representations served by to-be-archived server.

•  To-be-archived server needs to cooperate.

•  Incentives e.g. institutional memory, official record of Web presence.

•  Archival coverage restricted by to-be-archived server, does not include external servers (e.g. embedded resources).

•  To-be-archived server can submit falsified information.

•  Archival collection management: what to keep, what not (e.g. significant changes, deduplication, …).
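
The deduplication aspect of collection management can be illustrated with a toy store that keeps a served representation only when its body differs from the last one kept for the same URI. This is a sketch of one common approach (content hashing with SHA-256), not a description of the transactional archive software discussed in these slides; the class and method names are hypothetical.

```python
import hashlib


class DedupStore:
    """Keep a representation only when it differs from the previous one."""

    def __init__(self):
        self._latest = {}  # URI -> digest of the most recently kept body
        self.kept = []     # (uri, timestamp, digest) for every stored change

    def submit(self, uri, timestamp, body: bytes) -> bool:
        """Record a served representation; return True if it was stored."""
        digest = hashlib.sha256(body).hexdigest()
        if self._latest.get(uri) == digest:
            return False  # same bytes as the last stored version: skip
        self._latest[uri] = digest
        self.kept.append((uri, timestamp, digest))
        return True


store = DedupStore()
results = [
    store.submit("http://example.org/", "2011-04-04T10:00:00Z", b"v1"),
    store.submit("http://example.org/", "2011-04-04T10:05:00Z", b"v1"),
    store.submit("http://example.org/", "2011-04-04T10:10:00Z", b"v2"),
    store.submit("http://example.org/", "2011-04-04T10:15:00Z", b"v1"),
]
print(results, len(store.kept))
```

Comparing against the most recent stored version, instead of all earlier ones, keeps a reversion to older content as a separate entry, so the change history stays complete.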

Development of Transactional Web Archive Software

Submit:

•  Java-Grizzly-Jersey submission interface application;

•  Berkeley DB metadata store;

•  FS store for body and headers.

Capture:

•  Apache connection filter module (mod_ta) captures URI, headers, body;

•  Module POSTs in real-time to transactional archive’s Submit URI.

Development of Transactional Web Archive Software

Development timeline:

•  Ongoing development (LANL) and testing (ODU);

•  Submit/Access finalized; development focus on collection management.

•  Expected release as open source, 3rd Quarter 2011.

Access:

•  Transactional archive natively supports Memento;

•  Immediate availability of archived content;

•  Export of WARC, e.g. for long-term archiving in other environment.
