Wanderer above the Sea of Fog – Caspar David Friedrich (1818) http://en.wikipedia.org/wiki/Wanderer_above_the_Sea_of_Fog @hvdsomp #idcc13

The Web as infrastructure for scholarly research and communication

Keynote presented at IDCC13, Amsterdam, The Netherlands, January 16 2013.

Page 1: The Web as infrastructure for scholarly research and communication

Wanderer above the Sea of Fog – Caspar David Friedrich (1818) http://en.wikipedia.org/wiki/Wanderer_above_the_Sea_of_Fog

@hvdsomp #idcc13

Page 2: The Web as infrastructure for scholarly research and communication

Herbert Van de Sompel

IDCC 2013, Amsterdam, The Netherlands, January 16 2013

Page 3: The Web as infrastructure for scholarly research and communication

The Scholarly Record is Changing

•  The scholarly record is extending with a wide range of non-traditional assets emerging from eScience and eHumanities
   •  e.g. datasets, software, ontologies, workflows, online debate, slides, blogs, videos, etc.

•  Many of these non-traditional assets:
   •  Have a wide range of relationships with and dependencies on other assets – grouping assets
   •  Are becoming increasingly dynamic, and do not have the sense of fixity that traditional assets such as journal articles or books have – versioning assets

Page 4: The Web as infrastructure for scholarly research and communication

grouping assets

versioning assets

Page 5: The Web as infrastructure for scholarly research and communication

discovering assets

Page 6: The Web as infrastructure for scholarly research and communication

1999

•  OAI was a heroic effort to fundamentally transform scholarly communication
   •  By promoting communication via preprints, non-peer-reviewed papers

•  The OAI took a technical approach to achieve the goal
   •  Make preprints easier to discover, access – Protocol for Metadata Harvesting
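To make the harvesting idea concrete, here is a minimal sketch of how an OAI-PMH request URL could be assembled. It is an illustration only: the `verb`, `metadataPrefix` and `from` arguments are standard OAI-PMH request parameters, but the base URL and the helper `build_oai_request` are hypothetical and not part of the talk.

```python
from urllib.parse import urlencode

# Hypothetical repository baseURL, used only for illustration.
BASE_URL = "http://repository.example.org/oai"


def build_oai_request(base_url, verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    # "from" is a Python keyword, so callers pass "from_"; the trailing
    # underscore is stripped before the argument goes into the query string.
    query.update({k.rstrip("_"): v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


# Ask for all Dublin Core records changed since the start of 2013.
url = build_oai_request(
    BASE_URL, "ListRecords", metadataPrefix="oai_dc", from_="2013-01-01"
)
print(url)
```

A harvester would issue an HTTP GET on this URL and page through the results; that part is omitted here to keep the sketch free of network access.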

Page 7: The Web as infrastructure for scholarly research and communication
Page 8: The Web as infrastructure for scholarly research and communication
Page 9: The Web as infrastructure for scholarly research and communication

HTTP GET on record identifier

An HTTP link

Don’t trust HTTP

Just another HTTP baseURL

Page 10: The Web as infrastructure for scholarly research and communication
Page 11: The Web as infrastructure for scholarly research and communication

grouping assets

versioning assets

Page 12: The Web as infrastructure for scholarly research and communication

2007

•  OAI-ORE observation: Scholarly assets are rapidly becoming compound, consisting of multiple resources with various:
   •  Relationships
   •  Interdependencies

•  How to convey this compound-ness in an interoperable manner so that applications can access, consume such assets?
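One way to picture the compound-ness question is as a small set of RDF statements that name a group of resources. The sketch below uses the OAI-ORE terms `ResourceMap`, `Aggregation`, `describes` and `aggregates`; the example URIs and the helper functions are hypothetical, and the plain N-Triples output is chosen for brevity rather than because the talk prescribes it.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map_triples(rem_uri, aggregation_uri, aggregated_uris):
    """Return (subject, predicate, object) URI triples for one aggregation."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # One "aggregates" statement per resource that belongs to the asset.
    triples += [(aggregation_uri, ORE + "aggregates", u) for u in aggregated_uris]
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical compound asset: a paper plus the dataset it depends on.
triples = resource_map_triples(
    "http://example.org/rem/1",
    "http://example.org/aggregation/1",
    ["http://example.org/paper.pdf", "http://example.org/dataset.csv"],
)
print(to_ntriples(triples))
```

An application that understands these four terms can discover which resources make up the asset without knowing anything else about the publishing system.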

Page 13: The Web as infrastructure for scholarly research and communication
Page 14: The Web as infrastructure for scholarly research and communication
Page 15: The Web as infrastructure for scholarly research and communication
Page 16: The Web as infrastructure for scholarly research and communication
Page 17: The Web as infrastructure for scholarly research and communication
Page 18: The Web as infrastructure for scholarly research and communication
Page 19: The Web as infrastructure for scholarly research and communication
Page 20: The Web as infrastructure for scholarly research and communication

See e.g. http://www.ctwatch.org/quarterly/articles/2007/08/interoperability-for-the-discovery-use-and-re-use-of-units-of-scholarly-communication/8/index.html

Page 21: The Web as infrastructure for scholarly research and communication
Page 22: The Web as infrastructure for scholarly research and communication

grouping assets

versioning assets

Page 23: The Web as infrastructure for scholarly research and communication

2009

•  Memento is about the Web and time:
   •  Resources evolve over time
   •  Only the current representation is available from a resource’s URI
   •  How to seamlessly access prior representations, if they exist?

•  Memento looks at this problem for the Web, in general

Digital Preservation Award 2010

Page 24: The Web as infrastructure for scholarly research and communication

•  Memento has potential consequences for scholarly communication

•  Observation: Scholarly assets are becoming increasingly dynamic, and do not have the sense of fixity that traditional assets such as journal articles or books have
   •  Even traditional assets are becoming increasingly dynamic and dependent on other assets, which may themselves be dynamic

2009

Page 25: The Web as infrastructure for scholarly research and communication

Scientific Workflows, Services, Data, Workflow Engines

Carole Goble, JCDL 2012 Keynote https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt

Page 26: The Web as infrastructure for scholarly research and communication

From The Version of Record to A Version of the Record

•  The ever-evolving nature of some assets challenges the notion of fixity as “forever frozen” and begs considering the notion of the “state of the scholarly record at a specific moment in time”

•  It will become essential to be able to determine what the state of related and interdependent assets was at certain moments in time

Page 27: The Web as infrastructure for scholarly research and communication

Two Perspectives on Memento

URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/

Web Archive

URI-R - http://www.cnn.com/
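The web-archive URI-M above carries both pieces of information in its path: a 14-digit capture timestamp and the original URI-R. The following sketch splits them apart. It assumes the `/web/<14 digits>/<original URI>` layout seen in the example and treats the timestamp as UTC; the function name and regular expression are illustrative, not part of Memento itself.

```python
import re
from datetime import datetime, timezone

# Assumed layout: http://<archive host>/web/<YYYYMMDDhhmmss>/<original URI>
WAYBACK_PATTERN = re.compile(r"^https?://[^/]+/web/(\d{14})/(.+)$")


def split_wayback_uri_m(uri_m):
    """Return (capture datetime, URI-R) for a Wayback-style URI-M."""
    match = WAYBACK_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {uri_m}")
    stamp, uri_r = match.groups()
    # Assumption: the 14-digit stamp is expressed in UTC.
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, uri_r


captured, uri_r = split_wayback_uri_m(
    "http://web.archive.org/web/20010911203610/http://www.cnn.com/"
)
print(captured.isoformat(), uri_r)
```

This only works for archives that encode the datetime in the URI; a CMS such as a wiki identifies old versions differently, which is one reason a protocol-level answer is attractive.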

Page 28: The Web as infrastructure for scholarly research and communication

Two Perspectives on Memento

URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333

CMS

URI-R - http://en.wikipedia.org/wiki/September_11_attacks

Page 29: The Web as infrastructure for scholarly research and communication
Page 30: The Web as infrastructure for scholarly research and communication
Page 31: The Web as infrastructure for scholarly research and communication

Page 32: The Web as infrastructure for scholarly research and communication
Page 33: The Web as infrastructure for scholarly research and communication

•  How to get to the time-specific resources from the generic resource?

•  Memento addresses the problem in a resource-centric way:
   •  Resource, URI, state, representation, link, content negotiation
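Content negotiation in the datetime dimension can be sketched as an ordinary HTTP request that carries the desired datetime in a header. Memento uses the `Accept-Datetime` request header for this, and a server answers with a `Memento-Datetime` response header on the selected version. The TimeGate base URL and the helper below are hypothetical; the sketch only prepares the request and does not send it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate endpoint, used only for illustration.
TIMEGATE_BASE = "http://timegate.example.org/timegate/"


def timegate_request(timegate_base, uri_r, when):
    """Return (request URL, headers) asking for uri_r as it was at `when`."""
    if when.tzinfo is None:
        raise ValueError("when must be a timezone-aware datetime")
    # HTTP dates are expressed in GMT, e.g. "Tue, 11 Sep 2001 20:36:10 GMT".
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_base + uri_r, {"Accept-Datetime": http_date}


request_url, headers = timegate_request(
    TIMEGATE_BASE,
    "http://www.cnn.com/",
    datetime(2001, 9, 11, 20, 36, 10, tzinfo=timezone.utc),
)
print(request_url)
print(headers)
```

The point of the design is that the client only needs the original URI and a datetime; it does not need to know which archive or versioning system holds the prior state.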

2009

Page 34: The Web as infrastructure for scholarly research and communication
Page 35: The Web as infrastructure for scholarly research and communication
Page 36: The Web as infrastructure for scholarly research and communication

Today Select Date Sep 12 2010 Sep 16 2010

From BL Archive

Access Versions via the original URI and datetime
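Selecting a version by original URI and datetime comes down to choosing, among the known snapshots of that URI, the one closest to the requested moment. The sketch below shows that selection step under simple assumptions: snapshots are held in a dictionary keyed by capture datetime, the snapshot URIs are made up, and "closest" means smallest absolute time difference.

```python
from datetime import datetime, timezone


def closest_snapshot(snapshots, target):
    """Return (capture datetime, snapshot URI) nearest to target, or None."""
    if not snapshots:
        return None
    best = min(snapshots, key=lambda captured: abs(captured - target))
    return best, snapshots[best]


# Hypothetical snapshot URIs for the two capture dates shown on the slide.
snapshots = {
    datetime(2010, 9, 12, tzinfo=timezone.utc): "http://archive.example.org/20100912/page",
    datetime(2010, 9, 16, tzinfo=timezone.utc): "http://archive.example.org/20100916/page",
}

chosen = closest_snapshot(snapshots, datetime(2010, 9, 13, tzinfo=timezone.utc))
print(chosen)
```

Other selection policies are possible, for example preferring the latest snapshot before the requested datetime; the nearest-in-time rule is used here only because it is the simplest to state.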

Page 37: The Web as infrastructure for scholarly research and communication

From The Version of Record to A Version of the Record

•  The ever-evolving nature of some assets challenges the notion of fixity as “forever frozen” and begs considering the notion of the “state of the scholarly record at a specific moment in time”

•  It will become essential to be able to determine what the state of related and interdependent assets was at certain moments in time

Page 38: The Web as infrastructure for scholarly research and communication
Page 39: The Web as infrastructure for scholarly research and communication
Page 40: The Web as infrastructure for scholarly research and communication
Page 41: The Web as infrastructure for scholarly research and communication
Page 42: The Web as infrastructure for scholarly research and communication
Page 43: The Web as infrastructure for scholarly research and communication
Page 44: The Web as infrastructure for scholarly research and communication
Page 45: The Web as infrastructure for scholarly research and communication
Page 46: The Web as infrastructure for scholarly research and communication

•  Is it possible to reconstruct the Web-based scholarly record as it was at a certain point in time?

•  Consider a special case: Given a paper, can one see the referenced materials as they were at the time of publication of the paper?
   •  ti: Time of publication
   •  Relationship: Cited resources

Recreating a Version of the Record

Page 47: The Web as infrastructure for scholarly research and communication

Published September 15 2004

Page 48: The Web as infrastructure for scholarly research and communication
Page 49: The Web as infrastructure for scholarly research and communication
Page 50: The Web as infrastructure for scholarly research and communication

Domain Gone

Page 51: The Web as infrastructure for scholarly research and communication

Archived copy December 5 2003

Page 52: The Web as infrastructure for scholarly research and communication
Page 53: The Web as infrastructure for scholarly research and communication

Current version

Page 54: The Web as infrastructure for scholarly research and communication

Archived copy December 11 2004

Page 55: The Web as infrastructure for scholarly research and communication
Page 56: The Web as infrastructure for scholarly research and communication

Resource gone

Page 57: The Web as infrastructure for scholarly research and communication

Archived copy December 5 2003

Page 58: The Web as infrastructure for scholarly research and communication
Page 59: The Web as infrastructure for scholarly research and communication

Resource gone

Page 60: The Web as infrastructure for scholarly research and communication

Archived copy unavailable

Page 61: The Web as infrastructure for scholarly research and communication

•  Papers from arXiv: 400,000 papers => 144,000 unique URIs
•  Papers from UNT ETD repository: 3,600 papers => 18,000 URIs
•  Referenced URIs of established scholarly repositories removed (e.g. http://dx.doi.org), i.e. focusing in on the periphery of the scholarly record

•  Study looks into:
   •  Does the referenced resource still exist?
   •  Are there archived versions of the referenced resource?
      •  From around the time of publication of the citing paper?

•  Study does not look into dynamic aspects:
   •  If the referenced resource still exists, is its content the same as at ti?
   •  Does an archived version have the same content as at ti?

Pilot Study at Scale with Memento

Sanderson, R., Phillips, M., and Van de Sompel, H. (2011) Analyzing the Persistence of Referenced Web Resources with Memento. Open Repositories 2011; Arxiv preprint. arXiv:1105.3459 ; http://arxiv.org/abs/1105.3459
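The three questions the study asks about each referenced URI can be sketched as a small classification function. This is not the study's code: the 30-day window used for "around the time of publication" is an assumption made for the example, liveness is passed in rather than checked over the network, and the dates come from the walkthrough slides (paper published September 15 2004, archived copy from December 5 2003).

```python
from datetime import datetime, timedelta, timezone


def classify_reference(is_live, archived_at, published, window=timedelta(days=30)):
    """Summarize the persistence of one referenced URI.

    is_live:     whether the URI still resolves today (supplied by the caller)
    archived_at: capture datetimes of known archived copies
    published:   publication time ti of the citing paper
    window:      assumed tolerance for "around the time of publication"
    """
    near = any(abs(captured - published) <= window for captured in archived_at)
    return {
        "live": is_live,
        "archived": len(archived_at) > 0,
        "archived_near_publication": near,
    }


published = datetime(2004, 9, 15, tzinfo=timezone.utc)
copies = [datetime(2003, 12, 5, tzinfo=timezone.utc)]

result = classify_reference(is_live=False, archived_at=copies, published=published)
print(result)
```

As on the slide, this says nothing about whether the archived content matches what the author saw at ti; it only records existence and temporal proximity.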

Page 62: The Web as infrastructure for scholarly research and communication

UNT

Page 63: The Web as infrastructure for scholarly research and communication

arXiv

Page 64: The Web as infrastructure for scholarly research and communication

The Good News ™

•  Despite there not being a pro-active effort to archive those resources, a considerable number were archived

o  Because they had HTTP URIs and hence were archived as part of ongoing web archiving processes

o  In The Wild archiving comes for free with the web infrastructure

•  404 resources exist in web archives and Memento can access them via their original HTTP URI

o  Does that make an HTTP URI a PID?

Page 65: The Web as infrastructure for scholarly research and communication

The Bad News ™

•  Many resources were not archived

•  For many resources there were no archival versions around ti

Page 66: The Web as infrastructure for scholarly research and communication
Page 67: The Web as infrastructure for scholarly research and communication
Page 68: The Web as infrastructure for scholarly research and communication
Page 69: The Web as infrastructure for scholarly research and communication

Automatic Creation of Archival Snapshots

•  There is a need for a more pro-active approach to archive dynamic, interdependent assets, e.g.:

o  Web Archives as infrastructure
o  Use CMS, wikis, datawikis with solid versioning mechanisms
o  Archiving linked context at the time of publication
o  Archive at the moment of use (social interaction, downloading, annotating, etc.)
o  Delineate which resources are considered in/out of a scholarly asset (OAI-ORE) to understand what needs archiving

Page 70: The Web as infrastructure for scholarly research and communication

discovering assets

Page 71: The Web as infrastructure for scholarly research and communication

2012

•  ResourceSync is about allowing 3rd party systems and applications to remain synchronized with a server’s evolving resources.

•  Many use cases:
   •  Mirroring repository content
   •  Aggregating content
   •  Replicating datasets
   •  Exposing content to archives
   •  Keeping linked data applications that leverage remote data up-to-date

•  Differing needs regarding:
   •  Coverage
   •  Accuracy
   •  Latency

Page 72: The Web as infrastructure for scholarly research and communication

ResourceSync Approach

•  Resource centric; it’s all about the URI (again)

•  Introduces a set of modular capabilities that a server can implement to allow 3rd parties to remain in sync with its resources. Recurrently publish:

o  Resource Lists
o  Change Lists
o  Resource Dumps
o  Change Dumps

•  All capabilities based on the Sitemap document formats and extensions thereof

o  Existing Sitemaps are off-the-shelf compliant
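Because the capabilities build on the Sitemap format, the core of a resource list can be sketched with the standard Sitemap elements alone. The example below writes and reads a plain Sitemap (`urlset`, `url`, `loc`, `lastmod`) and then compares two lists to derive the kind of information a change list would convey. ResourceSync-specific extension elements are deliberately left out, and the function names and example URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_resource_list(resources):
    """Serialize {URI: lastmod} as a plain Sitemap document."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for uri, lastmod in sorted(resources.items()):
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


def parse_resource_list(xml_text):
    """Read a Sitemap document back into {URI: lastmod}."""
    ns = {"sm": SITEMAP_NS}
    root = ET.fromstring(xml_text)
    return {
        url.findtext("sm:loc", namespaces=ns): url.findtext("sm:lastmod", namespaces=ns)
        for url in root.findall("sm:url", ns)
    }


def diff_resource_lists(old, new):
    """Report which URIs were created, updated or deleted between two lists."""
    return {
        "created": sorted(set(new) - set(old)),
        "updated": sorted(u for u in set(old) & set(new) if old[u] != new[u]),
        "deleted": sorted(set(old) - set(new)),
    }


# Hypothetical server state at two moments in time.
monday = {"http://example.org/a": "2013-01-14", "http://example.org/b": "2013-01-14"}
tuesday = {"http://example.org/a": "2013-01-15", "http://example.org/c": "2013-01-15"}

published = build_resource_list(tuesday)
changes = diff_resource_lists(monday, parse_resource_list(published))
print(changes)
```

A third party that keeps the previous list can synchronize by fetching only the created and updated URIs and dropping the deleted ones, which is the trade-off between coverage and latency that the modular capabilities are meant to address.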

Page 73: The Web as infrastructure for scholarly research and communication

ResourceSync Capabilities

Page 74: The Web as infrastructure for scholarly research and communication

2012

•  Beta spec end 01/2013
   •  http://www.openarchives.org/rs/

•  Feedback
   •  mailto:[email protected]

•  Papers in D-Lib Magazine
   •  http://dx.doi.org/10.1045/september2012-vandesompel
   •  http://dx.doi.org/10.1045/january2013-klein

•  Paper in Ariadne
   •  http://www.ariadne.ac.uk/issue70/lewis-et-al

Page 75: The Web as infrastructure for scholarly research and communication

1998 - 2013

Page 76: The Web as infrastructure for scholarly research and communication

a stack of journals or a bunch of PDF files

a network of interconnected assets and actors

1998 - 2013

Page 77: The Web as infrastructure for scholarly research and communication

Conclusion

•  OAI-ORE, Memento, ResourceSync illustrate the potential of leveraging the Web infrastructure for scholarly communication

•  This suggests that other special requirements of scholarly communication (certification, archiving, persistence, trust, annotation, metrics, …) may be addressable in an interoperable manner by leveraging the Web infrastructure

•  Wins:
   •  Long Term Sustainability: Reuse of infrastructure (network, software, platforms, standards, etc.) that the entire world depends on
   •  Integration of scholarly discourse with other Web-based discourse

Page 78: The Web as infrastructure for scholarly research and communication

Wanderer above the Sea of Fog – Caspar David Friedrich (1818)

http://en.wikipedia.org/wiki/Wanderer_above_the_Sea_of_Fog

@hvdsomp #idcc13