Reference Rot: Threat and Remedy
UKSG15, 30 March - 1 April 2015
Funded by the Andrew W. Mellon Foundation
Peter Burnhill & Richard Wincewicz, EDINA, University of Edinburgh
for the Hiberlink Team at University of Edinburgh & LANL Research Library
The Project Team 2013 – 2015, funded by the
Andrew W. Mellon Foundation
• Los Alamos National Laboratory:
Research Library: Herbert Van de Sompel
Harihar Shankar, [Martin Klein, Rob Sanderson]
• University of Edinburgh:
Language Technology Group: Claire Grover,
Beatrice Alex, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]
EDINA * : Peter Burnhill, Muriel Mewissen (Project Manager),
Neil Mayo, Tim Stickland, Richard Wincewicz,
Centre for Service Delivery & Digital Expertise
… acts as part of the Jisc Family
edina.ac.uk
hiberlink.org
Overview
1. Introduction / Threat
2. Analysis
3. Large-scale Evidence
4. Devising Remedy
5. Summary
Tweet to #UKSG15
When what was referenced & cited ceases to say the same thing, or ‘has ceased to be’
http://www.snorgtees.com/this-parrot-has-ceased-to-be
1. The Threat of Reference Rot
“when links to web resources
no longer point to what they once did”
Reference Rot = Link Rot + Content Drift
‘Link Rot’
+ Content Drift: what is at the end of the URI has changed, or gone!
http://dl00.org
2000
http://dl00.org
2004
http://dl00.org
2005
http://dl00.org
2008
(a) Dynamic content, as values on the webpage change over time
(b) Static content, but very different (often unrelated) web pages
2. Analysis
Tweets on #hiberlink to
#UKSG15
Take a landmark publication, 10+ years ago
Few of those references to the Web now work as intended
A re-direct [from RLG to OCLC] but ‘content drift’
Fail !!
Reference no longer works: ‘link rot’
Fail !!
Reference no longer works: ‘link rot’
Fail !!
A re-direct but content not found
Fail !!
Successful link: URI worked as expected in December 2014
But sadly, it now does not
Fail !!
Successful link: URI works as expected
Classic link rot: ‘Page Not Found’
Fail !!
Reference to the Web is to an e-journal that is still current
Classic link rot: ‘Page Not Found’
Fail !!
URI works but content drift: reference is not as intended
Fail !!
=> Content of Citations Rot over Time!!
… meaning rotten references for the reader
… in what is then a rotten article!
… & sale of rotten goods & undermining the integrity of the scholarly record
3. Large-scale Evidence
Hiberlink Project Methodology
to discover the answer to a 2-part question
Do references to web-based content (URIs) work?
• Focus on content on ‘the wild Web’
• not that which is in e-journals etc
i. Impact of Time: Is the URI still on the ‘Live Web’?
• Allowed up to a maximum of 50 redirects
ii. Is a ‘Memento’ of that content in the ‘Archived Web’?
Memento: a prior version, what the Original Resource was like at some time in the past.
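Part (i) of the question can be sketched as a redirect-following check. This is an illustration only, not the project's crawler: the `fetch` callable is a hypothetical stand-in for a real HTTP client, the toy web is invented, and only the cap of 50 redirects comes from the slide. A final 200 response shows only that the URI resolves; it says nothing about content drift.

```python
from typing import Callable, Optional, Tuple

# A fetcher returns (HTTP status code, redirect target or None) for one URI.
# Hypothetical stand-in for a real HTTP client; no network access is made here.
Fetcher = Callable[[str], Tuple[int, Optional[str]]]

MAX_REDIRECTS = 50  # the cap stated on the slide
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_on_live_web(uri: str, fetch: Fetcher, max_redirects: int = MAX_REDIRECTS) -> bool:
    """Follow redirects from `uri`; return True only if the chain ends in a 200."""
    current = uri
    redirects = 0
    while True:
        status, location = fetch(current)
        if status in REDIRECT_CODES and location:
            redirects += 1
            if redirects > max_redirects:
                return False  # chain too long (or a loop): treat as not live
            current = location
            continue
        return status == 200


# Invented toy web used only to exercise the function.
toy_web = {
    "http://a.example/": (301, "http://b.example/"),
    "http://b.example/": (200, None),
    "http://gone.example/": (404, None),
    "http://loop.example/": (302, "http://loop.example/"),
}


def toy_fetch(uri: str) -> Tuple[int, Optional[str]]:
    return toy_web.get(uri, (404, None))
```

Injecting the fetcher keeps the redirect logic testable without touching the network.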
3. Large-scale Empirical Evidence
c. 400,000 articles across the three corpora (Row #5 in Table 2)
contained over a million ‘web at large’ references (Row #4 in Table 3)
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
A Key Aspect of Hiberlink Project Methodology
1. Convert Scholarly Statement from PDF into XML
2. Locate the references & extract each and every URL
• Many technical challenges
• URLs broken across a newline; underscores rendered as images
• Use up to 15 regular expressions for matching; regard each match as a URI
University of Edinburgh Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
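One of the challenges listed above, a URL broken across a newline, can be illustrated with a small sketch. This is not the Edinburgh parser and does not reproduce its regular expressions; it uses a single simplified pattern and a rejoining heuristic that are both assumptions made for illustration.

```python
import re
from typing import List

# One simplified pattern (an assumption; the project used up to 15 expressions).
URL_RE = re.compile(r"https?://[^\s<>]+")
TRAILING_PUNCT = ".,;:!?)]}'\""


def rejoin_broken_urls(text: str) -> str:
    """Heuristic: if a line ends in a URL and the next line starts with a
    word containing '/', treat that word as the continuation of the URL."""
    joined: List[str] = []
    for line in text.splitlines():
        words = line.split()
        prev_words = joined[-1].split() if joined else []
        continues_url = (
            bool(prev_words)
            and bool(words)
            and URL_RE.match(prev_words[-1]) is not None
            and "/" in words[0]
            and URL_RE.match(words[0]) is None
        )
        if continues_url:
            joined[-1] = joined[-1].rstrip() + line.lstrip()
        else:
            joined.append(line)
    return "\n".join(joined)


def extract_urls(text: str) -> List[str]:
    """Rejoin broken URLs, then return every match without trailing punctuation."""
    repaired = rejoin_broken_urls(text)
    return [m.group(0).rstrip(TRAILING_PUNCT) for m in URL_RE.finditer(repaired)]


sample = (
    "See http://web.archive.org/web/20150219094636/http://www.n\n"
    "ewyorker.com/magazine/2015/01/26/cobweb for details.\n"
    "Also (http://hiberlink.org)."
)
urls = extract_urls(sample)
```

The heuristic can produce false joins (for example when the next line starts with a word such as "and/or"), which is one reason a production parser needs many more rules.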
Scholarly Articles [in PMC] increasingly link to
Web Resources, not just back to other Articles
Scholarly Articles [in Elsevier] increasingly link to
Web Resources, not just back to other Articles
Mementos for URIs archived within 14 days of being referenced
PMC corpus
6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today), Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
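The "within 14 days" criterion can be sketched as a date comparison. The slide does not say whether the window counts days before, after, or on either side of the reference date; the sketch below assumes an absolute difference in either direction, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

WINDOW = timedelta(days=14)  # the window named on the slide


def closest_memento(referenced_at: datetime,
                    memento_datetimes: Iterable[datetime]) -> Optional[datetime]:
    """Return the archived datetime nearest to the reference date, if any."""
    return min(memento_datetimes, key=lambda m: abs(m - referenced_at), default=None)


def has_memento_within_window(referenced_at: datetime,
                              memento_datetimes: Iterable[datetime],
                              window: timedelta = WINDOW) -> bool:
    """True if some Memento lies within `window` of the reference date.
    Assumption: the window is symmetric around the reference date."""
    best = closest_memento(referenced_at, memento_datetimes)
    return best is not None and abs(best - referenced_at) <= window


# Invented dates used only to exercise the functions.
referenced = datetime(2015, 1, 26)
snapshots = [datetime(2014, 11, 1), datetime(2015, 2, 5), datetime(2015, 6, 1)]
```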
Mementos for URIs archived within 14 days of being referenced
Elsevier corpus
4. Devising Remedy for Reference Rot
The Remedy Is Quick Freeze & Archive
3 workflows in scholarly statement
① Preparation -> Study -> Compose -> Submission
② Publication -> Editing -> (Revision) -> Acceptance -> Issue
③ Post-Publication -> Deposit/Ingest -> Reader Access -> Use
To identify the best opportunities for Intervention to make Remedy: to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’
Identify the Actors & how to assist them to do the right thing!
Ideally at the earliest moment of capture
… when the Authors are trawling for content
… for what an Author regards as significant
… or needs to provide as evidence
… re-factoring the HTML link that is returned
• http://www.newyorker.com/magazine/2015/01/26/cobweb
• Archive timestamp: 2015-02-19T09:46:36
• http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb
Hiberlink Remedy: Components in a Robust Link
a) Take simple URI - to article in New Yorker magazine (say)
b) Augment Link with Datetime and Archive URI
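In the New Yorker example, the archive URI looks like the original URI prefixed by the archive's base address and a 14-digit form of the archive timestamp. A minimal sketch of deriving the two added components, assuming that pattern holds; the function names are hypothetical:

```python
from datetime import datetime

# Base address taken from the example on the slide.
WAYBACK_BASE = "http://web.archive.org/web"


def wayback_uri(original_uri: str, archived_at: datetime) -> str:
    """Build an archive URI: base / YYYYMMDDhhmmss / original URI."""
    stamp = archived_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_BASE}/{stamp}/{original_uri}"


def version_date(archived_at: datetime) -> str:
    """Format the archive datetime as on the slide, e.g. 2015-02-19T09:46:36."""
    return archived_at.strftime("%Y-%m-%dT%H:%M:%S")


original = "http://www.newyorker.com/magazine/2015/01/26/cobweb"
archived = datetime(2015, 2, 19, 9, 46, 36)
archive_uri = wayback_uri(original, archived)
```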
Hiberlink.org
What Robust Hiberlinks look like
• Hiberlinks are modified <a> HTML elements
• Include archive URL and timestamp as additional attributes
<a href="http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versionurl="http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versiondate="2015-02-19T09:46:36">Cobweb Article</a>
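A minimal sketch of writing and reading such a link, using only the Python standard library. The attribute names `data-versionurl` and `data-versiondate` come from the slide; the function and class names are hypothetical, and this is not the project's own code.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional, Tuple


def robust_link(href: str, version_url: str, version_date: str, text: str) -> str:
    """Render an <a> element carrying the archive URL and timestamp."""
    return (
        f'<a href="{escape(href, quote=True)}" '
        f'data-versionurl="{escape(version_url, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


class RobustLinkReader(HTMLParser):
    """Collect (href, versionurl, versiondate) from anchors that carry them."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[Optional[str], str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        found = dict(attrs)
        if "data-versionurl" in found:
            self.links.append(
                (found.get("href"), found["data-versionurl"], found.get("data-versiondate"))
            )


link = robust_link(
    "http://www.newyorker.com/magazine/2015/01/26/cobweb",
    "http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb",
    "2015-02-19T09:46:36",
    "Cobweb Article",
)
reader = RobustLinkReader()
reader.feed(link)
```

Because the extra information lives in `data-` attributes, the anchor still behaves as an ordinary link for software that ignores them.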
Help authors do the right thing:
① Triggering archiving of referenced web content
when it is noted, using a reference manager, e.g. EndNote, Reference Manager, Zotero
– Hiberlink Plug-in developed for Zotero
② Returns Datetime URI for archived content that
can be used in the citation
Remedy To Avoid Reference Rot
https://www.zotero.org/
Zotero workflow
Create reference
Add URL
Update URL
Duplicate reference
Pass URL to archive service
Receive archive URL
Store data in database
Add data to reference
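The workflow steps above can be sketched in outline. This is not the Zotero plug-in: the classes, the in-memory "database" and the `fake_archive` function are hypothetical stand-ins, and no archive service is contacted. The sketch also assumes that every URL change triggers a fresh archive request.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, Optional, Tuple

# An archive service takes a URL and returns (archive URL, archive datetime).
ArchiveService = Callable[[str], Tuple[str, datetime]]


@dataclass
class Reference:
    title: str
    url: Optional[str] = None
    archive_url: Optional[str] = None
    archived_at: Optional[datetime] = None


@dataclass
class ReferenceLibrary:
    archive: ArchiveService
    # Stand-in for the plug-in's database: URL -> (archive URL, datetime).
    snapshots: Dict[str, Tuple[str, datetime]] = field(default_factory=dict)

    def set_url(self, ref: Reference, url: str) -> None:
        """Called when a URL is added to or updated on a reference."""
        ref.url = url
        archive_url, archived_at = self.archive(url)   # pass URL, receive archive URL
        self.snapshots[url] = (archive_url, archived_at)  # store data in database
        ref.archive_url = archive_url                     # add data to reference
        ref.archived_at = archived_at


def fake_archive(url: str) -> Tuple[str, datetime]:
    """Stand-in archive service with a fixed timestamp; makes no network call."""
    return (f"http://web.archive.org/web/20150219094636/{url}",
            datetime(2015, 2, 19, 9, 46, 36))


library = ReferenceLibrary(archive=fake_archive)
ref = Reference(title="Cobweb")
library.set_url(ref, "http://www.newyorker.com/magazine/2015/01/26/cobweb")
```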
Using the Plugin in Zotero
Opportunity / Time for a Demo ?
So what should we expect of the Publisher?
Beyond the assurance that the fish / references / articles
sold are not rotten
Help Publishers do the right thing
The next best opportunity for Quick Freeze
• to avoid reference rot & to ‘stop the rot’
① Study: Preparation -> (Review) -> Submission
② Publication: Editorial -> (Revision) -> Issue
③ Post-Publication: Deposit/Ingest -> Provide/Access -> Use
Actors:
① The Author
② The Editor / Publisher
③ The Access Platform / Librarian / Archival Organisations
OJS plugin
1. Parses the document
• Converts .pdf to .html
• Extracts URIs
2. Archives the content for each reference
• The Author and Editor can choose which version is
used as the archival copy
3. Creates an HTML version of the document
• including a link to the archived version of each of the
references
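Step 3 can be sketched as a pass over the HTML that decorates each known link with its archived version. This is not the OJS plug-in's code: the function name and the snapshot mapping are hypothetical, and the pattern deliberately handles only anchors whose first attribute is a double-quoted `href` that contains no character references.

```python
import re
from html import escape
from typing import Dict, Tuple

# Deliberately narrow: matches only <a href="..."> with href as first attribute.
ANCHOR_RE = re.compile(r'<a\s+href="([^"]+)"')


def add_archived_versions(html_doc: str, snapshots: Dict[str, Tuple[str, str]]) -> str:
    """Add data-versionurl / data-versiondate to every anchor whose href
    has an entry in `snapshots` (href -> (archive URL, archive datetime))."""

    def decorate(match: "re.Match[str]") -> str:
        href = match.group(1)
        if href not in snapshots:
            return match.group(0)  # no archived copy known: leave untouched
        version_url, version_date = snapshots[href]
        return (
            f'{match.group(0)} data-versionurl="{escape(version_url, quote=True)}" '
            f'data-versiondate="{escape(version_date, quote=True)}"'
        )

    return ANCHOR_RE.sub(decorate, html_doc)


page = (
    '<p><a href="http://www.newyorker.com/magazine/2015/01/26/cobweb">Cobweb</a> '
    'and <a href="http://example.org/x">other</a></p>'
)
known = {
    "http://www.newyorker.com/magazine/2015/01/26/cobweb": (
        "http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb",
        "2015-02-19T09:46:36",
    )
}
decorated = add_archived_versions(page, known)
```

A real implementation would more likely use an HTML parser than a regular expression, since attribute order and quoting vary in practice.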
Well Published References & Augmented Links
Post-Publication (& other bulk processing)
The last ‘best’ opportunity for Quick Freeze
• not to avoid reference rot but to ‘stop the rot’
① Study: Preparation -> (Review) -> Submission
• Should note & act for each URI, one by one
② Publication: Editorial -> (Revision) -> Issue
• (Probably) should examine each one by one
③ Post-Publication: Deposit/Ingest
• Cannot hope to process one by one
Actors:
① The Author
② The Editor / Publisher
③ Access Platforms / Archival Organisations / Librarians
& each article contains many references
Recall Key Aspect of Hiberlink Methodology
1. Convert Scholarly Statement from PDF into XML
2. Locate the references & extract each and every URL
• Many technical challenges
• URLs broken across a newline; underscores rendered as images
• Use up to 15 regular expressions for matching; regard each match as a URI
=> Edinburgh Parser [github.com/hiberlink]
University of Edinburgh Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
Time to Build Infrastructure:
HiberActive
Publishing platform HiberActive
External archival service
(e.g. Internet Archive)
• Asynchronous (returns Robust Link)
• Distributed (archived with different organisations)
• Lightweight (leveraging HTTP & what already exists)
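The "asynchronous" and "distributed" properties can be sketched with `asyncio`: one URI is submitted to several archives concurrently and the results are collected as they complete. This does not describe HiberActive's actual design; the archiver callables, their names and the `.example` addresses are hypothetical, and no network request is made.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# An archiver takes a URI and eventually returns the archive URI it created.
Archiver = Callable[[str], Awaitable[str]]


async def archive_everywhere(uri: str, archivers: Dict[str, Archiver]) -> Dict[str, str]:
    """Submit `uri` to every archiver concurrently; keep only the successes."""
    names = list(archivers)
    results = await asyncio.gather(
        *(archivers[name](uri) for name in names), return_exceptions=True
    )
    return {
        name: result
        for name, result in zip(names, results)
        if not isinstance(result, BaseException)
    }


# Hypothetical stand-ins for archival services.
async def archive_a(uri: str) -> str:
    await asyncio.sleep(0)
    return f"https://archive-a.example/{uri}"


async def archive_b(uri: str) -> str:
    await asyncio.sleep(0)
    raise RuntimeError("archive B is unavailable")


async def archive_c(uri: str) -> str:
    await asyncio.sleep(0)
    return f"https://archive-c.example/{uri}"


copies = asyncio.run(
    archive_everywhere(
        "http://hiberlink.org",
        {"a": archive_a, "b": archive_b, "c": archive_c},
    )
)
```

Collecting failures instead of raising means that one unavailable archive does not prevent the others from returning a copy.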
5. Summary
Hiberlink Outcomes
1. Defined the Threat of Reference Rot
2. Quantified the extent and way in which it
exists & undermines the Scholarly Record
3. Pointed to potential & practical Remedy
As the project comes to an end (June 2015), we wish to:
• Tell the world about these achievements
• Engage with others
– to build infrastructure
– to prompt adoption (copying) of prototypes by 3rd parties
• such as reference managers, editorial systems, publication systems, archival systems
Thank you,
Questions welcome
http://hiberlink.org #hiberlink
Email: [email protected]
Still Time to Tweet to #UKSG15