Reference Rot: Threat and Remedy UKSG15 30 March - 1 April 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill & Richard Wincewicz EDINA, University of Edinburgh for the Hiberlink Team at University of Edinburgh & LANL Research Library


Page 1: Reference Rot: Threat and Remedy

Reference Rot: Threat and Remedy

UKSG15, 30 March - 1 April 2015

Funded by the Andrew W. Mellon Foundation

Peter Burnhill & Richard Wincewicz

EDINA, University of Edinburgh

for the Hiberlink Team at University of Edinburgh & LANL Research Library

Page 2: Reference Rot: Threat and Remedy

The Project Team 2013 – 2015, funded by the

Andrew W. Mellon Foundation

• Los Alamos National Laboratory:

Research Library: Herbert Van de Sompel, Harihar Shankar, [Martin Klein, Rob Sanderson]

• University of Edinburgh:

Language Technology Group: Claire Grover, Beatrice Alex, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]

EDINA * : Peter Burnhill, Muriel Mewissen (Project Manager), Neil Mayo, Tim Stickland, Richard Wincewicz

Centre for Service Delivery & Digital Expertise


Page 3: Reference Rot: Threat and Remedy

… acts as part of the Jisc Family

edina.ac.uk

Page 4: Reference Rot: Threat and Remedy

hiberlink.org

Overview

1. Introduction / Threat

2. Analysis

3. Large-scale Evidence

4. Devising Remedy

5. Summary

Tweet to #UKSG15

Page 5: Reference Rot: Threat and Remedy

When what was referenced & cited ceases to say the same thing, or ‘has ceased to be’

http://www.snorgtees.com/this-parrot-has-ceased-to-be

1. The Threat of Reference Rot

“when links to web resources

no longer point to what they once did”

Reference Rot = Link Rot + Content Drift

Page 6: Reference Rot: Threat and Remedy

Link Rot

‘Link Rot’

Page 7: Reference Rot: Threat and Remedy

+ Content Drift: what is at the end of the URI has changed, or gone!

[Figure: http://dl00.org as seen in 2000, 2004, 2005 and 2008]

(a) Dynamic content: values on a web page change over time

(b) Static content: but very different (often unrelated) web pages

Page 9: Reference Rot: Threat and Remedy

Take a landmark publication, 10+ years ago

Page 10: Reference Rot: Threat and Remedy

Few of those references to the Web now work as intended

A re-direct [from RLG to OCLC] but ‘content drift’

Page 11: Reference Rot: Threat and Remedy

Few of those references to the Web now work as intended

A re-direct [from RLG to OCLC] but ‘content drift’

Fail !!

Page 12: Reference Rot: Threat and Remedy

Reference no longer works: ‘link rot’

Fail !!

Page 13: Reference Rot: Threat and Remedy

Reference no longer works: ‘link rot’

Fail !!

Page 14: Reference Rot: Threat and Remedy

A re-direct but content not found

Page 15: Reference Rot: Threat and Remedy

A re-direct but content not found

Fail !!

Page 16: Reference Rot: Threat and Remedy

Successful link: URI worked as expected in December 2014

Page 17: Reference Rot: Threat and Remedy

But sadly, now does not

Fail !!

Page 18: Reference Rot: Threat and Remedy

Successful link: URI works as expected

Page 19: Reference Rot: Threat and Remedy

Classic link rot: ‘Page Not Found’

Fail !!

Page 20: Reference Rot: Threat and Remedy

Reference to the Web is to an e-journal that is still current

Page 21: Reference Rot: Threat and Remedy

Classic link rot: ‘Page Not Found’

Fail !!

Page 22: Reference Rot: Threat and Remedy

URI works but content drift: reference is not as intended

Fail !!

Page 23: Reference Rot: Threat and Remedy

=> Content of Citations Rot over Time!!

Page 24: Reference Rot: Threat and Remedy

… meaning rotten references for the reader

Page 25: Reference Rot: Threat and Remedy

… in what is then a rotten article!

… & sale of rotten goods & undermining the integrity of the scholarly record

Page 27: Reference Rot: Threat and Remedy

Hiberlink Project Methodology

to discover the answer to a two-part question

Do references to web-based content (URIs) work?

• Focus on content on ‘the wild Web’

• not that which is in e-journals etc

i. Impact of Time: Is the URI still on the ‘Live Web’?

• Allowed up to a maximum of 50 redirects

ii. Is a ‘Memento’ of that content in the ‘Archived Web’?

Memento: a prior version, what the Original Resource was like at some time in the past.
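Part (i) of the question can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's crawler: the function name `resolve`, the injected `fetch` callable and the verdict strings are hypothetical; only the cap of 50 redirects is taken from the slide.

```python
from typing import Callable, Optional, Tuple

# fetch(uri) -> (HTTP status code, Location header or None).
# Injected so the sketch runs without network access (an assumption of this example).
Fetch = Callable[[str], Tuple[int, Optional[str]]]

MAX_REDIRECTS = 50  # the cap stated on the slide
REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve(uri: str, fetch: Fetch, max_redirects: int = MAX_REDIRECTS) -> Tuple[str, str]:
    """Follow redirects from `uri` and return (verdict, final_uri).

    verdict is "live" for a 2xx answer, "link rot" for any other final
    answer, and "too many redirects" when the cap is exceeded.
    """
    current = uri
    # One initial request plus at most `max_redirects` follow-up requests.
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        if status in REDIRECT_CODES and location:
            current = location
            continue
        if 200 <= status < 300:
            return "live", current
        return "link rot", current
    return "too many redirects", current
```

A "live" verdict only answers part (i): the URI still resolves. It says nothing about content drift, which needs a comparison with a Memento of the content as it was when referenced.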

Page 28: Reference Rot: Threat and Remedy

3. Large-scale Empirical Evidence

c. 400,000 articles across the three corpora (Row #5 in Table 2) contained over a million ‘web at large’ references (Row #4 in Table 3)

Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253

Page 29: Reference Rot: Threat and Remedy

A Key Aspect of Hiberlink Project Methodology

1. Convert Scholarly Statement from PDF into XML

2. Locate the references & extract each and every URL

• Many technical challenges

• URL broken across a newline; underscore rendered as an image

• Use up to 15 regular expressions for matching; regard each match as a URI

University of Edinburgh Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
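The extraction step can be illustrated with a deliberately small sketch. It is not the Language Technology Group's code: it uses one simplified pattern rather than the up to 15 expressions mentioned above, and the rule for re-joining a URL broken across a newline (glue on the next line's first token if it contains a "/") is a crude assumption made for this example. The names `URL_RE` and `extract_urls` are hypothetical.

```python
import re
from typing import List

# One simplified pattern (assumption); the slide mentions up to 15 expressions.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>]+")
TRAILING_PUNCTUATION = ".,;:)]}'\""


def extract_urls(reference_text: str) -> List[str]:
    """Return the URLs found in the text of a reference list."""
    lines = [line.rstrip() for line in reference_text.splitlines()]
    urls = []
    for i, line in enumerate(lines):
        for match in URL_RE.finditer(line):
            url = match.group(0)
            # A URL that runs to the very end of a line may have been wrapped.
            if match.end() == len(line) and i + 1 < len(lines):
                tokens = lines[i + 1].split()
                head = tokens[0] if tokens else ""
                # Crude heuristic: a continuation looks like a path fragment
                # and is not itself the start of a new URL.
                if "/" in head and not URL_RE.match(head):
                    url += head
            urls.append(url.rstrip(TRAILING_PUNCTUATION))
    return urls
```

The heuristic will mis-join some lines and miss others, which is one reason a real parser needs many more rules than this.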

Page 30: Reference Rot: Threat and Remedy

Scholarly Articles [in PMC] increasingly link to

Web Resources, not just back to other Articles

Page 31: Reference Rot: Threat and Remedy

Scholarly Articles [in Elsevier] increasingly link to

Web Resources, not just back to other Articles

Page 32: Reference Rot: Threat and Remedy

Mementos for URIs archived within 14 days of being referenced

PMC corpus

Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253

6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today), Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
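The 14-day criterion amounts to a simple date comparison. A minimal sketch follows, assuming the window counts days on either side of the date of reference (the slide does not say which) and that the Memento dates have already been collected from the archives; the name `has_memento_in_window` is hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable

WINDOW = timedelta(days=14)  # the window stated on the slide


def has_memento_in_window(referenced_on: date,
                          memento_dates: Iterable[date],
                          window: timedelta = WINDOW) -> bool:
    """True if any Memento was captured within `window` of the reference date.

    Assumption: captures before and after the reference date both count.
    """
    return any(abs(captured - referenced_on) <= window for captured in memento_dates)
```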

Page 33: Reference Rot: Threat and Remedy

Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253

Mementos for URIs archived within 14 days of being referenced

Elsevier corpus

6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today), Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive

Page 34: Reference Rot: Threat and Remedy

4. Devising Remedy for Reference Rot

Tweets on #hiberlink to

#UKSG15

Page 35: Reference Rot: Threat and Remedy

The Remedy Is Quick Freeze & Archive

Page 36: Reference Rot: Threat and Remedy

3 workflows in scholarly statement

① Preparation -> Study -> Compose -> Submission

② Publication -> Editing -> (Revision) -> Acceptance -> Issue

③ Post-Publication -> Deposit/Ingest -> Reader Access -> Use

To identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’

Identify the Actors & how to help them do the right thing!

Page 37: Reference Rot: Threat and Remedy

Ideally at the earliest moment of capture

Page 38: Reference Rot: Threat and Remedy

… when the Authors are trawling for content

Page 39: Reference Rot: Threat and Remedy

… for what an Author regards as significant

Page 40: Reference Rot: Threat and Remedy

… or needs to provide as evidence

Page 41: Reference Rot: Threat and Remedy

Hiberlink Remedy: Components in a Robust Link

… re-factoring the HTML link that is returned

a) Take simple URI - to article in New Yorker magazine (say)

• http://www.newyorker.com/magazine/2015/01/26/cobweb

b) Augment Link with Datetime and Archive URI

• Archive timestamp: 2015-02-19T09:46:36

• http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb

Hiberlink.org
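The archive URI in this example is the original URI prefixed with the archive's address and the capture time written as 14 digits. A small sketch of that composition, assuming the Internet Archive pattern visible in the example (other archives use other patterns); the function name `wayback_uri` is hypothetical.

```python
from datetime import datetime

# Prefix as it appears in the slide's example (Internet Archive).
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_uri(original_uri: str, archived_at: datetime) -> str:
    """Compose an archive URI from the original URI and the capture datetime."""
    stamp = archived_at.strftime("%Y%m%d%H%M%S")  # e.g. 20150219094636
    return f"{WAYBACK_PREFIX}{stamp}/{original_uri}"


archived_at = datetime.fromisoformat("2015-02-19T09:46:36")
print(wayback_uri("http://www.newyorker.com/magazine/2015/01/26/cobweb", archived_at))
```

Run as written, this prints the archive URI shown on the slide.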

Page 42: Reference Rot: Threat and Remedy

What Robust Hiberlinks look like

• Hiberlinks are modified <a> HTML elements

• Include archive URL and timestamp as additional attributes

<a href="http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versionurl="http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versiondate="2015-02-19T09:46:36">Cobweb Article</a>
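Because the extra information sits in ordinary `data-` attributes, any HTML parser can read it back. The sketch below uses Python's standard `html.parser` to collect links carrying the two attributes named above; the class and function names are hypothetical and this is an illustration, not Hiberlink software.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class RobustLinkParser(HTMLParser):
    """Collect <a> elements that carry data-versionurl or data-versiondate."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)  # attribute names arrive lower-cased
        if "data-versionurl" in attributes or "data-versiondate" in attributes:
            self.links.append({
                "href": attributes.get("href"),
                "versionurl": attributes.get("data-versionurl"),
                "versiondate": attributes.get("data-versiondate"),
            })


def robust_links(html_text: str) -> List[Dict[str, Optional[str]]]:
    parser = RobustLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A reader-side tool could use the result to offer the archived version whenever the `href` no longer works as intended.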

Page 43: Reference Rot: Threat and Remedy

Help authors do the right thing:

① Triggering archiving of referenced web content when it is noted, using a reference manager, e.g. EndNote, Reference Manager, Zotero

– Hiberlink Plug-in developed for Zotero

② Returns Datetime URI for archived content that can be used in the citation

Remedy To Avoid Reference Rot

https://www.zotero.org/

Page 44: Reference Rot: Threat and Remedy

Zotero workflow

Create reference

Add URL

Update URL

Duplicate reference

Pass URL to archive service

Receive archive URL

Store data in database

Add data to reference

Page 45: Reference Rot: Threat and Remedy

Using the Plugin in Zotero

Opportunity / Time for a Demo ?

Page 46: Reference Rot: Threat and Remedy

So what should we expect of the Publisher?

Beyond the assurance that the fish / references / articles sold are not rotten

Page 47: Reference Rot: Threat and Remedy

Help Publishers do the right thing

The next best opportunity for Quick Freeze

• to avoid reference rot & to ‘stop the rot’

① Study: Preparation -> (Review) -> Submission

② Publication: Editorial -> (Revision) -> Issue

③ Post-Publication: Deposit/Ingest -> Provide/Access -> Use

Actors:

① The Author

② The Editor / Publisher

③ The Access Platform / Librarian / Archival Organisations

Page 48: Reference Rot: Threat and Remedy

OJS plugin

1. Parses the document

• Converts .pdf to .html

• Extracts URIs

2. Archives the content for each reference

• The Author and Editor can choose which version is used as the archival copy

3. Creates an HTML version of the document

• including a link to the archived version of each of the references
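Step 3 can be pictured as a rewrite of the document's links once steps 1 and 2 have produced an archived version for each URI. The sketch below is an assumption-laden illustration, not the OJS plugin's code: it handles only `<a href="...">` written with `href` first and double quotes, and the name `augment_links` and the `archived` mapping are hypothetical.

```python
import re
from html import escape, unescape
from typing import Dict, Tuple

# Matches only the simple form <a href="..."; a real tool would use an HTML parser.
A_HREF_RE = re.compile(r'<a\s+href="([^"]+)"')


def augment_links(html_text: str, archived: Dict[str, Tuple[str, str]]) -> str:
    """Add data-versionurl / data-versiondate to links that have an archived copy.

    `archived` maps an original URI to (archive URI, capture datetime string).
    """
    def add_attributes(match):
        uri = unescape(match.group(1))
        if uri not in archived:
            return match.group(0)  # leave links without an archived copy untouched
        version_url, version_date = archived[uri]
        return (
            f'{match.group(0)}'
            f' data-versionurl="{escape(version_url, quote=True)}"'
            f' data-versiondate="{escape(version_date, quote=True)}"'
        )

    return A_HREF_RE.sub(add_attributes, html_text)
```

The original `href` is left unchanged, so the link keeps pointing at the live web while also recording where and when a copy was archived.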

Page 49: Reference Rot: Threat and Remedy

Well Published References & Augmented Links

Page 50: Reference Rot: Threat and Remedy

Post-Publication (& other bulk processing)

The last ‘best’ opportunity for Quick Freeze

• not to avoid reference rot but to ‘stop the rot’

① Study: Preparation -> (Review) -> Submission

• Should note & act for each URI, one by one

② Publication: Editorial -> (Revision) -> Issue

• (Probably) should examine each one by one

③ Post-Publication: Deposit/Ingest

• Cannot hope to process one by one

Page 51: Reference Rot: Threat and Remedy

Post-Publication (& other bulk processing)

The last ‘best’ opportunity for Quick Freeze

• not to avoid reference rot but to ‘stop the rot’

① Study: Preparation -> (Review) -> Submission

• Should note & act for each URI, one by one

② Publication: Editorial -> (Revision) -> Issue

• (Probably) should examine each one by one

③ Post-Publication: Deposit/Ingest

• Cannot hope to process one by one

Actors:

① The Author

② The Editor / Publisher

③ Access Platforms / Archival Organisations / Librarians

Page 52: Reference Rot: Threat and Remedy

& each article contains many references

Page 53: Reference Rot: Threat and Remedy

Recall Key Aspect of Hiberlink Methodology

1. Convert Scholarly Statement from PDF into XML

2. Locate the references & extract each and every URL

• Many technical challenges

• URL broken across a newline; underscore rendered as an image

• Use up to 15 regular expressions for matching; regard each match as a URI

=> Edinburgh Parser [github.com/hiberlink]

University of Edinburgh Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou

Page 54: Reference Rot: Threat and Remedy

Time to Build Infrastructure:

HiberActive

[Diagram: Publishing platform; HiberActive; External archival service (e.g. Internet Archive)]

• Asynchronous (returns Robust Link)

• Distributed (archived with different organisations)

• Lightweight (leveraging HTTP & what already exists)
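The first two properties can be pictured with a short `asyncio` sketch. Everything here is an assumption for illustration: the slides do not describe HiberActive's interfaces, the function names are hypothetical, and `archive_with` is a stand-in that makes no network request.

```python
import asyncio
from typing import Dict, List


async def archive_with(archive_name: str, uri: str) -> Dict[str, str]:
    """Stand-in for a request asking one archive to capture `uri`."""
    await asyncio.sleep(0)  # placeholder for network latency
    return {"archive": archive_name, "uri": uri}


async def archive_everywhere(uri: str, archives: List[str]) -> List[Dict[str, str]]:
    # Distributed: the same URI is sent to several archival organisations.
    return list(await asyncio.gather(*(archive_with(name, uri) for name in archives)))


async def process_submission(uris: List[str], archives: List[str]) -> Dict[str, list]:
    # Asynchronous: all URIs are handled concurrently, so the publishing
    # platform is not blocked on any single archive.
    per_uri = await asyncio.gather(*(archive_everywhere(uri, archives) for uri in uris))
    return dict(zip(uris, per_uri))
```

Calling `asyncio.run(process_submission(uris, archives))` returns, for each URI, one record per archive; in a real service each record would carry what is needed to build a Robust Link.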

Page 56: Reference Rot: Threat and Remedy

Hiberlink Outcomes

1. Defined the Threat of Reference Rot

2. Quantified the extent and way in which it exists & undermines the Scholarly Record

3. Pointed to potential & practical Remedy

Page 57: Reference Rot: Threat and Remedy

Hiberlink Outcomes

1. Defined the Threat of Reference Rot

2. Quantified the extent and way in which it exists & undermines the Scholarly Record

3. Pointed to potential & practical Remedy

As the project comes to an end (June 2015), we wish to:

• Tell the world about these achievements

• Engage with others

– to build infrastructure

– to prompt adoption (copying) of prototypes by 3rd parties

• such as reference managers, editorial systems, publication systems, archival systems

Page 58: Reference Rot: Threat and Remedy

Thank you,

Questions welcome

http://hiberlink.org #hiberlink

Email: [email protected]

Still Time to Tweet to #UKSG15

Funded by the Andrew W. Mellon Foundation