Transcript
Page 1: Economic Sustainability of Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries

Economic Sustainability of Digital Preservation

David S. H. Rosenthal

LOCKSS Program
Stanford University Libraries

http://www.lockss.org/
http://blog.dshr.org/

© 2014 David S. H. Rosenthal


Journals move to the Web

● Access for current readers better:
  ● Links, search, data spreadsheets behind graphs, ...
  ● No need to go to the library

● Access for future readers worse:
  ● Rental, not purchase: no rent payment, no access
  ● Not many copies, but a single copy on short-lived, rewritable media


Paper Libraries

● Interesting example of fault-tolerance:
  ● Loosely-coupled network of many independent peers
  ● Each storing a selection of available content
  ● On durable, somewhat tamper-evident media
  ● Market in copies, fewer copies → more care

● Easy to find a copy, hard to find all copies
● Inter-library loan & copy to repair loss or damage


LOCKSS Program

● LOCKSS box acts as persistent Web cache:
  ● Crawls Web to pre-load with subscribed content
  ● If readers can't get the publisher copy, they get the library copy
  ● Boxes cooperate to detect, repair loss & damage

● Timeline:
  ● 1998 NSF funded prototype
  ● 1999 NSF, Sun funded alpha: 1 journal, 15 boxes
  ● 2000 Mellon, Sun funded beta: ~40 libraries
  ● 2004 Production
  ● 2005 Mellon matching grant
  ● 2007 Sustainability!


LOCKSS: Businesses

● Develop & support use of LOCKSS software:
  ● Free & open-source, but pay for support (cf. Red Hat)
  ● ~150 libraries using the software

● Under contract, run CLOCKSS network:
  ● Dark archive of e-journals & e-books
  ● Not-for-profit managed jointly by publishers and libraries
  ● 12 nodes worldwide
  ● Content triggered if unavailable from any publisher, released under CC license
  ● Certified “Trustworthy Repository” - score 13/15
  ● Technologies, Technical Infrastructure, Security – 5/5


The Half-Empty Archive

● E-journals: less than half preserved
  ● ARL vs. Keepers: ~40% of serials preserved
  ● Faria et al.: <50% of serials preserved

● Public web pages - Ainsworth et al.:
  ● Search engine sampled URLs: ~2/3 preserved
  ● Bit.ly random URLs: ~1/3 preserved

● Choices:
  ● Do nothing
  ● Double the budget
  ● Halve the cost per unit content


Cost Data?

● Lots of research into preservation costs:
  ● CMDP, LIFE, KRDS, PrestoPrime, ENSURE, ...
  ● Serious lack of usable data
  ● Inconsistent accounting, hidden costs, content variability

● My rule of thumb summarizing the research:
  ● Ingest 1/2, preservation 1/3, access 1/6 of lifetime cost

● 4C project - please submit cost data to:
  ● http://www.4cproject.eu/
  ● Curation Cost Exchange
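
The rule of thumb above can be applied to a budget as a quick sanity check. The sketch below is only an illustration: the function name and the 600-unit budget are hypothetical, and the three shares are the only values taken from the slide.

```python
from fractions import Fraction

# Shares of lifetime cost from the rule of thumb on this slide.
LIFETIME_SHARES = {
    "ingest": Fraction(1, 2),
    "preservation": Fraction(1, 3),
    "access": Fraction(1, 6),
}


def split_lifetime_cost(total_cost):
    """Split a lifetime budget across the three phases (hypothetical helper)."""
    return {phase: float(total_cost * share)
            for phase, share in LIFETIME_SHARES.items()}


if __name__ == "__main__":
    # 600 is an arbitrary illustrative budget in any currency unit.
    for phase, cost in split_lifetime_cost(600).items():
        print(f"{phase}: {cost:.0f}")
```

Using exact fractions makes it easy to confirm that the three shares account for the whole lifetime cost.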


Kryder's Law

● Bit density on disk platters:
  ● Doubles every 18 months

● Thus $ per GB:
  ● Drops 30-40% per year

● If you can afford to store stuff for a few years:
  ● You can afford to store it forever
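
The arithmetic behind this slide can be sketched as follows. The assumptions here are mine, not the slide's: $ per GB falls exactly in step with rising bit density, the yearly cost of keeping a fixed collection is proportional to $ per GB, and interest and all non-media costs are ignored. The function names are hypothetical.

```python
def annual_price_drop(doubling_months):
    """Fractional yearly drop in $/GB if density doubles every `doubling_months`."""
    return 1 - 2 ** (-12 / doubling_months)


def forever_cost_multiple(annual_drop):
    """Cost of storing a fixed collection forever, as a multiple of year-one cost.

    Yearly costs form a geometric series 1 + r + r^2 + ... with r = 1 - annual_drop,
    which sums to 1 / annual_drop.
    """
    if not 0 < annual_drop < 1:
        raise ValueError("annual_drop must be strictly between 0 and 1")
    return 1 / annual_drop


if __name__ == "__main__":
    drop = annual_price_drop(18)
    print(f"annual drop in $/GB: {drop:.1%}")
    print(f"forever costs {forever_cost_multiple(drop):.2f}x the first year")
```

Under these assumptions an 18-month doubling gives a drop of about 37% per year, inside the slide's 30-40% range, and storing forever costs about 2.7 times the first year. The multiple grows quickly as the annual drop shrinks, which is why the conclusion depends on Kryder's Law continuing to hold.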


[Figure not reproduced in transcript] Source: Preeti Gupta, SSRC, UC Santa Cruz


Stored Safe in the Cloud?

● Cloud storage sold as “cheaper”:
  ● If all charges accounted for, not cheaper for preservation
  ● It's made of the same disks you use locally
  ● Economies of scale captured by the provider

● Cloud storage locks you in:
  ● Free to store, costs to access
  ● Changing providers slow, expensive – you will be gouged
  ● Not a competitive market – dominated by Amazon

● To avoid lock-in, must keep a copy yourself:
  ● To allow you to change providers without paying an arm and a leg


Blue Ribbon Task Force

● Sustainable Digital Preservation & Access:
  ● 2-year study, report in 2010
  ● NSF, Mellon, Library of Congress, JISC, CLIR, NARA

● Preservation has to be justified by access:
  ● D'oh!
  ● Dark archives (e.g. CLOCKSS) hard to sustain
  ● Scholars don't like to pay for access to data, and have no budget for it


“Big Data”

● Research on past access to archives:
  ● Rare, sparse, except for integrity checks
  ● “Cold” data

● Future access will be different:
  ● Scholars want to data-mine from archive collections
  ● Access much more intense, expensive
  ● Data “warm” to “hot”

● How much more expensive?
  ● Compare S3 (warm) vs Glacier (cold)
  ● S3 2.5 times more expensive


Cloud For Access?

● Cloud ideal for data-mining from collections:
  ● Spiky demand
  ● Charging mechanism

● Amazon Free Public Datasets:
  ● No charge to data owner
  ● Amazon charges readers for compute they use for access

● Library of Congress & Twitter feed (public):
  ● Store copy in Amazon Reduced Redundancy Storage
  ● Charge scholars for access to pay storage cost of copy
  ● Scholars pay Amazon for compute to access copy


Sustaining Open Source Preservation

● Open source essential for preservation:
  ● No “just trust me” like closed-source encryption
  ● … or cloud storage

● Niche market – not like Linux, Apache, ...:
  ● No foundation with large industry sponsors
  ● Red Hat needs frequent, visible upgrades to motivate $
  ● Hard to devote resources to infrastructure improvements

● Mellon recognizes this problem:
  ● Small grant for infrastructure
  ● AJAX crawler, Shibboleth support, protocol improvements



A Petabyte for a Century

● Black Box:
  ● Put PB in, wait 100yrs, take PB out
  ● Whatever media, replication, algorithms you like inside
  ● 50% chance every bit undamaged

● This defines bit half-life:
  ● Approx 60M times the age of the Universe
  ● No feasible benchmark of adequate reliability

● Stuff will get lost or damaged:
  ● Only question is “how much damage for how many $?”
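
The "60M times the age of the Universe" figure can be checked with a short calculation. This is a sketch under assumptions the slide does not spell out: a petabyte is taken as 10^15 bytes, bits decay independently with a common half-life, and the age of the Universe is taken as roughly 1.38e10 years. The function name is hypothetical.

```python
import math

PETABYTE_BITS = 8 * 10**15          # assumes 1 PB = 10**15 bytes
AGE_OF_UNIVERSE_YEARS = 1.38e10     # assumed approximate value


def required_bit_half_life(bits, years, survival_probability=0.5):
    """Half-life each bit needs so that all `bits` survive `years` with the given probability.

    With independent decay, P(all survive) = 2 ** (-bits * years / half_life).
    Solving for half_life gives bits * years / -log2(P).
    """
    return bits * years / -math.log2(survival_probability)


if __name__ == "__main__":
    half_life = required_bit_half_life(PETABYTE_BITS, 100)
    ratio = half_life / AGE_OF_UNIVERSE_YEARS
    print(f"required bit half-life: {half_life:.1e} years")
    print(f"that is about {ratio / 1e6:.0f} million times the age of the Universe")
```

Under these assumptions the required half-life is 8e17 years, which is far too long to measure directly. That is the sense in which no feasible benchmark can demonstrate adequate reliability.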



Threat Model

Media failure

Hardware failure

Software failure

Network failure

Obsolescence

Natural Disaster

Operator error

External Attack

Insider Attack

Economic Failure

Organization Failure



Is More Reliable Better?

● Two systems, same budget for a decade:
  ● A) zero loss rate
  ● B) 1%/yr loss rate, 50% less $/yr than A per unit content
  ● B's loss rate is clearly unacceptable

● After a decade:
  ● B preserves 1.89 times as much at the same cost

● After 3 decades:
  ● B preserves more than 5 times as much
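
The decade figure can be reproduced with a simple model. The model is my assumption, not something stated on the slide: on the same budget B ingests twice as much as A each year, and at the end of every year B loses 1% of everything it holds, including that year's ingest. The function name is hypothetical, and the three-decade figure is not modeled here because it depends on assumptions the slide does not state.

```python
def preserved_after(years, ingest_per_year, annual_loss_rate):
    """Content still intact after `years` of steady ingest with a yearly loss rate."""
    preserved = 0.0
    for _ in range(years):
        # Ingest this year's content, then lose a fraction of everything held.
        preserved = (preserved + ingest_per_year) * (1 - annual_loss_rate)
    return preserved


if __name__ == "__main__":
    # A: one unit per year, no loss. B: half the cost per unit, so two units per year.
    system_a = preserved_after(10, ingest_per_year=1.0, annual_loss_rate=0.0)
    system_b = preserved_after(10, ingest_per_year=2.0, annual_loss_rate=0.01)
    print(f"A preserves {system_a:.2f} units, B preserves {system_b:.2f} units")
    print(f"B / A = {system_b / system_a:.2f}")
```

With these assumptions the ratio after ten years comes out at 1.89, matching the slide. The point of the comparison is that a cheaper system with a visibly nonzero loss rate can still end up preserving more content per dollar.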


The Good News

● Sustainable digital preservation possible:
  ● LOCKSS is an example


The Bad News

● Expectations way out of line with reality:
  ● Can't preserve as much as people assume is being preserved
  ● Nor as reliably as people assume it is being preserved

● Mismatch will get worse:
  ● Expect lots more data, no more money
  ● People expect costs to drop rapidly; experts say slowly, if at all

● Technology won't save us:
  ● Research data, libraries, archives are a niche market
  ● Hard problems, no big payoff for solution, so little research
  ● Build systems from stuff designed to do something else