by D. Giaretta (APARSEN), presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October, 17, 2014. More information about the workshop at: prelida.eu
State of the Art
SUMMARY OF D3.1 STATE OF THE ART
D GIARETTA
Outline
◦ Preservation – State of the Art
◦ Challenges for Linked Data
◦ Options
◦ Conclusions
EC policy – a brief history – a personal view
EC support for DP research for creating digital objects
Data Digitisation
e-Infrastructure to Digital Agenda
National funding
◦ Significantly more than EC funding
◦ What is the EC role?
DP research: approx 100M€ from EC
From Research on Digital Preservation within projects co-funded by the European Union in the ICT programme, 2011, Stephan Strodl et al http://cordis.europa.eu/fp7/ict/creativity/report-research-digital-preservation_en.pdf
Situation now
The digital preservation community has failed to persuade the EC that there is a need for more funding for DP research
◦ We do not have a consistent story about:
  ◦ Costs
  ◦ Rights
  ◦ Methods etc
  ◦ “Emulate or Migrate” is inadequate!
  ◦ Who is doing it right
Luxembourg unit which previously funded DP research – name changed to “Creativity” - now shows no funding for digital preservation research
EC expects results from the previous 100 M € research by deploying solutions
Digital Preservation – some quotes:
Head of unit funding the Digital Preservation projects asked repeatedly:
◦ “Who pays and why?”
NSF colleague:
◦ “Digital preservation is like VAT – people don’t like it”
Value pyramid
From Riding the Wave
“The Digital Agenda for Europe outlines policies and actions to maximise the benefit of the digital revolution for all. Supporting research and innovation is a key priority of the Agenda, essential if we want to establish a flourishing digital economy.”
Neelie Kroes,
Vice-President of the EC, responsible for the Digital Agenda
Data is the new gold.
“We have a huge goldmine… Let’s start mining it.”
Neelie Kroes
That is the magic to find value amid the mass of data. The right infrastructure, the right networks, the right computing capacity and, last but not least, the right analysis methods and algorithms help us break through the mountains of rock to find the gold within.
……but
Gold is precious because
◦ it is rare
◦ it does not combine with other elements
◦ it does not perish
……..but……….
Data is valuable because
◦ there is so much of it
◦ it is more valuable when it is combined together
◦ BUT it is far from imperishable
Role for Linked Data
OR
Preservation – State of the Art
Problems when preserving data
Preserve?
Preserve what?
For how long?
How to test?
Which people?
Which organisations?
How well?
• Metadata? – What kind? How much?
Difficulties in digital preservation
Many different terminologies
Many different views of preservation
Many different kinds of digital objects
◦ Documents
◦ Data
◦ …… and new types of objects
Tools and Services
◦ Which ones work for which digital objects?
◦ Which tools/techniques fit together?
◦ How to integrate new tools?
Consistent training needed
Risks vs Cost
Who can you trust?
Need a consistent, coherent approach to digital preservation – APARSEN.
Need an Audit and Certification system – ISO 16363
OAIS – ISO 14721
Preservation techniques
For each technique
◦ look for evidence – what evidence?
◦ must at least make sure we consider different types of data
  ◦ rendered vs non-rendered
  ◦ composite vs simple
  ◦ dynamic vs static
  ◦ active vs passive
must look at all types of threats
Basic preservation activities
Libraries say: “Emulate or migrate”
◦ Works well with data only in special cases
◦ Can repeat what was done before, rather than do new things
◦ Does not help with building cross-disciplinary communities
Emulate:
• Can repeat what has been done before
• BUT cannot use new applications
Migrate:
• Convert to a format which new software can use
• BUT what if there are many software systems?
Contains numbers – need meaning
...to be combined and processed to get this
[Figure: data at Level 0, Level 1 and Level 2, linked by “Processing” and “Processing/combining” steps]
...or this
OAIS Information model: Representation Information
The Information Model is key
Recursion ends at the KNOWLEDGE BASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region)
Does not demand that ALL Representation Information be collected at once.
A process which can be tested
[Figure: Rep Info Network. Nodes: FITS file, FITS dictionary, FITS standard, FITS Java software, PDF software, PDF standard, Java VM, dictionary specification, XML specification, Unicode specification]
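The recursion in the Rep Info Network can be illustrated with a minimal Python sketch. The node names come from the figure, but the dependency edges, the `REP_INFO_NETWORK` table and the function `rep_info_to_collect` are assumptions made for illustration, not the edges drawn on the slide. The walk stops wherever the Designated Community's knowledge base already covers an item.

```python
from typing import Dict, List, Set

# Hypothetical dependency edges: each item maps to the Representation
# Information needed to understand it. Node names follow the slide;
# the edges themselves are assumed for illustration only.
REP_INFO_NETWORK: Dict[str, List[str]] = {
    "FITS file": ["FITS standard", "FITS dictionary", "FITS Java software"],
    "FITS standard": ["PDF standard", "PDF software"],
    "FITS dictionary": ["Dictionary specification"],
    "Dictionary specification": ["XML specification"],
    "XML specification": ["Unicode specification"],
    "FITS Java software": ["Java VM"],
    "PDF standard": [],
    "PDF software": [],
    "Java VM": [],
    "Unicode specification": [],
}


def rep_info_to_collect(item: str, knowledge_base: Set[str]) -> List[str]:
    """List the RepInfo still needed for `item`, in depth-first order.

    Anything already in the Designated Community's knowledge base is
    skipped, and its own dependencies are not followed.
    """
    needed: List[str] = []
    seen: Set[str] = set()

    def visit(node: str) -> None:
        for dep in REP_INFO_NETWORK.get(node, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in knowledge_base:
                continue  # recursion ends at the knowledge base
            needed.append(dep)
            visit(dep)

    visit(item)
    return needed
```

Because the knowledge base is a parameter, the same network can be re-evaluated as the community's knowledge changes over time, which is one way of making the process testable.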
Additional technique: add Representation Information
Descriptions of the digitally encoded object
Ideal description allows a machine to extract information
Migration
OAIS defines various types of Migration:
◦ Do not change the bits
  ◦ Refresh
  ◦ Replicate
◦ Change the packaging but not the content
  ◦ Repackage
◦ Change the content
  ◦ Transform (usually non-reversible)
  ◦ Need to consider “Transformational Information Properties” – important for AUTHENTICITY
  ◦ Related to “Significant properties”
  ◦ Add appropriate Representation Information for the new format
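The distinction between these migration types can be sketched as a fixity comparison. This is an illustrative Python sketch, not part of OAIS: the `Package` class, its `wrapper` and `content` fields and the function `classify_migration` are hypothetical names, and a real repository would compare far richer packaging structures.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    wrapper: bytes  # packaging bits (hypothetical field)
    content: bytes  # content bits (hypothetical field)


def digest(data: bytes) -> str:
    """SHA-256 digest used as a simple fixity check."""
    return hashlib.sha256(data).hexdigest()


def classify_migration(before: Package, after: Package) -> str:
    """Classify a migration by which bits changed."""
    if digest(before.content) != digest(after.content):
        # Content changed: authenticity evidence and new RepInfo are needed.
        return "transform"
    if digest(before.wrapper) != digest(after.wrapper):
        # Only the packaging changed.
        return "repackage"
    # No bits changed; a bit comparison cannot tell these two apart.
    return "refresh or replicate"
```

Only the "transform" outcome raises the authenticity questions listed above; the other two outcomes can be checked by bit comparison alone.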
AND – be prepared to hand over
Preservation requires funding
Funding for a dataset (or a repository) may stop
Need to be ready to hand over everything needed for preservation
◦ OAIS (ISO 14721) defines the “Archival Information Package” (AIP)
◦ Issues:
  ◦ Storage naming conventions
  ◦ Representation Information
  ◦ Provenance
  ◦ ….
Preserving digitally encoded information
Ensure that digitally encoded information is understandable and usable over the long term
Long term could start at just a few years
Chain of preservation
Need to do something because things become “unfamiliar” over time
But the same techniques enable use of data which is “unfamiliar” right now
When things change, we need to:
◦Know something has changed
◦ Identify the implications of that change
◦Decide on the best course of action for preservation
◦ What RepInfo we need to fill the gaps
◦ Created by someone else, or newly created
◦ If transformed: how to maintain data authenticity
◦Alternatively: hand it over to another repository
◦Make sure data continues to be usable
SCIDIP-ES
[Figure: SCIDIP-ES services and toolkits. Services: Orchestration Service, Gap Identification Service, RepInfo Registry Service, Storage Service, Persistent ID i/f Service. Toolkits: Preservation Strategy Toolkit, Authenticity Toolkit, Packaging Toolkit, Data Virtualisation Toolkit, Process Virtualisation Toolkit, RepInfo Toolkit, Finding Aid Toolkit, Certification Toolkit. External elements: Cloud Storage, External Access/Use Services, External PI services, ISO Certification Organisation]
Services: run on remote servers
Toolkits: run on local machines
• These SUPPLEMENT what repositories do (customised for repositories)
• Make it easier for repositories to do preservation – share the effort
Preservation objectives
The same digital object may be preserved with different aims in mind by different repositories:
For a digital document
◦ Re-print the pages?
◦ Understand the numbers printed on the page in order to do further research?
For a piece of performance art
◦ Replay a recording of a particular performance?
◦ Re-perform the work?
For a scientific data file
◦ Understand the numbers?
◦ Understand the numbers in the context of a particular theory?
Preservation, Value and Re-use
(Re-)usability is the essential test for success of preservation
◦ Usability is usually essential for justifying the cost of preservation
Impossible to insist on common formats, semantics or software
◦ How to avoid the N² problem?
Impossible to know what formats, semantics or software will be used in future
Needs appropriate Representation Information
◦ for preservation (use in the future when things have become unfamiliar)
◦ for use now (use of unfamiliar data, i.e. most of it!)
◦ automated (re-)use as far as possible
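The N² problem can be made concrete with a small piece of arithmetic, sketched here in Python. The function names are hypothetical, and the comparison rests on an assumed reading of the slide: that one Representation Information description per format is the alternative to writing a converter between every pair of formats.

```python
def pairwise_converters(n: int) -> int:
    """Directed format-to-format converters needed among n formats."""
    return n * (n - 1)


def descriptions_needed(n: int) -> int:
    """One Representation Information description per format (assumed model)."""
    return n


if __name__ == "__main__":
    for n in (3, 10, 100):
        print(n, pairwise_converters(n), descriptions_needed(n))
```

The converter count grows quadratically with the number of formats while the description count grows linearly, which is why insisting on pairwise conversion does not scale.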
APARSEN is bringing together a coherent, consistent, evidence-based approach to digital preservation involving tools, services, consultancy and training.
Classification of objects
must at least make sure we consider different types of data
◦ rendered vs non-rendered
◦ composite vs simple
◦ dynamic vs static
◦ active vs passive
RDF Triple: dynamic/complex/non-rendered/passive
Key questions about what is to be preserved
What is the object to be preserved?
◦ The specific piece of RDF?
◦ The specific RDF plus data pointed to?
◦ The underlying database (if any)?
◦ The whole linked “world”?
What are the preservation objectives?
◦ The RDF and whole inference system?
◦ Just the RDF?
◦ Just the underlying database (if any)?
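The difference between preserving "the specific piece of RDF" and "the specific RDF plus data pointed to" can be illustrated with a short Python sketch. It models triples as plain string tuples rather than using an RDF library, and the function `unpreserved_targets`, the `is_uri` helper and the example URIs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def is_uri(term: str) -> bool:
    """Crude test: treat http(s) strings as URIs, everything else as literals."""
    return term.startswith(("http://", "https://"))


def unpreserved_targets(triples: Iterable[Triple], preserved: Set[str]) -> Set[str]:
    """URIs in object position that fall outside the preserved set."""
    return {o for (_, _, o) in triples if is_uri(o) and o not in preserved}


triples = [
    ("http://example.org/a", "http://example.org/p", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/label", "a literal"),
    ("http://example.org/a", "http://example.org/p", "http://other.example/c"),
]
print(unpreserved_targets(triples, {"http://example.org/a", "http://example.org/b"}))
```

Each URI the function returns marks a point where the preserved RDF depends on something outside the repository's control, which is where the choice of preservation objective matters.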
Key questions about RDF
What Representation Information is needed for the LD?
◦ Schema?
◦ Additional semantics?
◦ Evolution of links (e.g. replace this host by a new one)?
◦ Snapshots?
What Transformation?
◦ One version of RDF to another?
◦ Move to replacement for RDF?
◦ Change of underlying database?
◦ Authenticity??
Who to hand over to
◦ What to do with the URIs? – maintain or change?
◦ What to do with the underlying database (if any)?
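The "replace this host by a new one" case can be sketched with Python's standard `urllib.parse` module. The function name `replace_host` and the example hosts are hypothetical, and the sketch leaves open the harder question raised above of whether URIs should be changed at all.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(uri: str, old_host: str, new_host: str) -> str:
    """Rewrite the host part of `uri` if it matches `old_host`.

    Path, query and fragment are kept unchanged; URIs on other hosts
    are returned as they are.
    """
    parts = urlsplit(uri)
    if parts.netloc != old_host:
        return uri
    return urlunsplit(parts._replace(netloc=new_host))


print(replace_host("http://old.example.org/data/1#x", "old.example.org", "new.example.org"))
```

A repository that applies such a rewrite would presumably also need to record the old-to-new mapping as part of the evidence supporting authenticity.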
Key questions about the things the RDF points to
◦ Will they be preserved?
◦ How to find the Representation Information?
◦ Will the Persistent Identifiers change?
Joint Key Questions
◦ Who will pay, and why?
◦ For which things?
◦ Are some things more valuable – and therefore more likely to be preserved?
◦ What happens when some things disappear?
Options
◦ Be clear about what is meant
◦ Understand what is possible
◦ Start with what is agreed as valuable
◦ Don’t promise too much
Input to standards
◦ See http://www.iso16363.org
◦ Audit and Certification of Trustworthy repositories
◦ Forum: OAIS Futures
Conclusions
A great deal of funding (€100M) has been invested in digital preservation research by the EU
EC is not putting further funding into digital preservation research
There are technical challenges
The biggest challenge is to be clear about what the preservation aims are for Linked Data