SCAPE - Building Digital Preservation Infrastructure


Dr. Ross King, AIT Austrian Institute of Technology GmbH, gave an invited talk on the FP7 project SCAPE at the eSciDoc Days in Berlin on October 27, 2011 (https://www.escidoc.org/JSPWiki/en/ESciDocDays).


SCAPE - Building Digital Preservation Infrastructure

Dr. Ross King, AIT Austrian Institute of Technology GmbH

eSciDoc Days, Berlin, October 27, 2011

Digital Preservation

• For the first time, the growth rate of information creation is beginning to exceed the growth rate of storage capacity.

• This massive volume of digital material raises a number of issues:
  • What is worth preserving?
  • How to preserve so much?
  • How to access preserved data?
  • How to create incentives to preserve?


http://arstechnica.com/business/consumerization-of-it/2011/09/information-explosion-how-rapidly-expanding-storage-spurs-innovation.ars

Digital Preservation

• Standards, best practices, and technologies used to ensure access to digital information over time

• How long?

“Digital documents last forever – or five years, whichever comes first.”
Source: http://www.clir.org/pubs/reports/rothenberg/introduction.html

• Generally we mean decades or centuries


SCAPE – what is it about?

• Planning and managing compute-intensive digital preservation processes, such as the large-scale ingestion or migration of multi-terabyte data sets

SCAPE is a follow-up to the highly successful FP6 IP Planets.

SCAPE Project Data

• Project instrument: FP7 Integrated Project
  • Call 6

• Objective ICT-2009.4.1: Digital Libraries and Digital Preservation

• Target outcome (a): Scalable systems and services for preserving digital content

• Duration: 42 months
  • February 2011 – July 2014

• Budget: 11.3 million euro
  • Funded: 8.6 million euro

SCAPE Consortium

Number           Partner name                               Short name  Country
1 (coordinator)  AIT Austrian Institute of Technology GmbH  AIT         AT
2                British Library                            BL          UK
3                Internet Memory Foundation                 IMF         NL
4                Ex Libris Ltd                              EXL         IL
5                Fachinformationszentrum Karlsruhe          FIZ         DE
6                Koninklijke Bibliotheek                    KB          NL
7                KEEP Solutions                             KEEPS       PT
8                Microsoft Research                         MSR         UK
9                Österreichische Nationalbibliothek         ONB         AT
10               Open Planets Foundation                    OPF         UK
11               Statsbiblioteket Aarhus                    SB          DK
12               Science and Technology Facilities Council  STFC        UK
13               Technische Universität Berlin              TUB         DE
14               Technische Universität Wien                TUW         AT
15               University of Manchester                   UNIMAN      UK
16               Pierre & Marie Curie Université Paris 6    UPMC        FR

SCAPE Project Overview

SCAPE will enhance the state of the art in digital preservation in three ways:
  • Infrastructure and tools for scalable preservation actions
  • A framework for automated, quality-assured preservation workflows
  • Integration of these components with policy-based automated preservation planning and watch

SCAPE results will be validated in three large-scale testbeds:
  • Digital Repositories
  • Web Content
  • Research Data Sets

The SCAPE Consortium brings together a broad spectrum of expertise from:
  • Memory institutions
  • Data centres
  • Research labs
  • Universities
  • Industrial firms


SCAPE project structure (diagram)

• Preservation Components: Quality Assurance, Scalable Components, Automation-ready Tools

• Platform: Automation, Workflows, Parallelization, Virtualization

• Planning and Watch: Institutional Policies, Technical Watch, Automated Planning

• Testbeds: Corpora, Integration, Benchmarking, Validation

• Takeup: Stakeholders, Communities, Dissemination, Training Activities, Sustainability

• Cross-project Activities: Project Management, Technical Coordination, Research Roadmap

Selected SCAPE Testbed Scenarios

• Characterise large video files
  • The master MPEG2 files are so large that JHOVE is difficult to apply, and it provides insufficient detail. A detailed characterisation of the MPEG2 streams is needed to identify the technical dependencies for extracting from or rendering the MPEG2 stream. This would allow preservation risks related to current access services to be monitored, and action to be taken as necessary to ensure continued access and preservation.

• Carry out large-scale migrations
  • Migrating from one format to another introduces the possibility of damaging the content or failing to capture significant properties of the original in the destination format.
  • Specific requirements include:
    • Solution tools that operate reliably at scale (80 TB, 2 million pages)
    • Automated QA, ideally with no manual intervention on a file-by-file basis
    • QA performed by a process independent of the migration process
    • QA that provides strong evidence of significant properties being captured in the destination format

• Quality assurance in web harvesting
  • For large-scale crawls, automation of the quality control processes is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.
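The migration QA requirements above (automated, independent of the migration tool, and giving evidence that significant properties survive) could be sketched roughly as follows. This is only an illustration: `PageProperties`, `qa_check`, and `qa_report` are hypothetical names, the three properties are an assumed stand-in for whatever an institution defines as significant, and the property values are assumed to have been extracted separately from the source and destination files by a tool other than the migration tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PageProperties:
    """Hypothetical significant properties of one scanned page."""
    width_px: int
    height_px: int
    colour_channels: int


def qa_check(source: PageProperties, target: PageProperties,
             size_tolerance_px: int = 0) -> list[str]:
    """Compare source and migrated properties; an empty list means 'pass'."""
    problems = []
    if abs(source.width_px - target.width_px) > size_tolerance_px:
        problems.append(f"width {source.width_px} -> {target.width_px}")
    if abs(source.height_px - target.height_px) > size_tolerance_px:
        problems.append(f"height {source.height_px} -> {target.height_px}")
    if source.colour_channels != target.colour_channels:
        problems.append(
            f"channels {source.colour_channels} -> {target.colour_channels}")
    return problems


def qa_report(
    pairs: dict[str, tuple[PageProperties, PageProperties]],
) -> dict[str, list[str]]:
    """Check every migrated file and keep only those that failed."""
    report = {}
    for name, (src, dst) in pairs.items():
        problems = qa_check(src, dst)
        if problems:
            report[name] = problems
    return report
```

Because the report lists only failures, manual attention would be limited to the files that need it rather than spent file by file.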

from digitalbevaring.dk

Selected SCAPE Challenges

• Bridging the gap between test workflows and scalable workflows

• Applying Map/Reduce to binary data
  • Locality of data
    • Bring the data to the computation, or bring the computation to the data?

• Repository Integration
  • Repository Consistency
  • Scalable Ingest

• Preservation Planning
  • How to scale?
  • How to automate?

• Research data sets
  • How to preserve contextual information?


SCAPE Solutions

• SCAPE Platform
  • Hadoop, Stratosphere
  • Virtualized cluster
  • Repository integration
    • HBase, HDFS - Fedora

• Three levels of parallelization
  • Distribution of files
  • Splitting binary files
  • Parallelisation of algorithms

• Mapping Taverna to Hadoop
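As a rough illustration of the second level of parallelization listed above (splitting binary files), the sketch below applies a map/reduce pattern to a binary payload in plain Python. It does not use Hadoop or any SCAPE component: the thread pool merely stands in for cluster workers, and `split_chunks`, `map_chunk`, `reduce_counts`, and `byte_histogram` are hypothetical names chosen for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_chunks(data: bytes, chunk_size: int) -> list:
    """Split a binary payload into fixed-size chunks (last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_chunk(chunk: bytes) -> Counter:
    """Map step: count how often each byte value occurs in one chunk."""
    return Counter(chunk)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial histograms."""
    return left + right


def byte_histogram(data: bytes, chunk_size: int = 4, workers: int = 2) -> Counter:
    """Compute a byte histogram by mapping over chunks and reducing the results."""
    chunks = split_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())
```

A byte histogram is deliberately easy: its result does not depend on where the chunk boundaries fall. Container formats such as video streams generally cannot be cut at arbitrary offsets, which is one reason applying Map/Reduce to binary data is harder than this sketch suggests.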


SCAPE Solutions

• Automated Planning and Watch
  • Building on the Planets Plato tool
  • Automated watch based on
    • Results Evaluation Framework (REF) database
    • Monitoring trends in web harvests
  • Automated planning based on semantically formalized policies

• Automated Quality Assurance
  • QA in web harvesting through automated comparison of rendered pages – combined structural and image analysis
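To make the idea of comparing rendered pages concrete, here is a minimal sketch of the image-analysis half only. It is not the SCAPE method, and it omits the structural analysis entirely. It assumes two renderings of the same page (for example the live site and the harvested copy) are already available as equally sized greyscale grids with values from 0 to 255; `pixel_difference_ratio`, `pages_match`, and both thresholds are hypothetical.

```python
def pixel_difference_ratio(a, b, threshold=16):
    """Fraction of pixels whose greyscale values differ by more than `threshold`.

    `a` and `b` are lists of rows, each row a list of ints in 0..255.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("renderings must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    differing = sum(
        1
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
        if abs(pa - pb) > threshold
    )
    return differing / total


def pages_match(a, b, max_ratio=0.05):
    """Treat two renderings as equivalent if few enough pixels differ."""
    return pixel_difference_ratio(a, b) <= max_ratio
```

A check of this kind could flag harvested pages whose rendering deviates visibly from the original, so that only those are passed on for closer inspection.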


SCAPE Achievements

• Public Website
  • http://www.scape-project.eu/

• Development Infrastructure
  • Hosted by the Open Planets Foundation and GitHub
  • Development Wiki
    • http://wiki.opf-labs.org/display/SP/Home

• Deliverables
  • First Deliverables available for download

• Publications
  • 13 in the first nine months, including 6 at iPres next week
  • Report: comparative analysis of identification tools

• Platform
  • 10-node, 20 TB experimental cluster hosted by AIT


SCAPE Contact Information

• http://www.scape-project.eu/

• office@list.scape-project.eu

• Dr. Ross King
  AIT Austrian Institute of Technology GmbH
  Donau-City-Strasse 1
  A-1220 Wien


Thank you for your attention!
