E-ARK-iPRES2016-Bern-October-2016

European Archival Records and Knowledge Preservation

#earkproject www.eark-project.eu @EARKProject

An OAIS-oriented System for Fast Package Creation, Search, and Access

Sven Schlarb, Rainer Schmidt, Roman Karl, Mihai Bartha, Jan Rörden, Janet Delve, Kuldar Aas

Presenter: Sven Schlarb <[email protected]>

AIT Austrian Institute of Technology

iPRES 2016, Bern, October 3, 2016

The E-ARK project is co-funded by the European Commission under the ICT-PSP programme.

www.eark-project.eu

● E-ARK has defined a basic structure and recommended metadata standards for information packages.

● E-ARK has created a reference implementation covering the functional entities for ingest, archiving, and access according to the OAIS reference model.

● The SME partners KEEP Solutions and ESS have adapted their archiving solutions.

– RODA repository (KEEP)

– ESS Preservation Platform (ESS)

● AIT has developed an environment for processing information packages (SIP, AIP, DIP).

– Providing a graphical front-end called earkweb.

● AIT has developed a scalable backend repository for storing, discovering, and accessing data contained in information packages.

– Initially based on the Lily repository project, now Cloudera Search.

Main outcomes

• Modular package transformation workflows & metadata creation

• Parallel full-text indexing

• Fast random access to individual files

• Aggregating data using facet queries

• Data mining (Classification, NER)

Faceted Search & Data Mining

Access

Full-text indexing & search

Package transformation and Ingest

Reference Implementation Functionality

• Pre-Ingest (Producer)

– Tasks: SIP Creation, Validation, Submission

– E-ARK Tools: Database Preservation Toolkit, RODA/EPP Tools, earkweb

• Ingest

– Tasks: SIP Validation, Archival Processing, AIP Creation

– E-ARK Tools: earkweb, RODA, EPP

• Archival Storage

– Tasks: Storage to Archival Repository

– E-ARK Tools: Lily Repository (Cloudera Search), RODA, EPP

• Data Management

– Tasks: Discover, Select, and Manipulate Records

– E-ARK Tools: Lily Repository, RODA, EPP

• Access

– Tasks: DIP Creation and activation (e.g. within an RDBMS)

– E-ARK Tools: earkweb, RODA

E-ARK Archival Workflow

SIP

E-ARK Information Package (simplified)

representations

metadata

[schemas/documentation]

Structural metadata

Provenance metadata

Technical metadata

Descriptive metadata

SIP

DIP

DIP

Lifecycle

Metadata edits

Migrations

Add emulation info
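The simplified information package layout above can be sketched as a folder skeleton. This is a minimal illustration, not the normative E-ARK structure: the folder names are taken from the diagram labels, and the structural metadata file name `METS.xml` and the helper names `create_package_skeleton` and `list_package` are assumptions made for this example.

```python
import tempfile
from pathlib import Path

# Folder names follow the simplified diagram; the exact names are
# assumptions for illustration, not the normative specification.
PACKAGE_LAYOUT = [
    "representations",
    "metadata/descriptive",
    "metadata/technical",
    "metadata/provenance",
    "schemas",
    "documentation",
]


def create_package_skeleton(root: Path, package_name: str) -> Path:
    """Create an empty information package folder tree under root."""
    package_dir = root / package_name
    for relative in PACKAGE_LAYOUT:
        (package_dir / relative).mkdir(parents=True, exist_ok=True)
    # Empty placeholder for the structural metadata (assumed file name).
    (package_dir / "METS.xml").touch()
    return package_dir


def list_package(package_dir: Path) -> list:
    """Return all paths inside the package, relative to its root, sorted."""
    return sorted(
        p.relative_to(package_dir).as_posix() for p in package_dir.rglob("*")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        package = create_package_skeleton(Path(tmp), "example-sip")
        for entry in list_package(package):
            print(entry)
```

The same skeleton would serve for SIP, AIP, and DIP; only the content and metadata change over the lifecycle.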

• earkweb is based on Python and the Celery task execution system.

– Create archival workflows from predefined tasks which can be executed in parallel on a computer cluster.

– Examples are data validation, format migration, content extraction, database transformation, packaging, interfacing with storage systems.

– earkweb provides a graphical interface and can be used interactively as well as in batch mode.
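The idea of composing a workflow from predefined tasks and running it for many packages in parallel can be sketched as follows. This is not earkweb code and does not use Celery: it substitutes the standard library's thread pool so the example runs on its own, and the task functions `validate` and `extract_text` as well as the package dictionaries are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def validate(package: dict) -> dict:
    """Hypothetical task: a package counts as valid if it lists any files."""
    return {**package, "valid": bool(package["files"])}


def extract_text(package: dict) -> dict:
    """Hypothetical task: note which files would go to text extraction."""
    texts = [f for f in package["files"] if f.endswith((".txt", ".pdf"))]
    return {**package, "text_files": texts}


def run_workflow(package: dict, tasks: list) -> dict:
    """Apply the tasks in order; each task receives the previous result."""
    for task in tasks:
        package = task(package)
    return package


def run_batch(packages: list, tasks: list, workers: int = 4) -> list:
    """Run the same workflow for many packages in parallel (batch mode)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_workflow(p, tasks), packages))


if __name__ == "__main__":
    batch = [
        {"id": "pkg-1", "files": ["report.pdf", "scan.png"]},
        {"id": "pkg-2", "files": []},
    ]
    for result in run_batch(batch, [validate, extract_text]):
        print(result)
```

In a Celery-based setup the tasks would be registered with a task queue and picked up by workers on the cluster; the composition principle stays the same.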

earkweb

• The E-ARK Lily repository provides a scalable backend for storing, discovering, and accessing AIPs based on technologies like Solr, MapReduce, and HBase.

– The repository is entirely distributed, allowing us to handle huge amounts of data.

– It provides full-text search, browsing, and random access to data contained in IPs.

– It provides APIs allowing one to carry out computations (like data mining tasks) across the archived content.

E-ARK Lily/Hadoop Repository

[Figure: deployment diagram. Labels: four worker nodes; staging/storage area (NAS); <<package transfer>> (decoupled); <<notification>>; <<search and retrieval>>; information package status; task results.]

Cluster Deployment Stack

Standalone Deployment Stack

[Figure: deployment diagram. Labels: four worker nodes; staging/storage area (NAS); <<indexing>>; <<search and retrieval>>; information package status; task results.]

Search & Access

• Search within and across information packages

– Full-text index for office documents, PDF, MS Word, etc.

– Search based on defined fields, e.g. size, MIME type, package, etc.

– Results directly linked with the Lily content repository

• Faceted queries that group search results into categories

• Spatio-temporal search in geographical datasets

• Filter search results by estimated text category (machine learning/text classification)
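A faceted search of this kind can be expressed as a Solr select request. The sketch below only builds the request URL; `q`, `fq`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters, while the base URL, the core name `eark`, and the field names `content_type` and `package` are assumptions and not the actual E-ARK index schema.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr select URL with facet counts and optional filters."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # One facet.field parameter per field to aggregate on.
    params += [("facet.field", field) for field in facet_fields]
    # Filter queries narrow the result set without affecting scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = build_facet_query(
        "http://localhost:8983/solr/eark",   # hypothetical Solr core
        "content:archive",                   # hypothetical full-text field
        facet_fields=["content_type", "package"],
        filters={"content_type": "application/pdf"},
    )
    print(url)
```

The response to such a request carries, besides the matching documents, a count per facet value, which is what allows results to be grouped by MIME type or by package.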

Data Mining/NLP

• Purpose:

● Show, by way of example, how to analyse digital resources contained in the archive.

• Selected use cases:

● Location names occurring in texts.

● Named entity recognition and incorporation of geo-information

● Text classification

Location names occurring in texts

● Stanford NER for NER

● Nominatim (database behind openstreetmap.org) for georeferencing

● Peripleo for visualization
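The georeferencing step can be sketched as a lookup against the public Nominatim search endpoint. The example assumes that location names have already been extracted by the NER step; it builds the request URL and parses a JSON response, but does not send the request itself. The helper names `build_geocode_url` and `first_coordinates` are made up for this sketch.

```python
import json
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_geocode_url(place_name, limit=1):
    """Build a Nominatim search URL for a place name found by NER."""
    params = {"q": place_name, "format": "json", "limit": limit}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def first_coordinates(response_text):
    """Return (lat, lon) of the first hit in a JSON response, or None."""
    hits = json.loads(response_text)
    if not hits:
        return None
    # Nominatim returns coordinates as strings, so convert them.
    return float(hits[0]["lat"]), float(hits[0]["lon"])


if __name__ == "__main__":
    # Fetching the URL (e.g. with urllib.request) is left out on purpose;
    # the public service has a usage policy that limits request rates.
    print(build_geocode_url("Bern"))
```

The resulting coordinates are what a visualization layer needs in order to place a text on a map.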

Location names occurring in texts

Peripleo - PELAGIOS Project

Geographical/timeline search


● Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)

● Convert GML data to Peripleo RDF

● Translate coordinate system if necessary

● Use Peripleo to search for and visualize regions and filter by time



Text classification using scikit-learn

● Prepare data to train SVM classifier

● Dump full-texts of the repository into re-usable packages

● Apply text classification and update Solr records accordingly
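The classification steps can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the project's code: TF-IDF features with a linear SVM are one common choice, the Solr field name `category` is hypothetical, and the update documents are only constructed here (in Solr's atomic-update `set` form), not sent to a server. The training texts in the usage part are toy examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_classifier(texts, labels):
    """Train a TF-IDF + linear SVM text classifier."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
    model.fit(texts, labels)
    return model


def solr_category_updates(model, documents, field="category"):
    """Build atomic-update documents that set the predicted category.

    documents maps a record id to its full text.
    """
    ids = list(documents)
    predicted = model.predict([documents[i] for i in ids])
    return [
        {"id": i, field: {"set": str(label)}}
        for i, label in zip(ids, predicted)
    ]


if __name__ == "__main__":
    # Toy training data, made up for this example.
    texts = [
        "invoice payment amount due bank transfer",
        "payment of invoice total amount tax",
        "meeting minutes agenda council decision",
        "council meeting agenda items minutes",
    ]
    labels = ["finance", "finance", "administration", "administration"]
    model = train_classifier(texts, labels)
    print(solr_category_updates(model, {"doc1": "invoice amount due"}))
```

Once the category is stored in the index, it can be used like any other field for filtering and faceting.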

Database archiving, rebuilding and analysis

source: wikipedia

SIARD

RDBMS data (up to 80 TB)

e.g. Postgres e.g. Oracle

Submit ... Archive ... Reconstruct ... Analyse.

• National Archive of Hungary

– Full-scale cluster deployment of earkweb and Hadoop/Lily back end.

– Ingest, search, and access on large volumes of AIPs.

• National Archive of Slovenia

– earkweb and Peripleo installation for ingesting, visualising, and searching geo-data.

• Danish National Archives

– earkweb standalone installation

Current Pilots

Want to try it out?

• Single-machine deployment of the E-ARK Reference Implementation available online: http://earkdev.ait.ac.at/earkweb

• Oracle VirtualBox VM (Standalone Deployment!) available for download: http://earkdev.ait.ac.at/eark/pilots/eark-pilot-vm.ova

• General information about E-ARK: http://www.eark-project.eu
