European Archival Records and Knowledge Preservation
#earkproject www.eark-project.eu @EARKProject
An OAIS-oriented System for Fast Package Creation, Search, and Access
Sven Schlarb, Rainer Schmidt, Roman Karl, Mihai Bartha, Jan Rörden, Janet Delve, Kuldar Aas
Presenter: Sven Schlarb
AIT Austrian Institute of Technology
iPRES 2016, Bern, October 3, 2016
The E-ARK project is co-funded by the European Commission under the ICT-PSP programme
www.eark-project.eu
● E-ARK has defined a basic structure and recommended metadata standards for information packages.
● E-ARK has created a reference implementation covering the functional entities for ingest, archiving, and access according to the OAIS reference model.
● The SME partners KEEP Solutions and ESS have adapted their archiving solutions.
– RODA repository (KEEP)
– ESS Preservation Platform (ESS)
● AIT has developed an environment for processing information packages (SIP, AIP, DIP).
– Providing a graphical front-end called earkweb.
● AIT has developed a scalable backend repository for storing, discovering, and accessing data contained in information packages.
– Initially based on the Lily repository project, now Cloudera Search.
Main outcomes
• Modular package transformation workflows & metadata creation
• Parallelized full-text indexing
• Fast random access to individual files
• Aggregating data using facet queries
• Data mining (Classification, NER)
Faceted Search & Data Mining
Access
Full-text indexing & search
Package transformation and Ingest
Reference Implementation Functionality
• Pre-Ingest (Producer)
– Tasks: SIP Creation, Validation, Submission
– E-ARK Tools: Database Preservation Toolkit, RODA/EPP Tools, earkweb
• Ingest
– Tasks: SIP Validation, Archival Processing, AIP Creation
– E-ARK Tools: earkweb, RODA, EPP
• Archival Storage
– Tasks: Storage to Archival Repository
– E-ARK Tools: Lily Repository (Cloudera Search), RODA, EPP
• Data Management
– Tasks: Discover, Select, and Manipulate Records
– E-ARK Tools: Lily Repository, RODA, EPP
• Access
– Tasks: DIP Creation and activation (e.g. within an RDBMS)
– E-ARK Tools: earkweb, RODA
E-ARK Archival Workflow
SIP
E-ARK Information Package (simplified)
representations
metadata
[schemas/documentation]
Structural metadata
Provenance metadata
Technical metadata
Descriptive metadata
SIP
DIP
DIP
Lifecycle
Metadata edits
Migrations
Add emulation info
• earkweb is based on Python and the Celery task execution system.
– Create archival workflows from predefined tasks which can be executed in parallel on a computer cluster.
– Examples are data validation, format migration, content extraction, database transformation, packaging, interfacing with storage systems.
– earkweb provides a graphical interface and can be used interactively as well as in batch mode.
earkweb
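The idea of composing an archival workflow from predefined tasks and running it over many packages in parallel can be sketched as follows. This is a minimal illustration only: it uses Python's standard-library thread pool as a stand-in for Celery workers, and the task names (validate, migrate, build_aip) and the dictionary layout of a package are hypothetical, not earkweb's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for predefined tasks; the real earkweb task names
# and signatures are not given in the slides.
def validate(package):
    """Reject empty packages and start a processing log."""
    if not package.get("files"):
        raise ValueError(f"package {package['id']} contains no files")
    return {**package, "log": ["validated"]}

def migrate(package):
    """Toy format migration: rename every .doc file to .pdf."""
    files = [f.rsplit(".", 1)[0] + ".pdf" if f.endswith(".doc") else f
             for f in package["files"]]
    return {**package, "files": files, "log": package["log"] + ["migrated"]}

def build_aip(package):
    """Mark the package as an AIP once all earlier tasks have succeeded."""
    return {**package, "type": "AIP", "log": package["log"] + ["packaged"]}

# A workflow is an ordered sequence of tasks (assumed representation).
WORKFLOW = (validate, migrate, build_aip)

def run_workflow(package, tasks=WORKFLOW):
    """Apply each task in order to a single package."""
    for task in tasks:
        package = task(package)
    return package

def run_batch(packages, workers=4):
    """Process independent packages in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_workflow, packages))

sips = [
    {"id": "sip-1", "files": ["report.doc", "map.tif"]},
    {"id": "sip-2", "files": ["minutes.doc"]},
]
for aip in run_batch(sips):
    print(aip["id"], aip["type"], aip["files"], aip["log"])
```

In a Celery-based setup each function would instead be registered as a task and dispatched to workers on the cluster; the sequential-per-package, parallel-across-packages structure stays the same.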
• The E-ARK Lily repository provides a scalable backend for storing, discovering, and accessing AIPs based on technologies like Solr, MapReduce, and HBase.
– The repository is entirely distributed, allowing us to handle huge amounts of data.
– It provides full-text search, browsing, random access to data contained in IPs.
– It provides APIs allowing one to carry out computations (like data mining tasks) across the archived content.
E-ARK Lily/Hadoop Repository
Worker Worker Worker Worker
Staging/Storage Area NAS
<<package transfer>>
decoupled
<<notification>>
<<search and retrieval>>
Information package status
Task results
Cluster Deployment Stack
Standalone Deployment Stack
Worker Worker Worker Worker
Staging/Storage Area NAS
<<indexing>>
<<search and retrieval>>
Information package status
Task results
Search & Access
• Search within and across information packages
– Full text index for office documents, PDF, MS Word, etc.
– Search based on defined fields, e.g. size, mime-type, package, etc.
– Results directly linked with the Lily content repository
• Faceted queries that group search results into different categories
• Spatio-temporal search in geographical datasets
• Filter search according to estimated text category (machine learning/text classification)
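A faceted full-text query of the kind listed above can be sketched as a Solr select request. The parameters q, fq, rows, wt, facet, facet.field, and facet.mincount are standard Solr query parameters; the host, the core name "eark", and the field names "content_type" and "package" are assumptions made for illustration and are not taken from the E-ARK index schema.

```python
from urllib.parse import urlencode

def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr /select URL for a full-text search with facet counts.

    base_url     -- URL of the Solr core (hypothetical in this sketch)
    text         -- full-text query string
    facet_fields -- fields whose value counts should group the results
    filters      -- optional {field: value} restrictions (Solr fq clauses)
    """
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # omit facet values with zero hits
    ]
    params += [("facet.field", field) for field in facet_fields]
    params += [("fq", f'{field}:"{value}"')
               for field, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"

# Assumed core URL and field names, for illustration only.
print(build_facet_query(
    "http://localhost:8983/solr/eark",
    "flood protection",
    facet_fields=["content_type", "package"],
    filters={"package": "pkg-1"},
))
```

Sending the resulting URL to a Solr instance returns the matching documents together with per-field value counts, which a front-end can render as clickable categories.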
Data Mining/NLP
• Purpose:
● Show by example how to analyse digital resources contained in the archive.
• Selected use cases:
● Location names occurring in texts.
● Named entity recognition and incorporation of geo-information
● Text classification
Location names occurring in texts
● StanfordNER for NER
● Nominatim (geocoder behind openstreetmap.org) for georeferencing
● Peripleo for visualization
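The step from recognised entities to georeferencing can be sketched as below. The sketch assumes the NER stage has already produced (token, label) pairs with a LOCATION label, as the Stanford 3-class models do; the sample sentence is invented for illustration. It only builds request URLs for the public Nominatim search endpoint and performs no network access.

```python
from itertools import groupby
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def location_names(tagged_tokens):
    """Join consecutive LOCATION tokens into place names (first-seen order)."""
    names = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "LOCATION":
            name = " ".join(token for token, _ in group)
            if name not in names:
                names.append(name)
    return names

def geocode_url(name, limit=1):
    """Build a Nominatim search URL that asks for JSON results."""
    query = urlencode({"q": name, "format": "json", "limit": limit})
    return f"{NOMINATIM_SEARCH}?{query}"

# Invented NER output for illustration; "O" marks tokens outside any entity.
tagged = [
    ("The", "O"), ("records", "O"), ("moved", "O"), ("from", "O"),
    ("New", "LOCATION"), ("York", "LOCATION"), ("to", "O"),
    ("Bern", "LOCATION"), (".", "O"),
]
for place in location_names(tagged):
    print(place, "->", geocode_url(place))
```

The coordinates returned by the geocoder could then be attached to the indexed record so that a map front-end such as Peripleo can display where a text's place names fall.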
Peripleo - PELAGIOS Project
Geographical/timeline search
● Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)
● Convert GML data to Peripleo RDF
● Translate coordinate system if necessary
● Use Peripleo to search for and visualize regions and filter by time
Text classification using scikit-learn
● Prepare data to train SVM classifier
● Dump full-texts of the repository into re-usable packages
● Apply text classification and update Solr records accordingly
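The three steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual pipeline: the training texts and labels are toy data invented here, the field name "category" is hypothetical, and TF-IDF features with a linear SVM are one common choice among several. The output uses Solr's atomic-update form ({"set": value}) and is only constructed, not sent to a server. The block requires scikit-learn to be installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
train_texts = [
    "annual budget and tax revenue report",
    "invoice payment and accounting ledger",
    "tax declaration and budget forecast",
    "river flood levels and water quality",
    "forest protection and air pollution survey",
    "water pollution measurements near the river",
]
train_labels = ["finance", "finance", "finance",
                "environment", "environment", "environment"]

# Step 1: train an SVM classifier on TF-IDF features.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)

def classify_records(records):
    """Steps 2 and 3: classify dumped full-texts and build index updates.

    records -- iterable of {"id": ..., "text": ...} dicts (assumed dump format)
    Returns Solr atomic-update documents that set the hypothetical
    "category" field on each record.
    """
    records = list(records)
    predictions = classifier.predict([r["text"] for r in records])
    return [{"id": r["id"], "category": {"set": str(label)}}
            for r, label in zip(records, predictions)]

dump = [
    {"id": "doc-1", "text": "quarterly tax and budget overview"},
    {"id": "doc-2", "text": "pollution in the river after the flood"},
]
for update in classify_records(dump):
    print(update)
```

Once the category is stored on each record, it can serve as a facet or filter in search, which is how the estimated text category becomes usable in the front-end.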
Database archiving, rebuilding and analysis
source: wikipedia
SIARD
RDBMS data (up to 80 TB)
e.g. Postgres e.g. Oracle
Submit ... Archive ... Reconstruct ... Analyse.
• National Archive of Hungary
– Full-scale cluster deployment of earkweb and Hadoop/Lily back-end.
– Ingest, search, and access on large volumes of AIPs.
• National Archive of Slovenia
– earkweb and Peripleo installation for ingesting, visualising, and searching geo-data.
• Danish National Archives
– earkweb standalone installation
Current Pilots
Want to try it out?
• Single-machine deployment of the E-ARK Reference Implementation available online: http://earkdev.ait.ac.at/earkweb
• Oracle VirtualBox VM (Standalone Deployment!) available for download: http://earkdev.ait.ac.at/eark/pilots/eark-pilot-vm.ova
• General information about E-ARK: http://www.eark-project.eu