The Use of Big Data Techniques for Digital Archiving. Sven Schlarb, Austrian Institute of Technology. Tuesday 15th March 2016, Cambridge


Page 1: The Use of Big Data Techniques for Digital Archiving

The Use of Big Data Techniques for Digital Archiving

Sven Schlarb, Austrian Institute of Technology

Tuesday 15th March 2016, Cambridge

Page 2: The Use of Big Data Techniques for Digital Archiving

OUTLINE
• E-ARK Project Overview
• Technical Background
• Integrated Prototype
• Data Mining Use Cases

Page 3: The Use of Big Data Techniques for Digital Archiving

Project Overview

Page 4: The Use of Big Data Techniques for Digital Archiving

THE E-ARK PROJECT IS CO-FUNDED BY THE EUROPEAN COMMISSION UNDER THE ICT-PSP PROGRAMME

www.eark-project.eu

Page 5: The Use of Big Data Techniques for Digital Archiving

Advisory Boards

Archival
• Archives of Emilia-Romagna, Italy
• Directorate-General of the Book, of Archives & of Libraries, Portugal
• EC Archives & Records Management
• EC Historical Archives
• German Federal Archives
• National Archives of Bulgaria
• National Archives of Finland
• National Archives of France
• National Archives of Sweden
• National Archives of the Netherlands
• Polish Data Archive
• Queensland State Archives
• Swiss Federal Archives
• UK National Archives
• UK Parliamentary Archives

Commercial / Technical
• Arkivum
• ARMA Europe
• DigitalForever
• Discovery Garden
• Microsoft Research
• Open Preservation Foundation
• Open Text Initiative
• Preservica
• Versity

Data Providers
• Danish Agency for Digitisation
• Estonian Ministry of Economic Affairs & Communication
• Estonian Unemployment Insurance Fund
• James Lappin, RM Consultant

Page 6: The Use of Big Data Techniques for Digital Archiving

Project mission

• Improve access to the archived records of European Archives

• Create guidelines and recommended practices

• Cover relational databases, record management systems, and geographical data

• Create open source implementation evaluated in several pilots

Page 7: The Use of Big Data Techniques for Digital Archiving

Outcomes

Standardisation of available best practices
• Common terminology (Knowledge Center)
• SIP, AIP and DIP format specifications
• Pre-ingest, ingest and access workflows

Open source tools
• Scalable, modular, and reusable implementation of specifications
• Individual deployments (Pilots) and an integrated reference implementation

Page 8: The Use of Big Data Techniques for Digital Archiving

Technical Background

Page 9: The Use of Big Data Techniques for Digital Archiving

Hadoop Cluster
• Task Trackers
• Data Nodes
• Job Tracker
• Name Node

Page 10: The Use of Big Data Techniques for Digital Archiving

Hadoop = MapReduce + HDFS

Distributed processing (MapReduce)
• Example: 2 x quad-core CPUs: 10 Map (parallelisation), 4 Reduce (aggregation)

Distributed storage (HDFS)
• Example: 4 x 1 TB hard disks (replication factor 3): approx. 1.33 TB
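The storage figure on this slide follows from dividing the raw disk capacity by the replication factor. The sketch below only reproduces that arithmetic; the function name is made up for illustration, and it ignores file system overhead and space reserved for non-HDFS use.

```python
def usable_hdfs_capacity_tb(disks: int, disk_size_tb: float, replication_factor: int) -> float:
    """Raw capacity divided by the replication factor (overhead is ignored)."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return disks * disk_size_tb / replication_factor


# Values taken from the slide: 4 x 1 TB hard disks, replication factor 3.
capacity = usable_hdfs_capacity_tb(disks=4, disk_size_tb=1.0, replication_factor=3)
print(f"approx. {capacity:.2f} TB")  # prints "approx. 1.33 TB"
```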

Page 11: The Use of Big Data Techniques for Digital Archiving

Map/Reduce in a nutshell

[Figure: Input data is divided into three input splits of three records each (Record 1 to Record 9). One task per split (Task 1 to Task 3) runs the Map step; the map output passes through Sort, Shuffle and Merge to the Reduce step, which writes the aggregated results as output data.]
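The Map/Reduce flow can be imitated in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop code: the word-count job, the sample records and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_record(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_and_sort(pairs):
    """Group all emitted values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_group(key, values):
    """Reduce step: aggregate all values that share one key."""
    return key, sum(values)


# Three input splits with three records each (hypothetical sample data).
input_splits = [
    ["archive ingest", "archive access", "ingest"],
    ["access", "archive", "search access"],
    ["search", "ingest archive", "access"],
]

# One map task per input split; on a cluster these would run in parallel.
map_outputs = [
    list(chain.from_iterable(map_record(record) for record in split))
    for split in input_splits
]

# Merge the map outputs, group them by key, then reduce each group.
all_pairs = chain.from_iterable(map_outputs)
result = dict(reduce_group(key, values) for key, values in shuffle_and_sort(all_pairs))
print(result)  # {'access': 4, 'archive': 4, 'ingest': 3, 'search': 2}
```

On a real cluster the map tasks run on the nodes that hold the input splits, and the shuffle moves data between nodes; here both are ordinary in-memory operations.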

Page 12: The Use of Big Data Techniques for Digital Archiving

E-ARK Integrated Prototype Architecture & Implementation

Page 13: The Use of Big Data Techniques for Digital Archiving

Base technology stack

Page 14: The Use of Big Data Techniques for Digital Archiving

E-ARK Web

“Integrated” Prototype?

Page 15: The Use of Big Data Techniques for Digital Archiving

Information Package processing & Access Repository

• SIP to AIP
• AIP to DIP
• Workers (Celery)
• Working area (NAS)
• Hadoop Distributed File System
• Search and Access (Lily Repository)
• DIP Delivery

Page 16: The Use of Big Data Techniques for Digital Archiving

Access Repository - Interfaces

Page 17: The Use of Big Data Techniques for Digital Archiving

[Figure labels, grouped under "Ingest and Preservation" and "Access":]

• E-ARK SIP
• SIP Creation Tools
• Archival records
• Content and Records Management Systems
• SIP-AIP Conversion
• E-ARK AIP
• CMIS Interface
• Data Mining Interface
• Digital preservation systems
• AIP-DIP Conversion
• Scalable Computation
• E-ARK DIP
• Archival Search, Access and Display Tools
• Content and Records Management Systems
• Data Mining Showcase

Page 18: The Use of Big Data Techniques for Digital Archiving
Page 19: The Use of Big Data Techniques for Digital Archiving
Page 20: The Use of Big Data Techniques for Digital Archiving
Page 21: The Use of Big Data Techniques for Digital Archiving

E-ARK Data Mining

Page 22: The Use of Big Data Techniques for Digital Archiving

Geographical/timeline search

Peripleo - PELAGIOS Project

Page 23: The Use of Big Data Techniques for Digital Archiving

Geographical/timeline search

Peripleo - PELAGIOS Project

Page 24: The Use of Big Data Techniques for Digital Archiving

Text mining: Text classification

Training

• Train classifier using annotated text corpus
• SVM (based on statistical features)

Classification

• Scan for texts during ingest (or run a MapReduce job afterwards)
• Text category estimation

Search

• Add category as a searchable field to Lily index
• Full-text search using Lily's Solr search interface
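Once the estimated category is indexed as its own field, a full-text search can be narrowed to one category with a Solr filter query. The sketch below only builds such a request URL using the standard Solr parameters `q`, `fq` and `wt`; the base URL, the core name `eark` and the field name `category` are assumptions, not names taken from the E-ARK prototype.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url: str, text: str, category: str) -> str:
    """Build a Solr select URL: full-text query restricted to one category."""
    params = {
        "q": text,                       # full-text query
        "fq": f'category:"{category}"',  # filter on the (hypothetical) category field
        "wt": "json",                    # ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, core name, search text and category.
url = build_solr_query_url("http://localhost:8983/solr/eark", "land register", "cadastre")
print(url)
```

Using `fq` rather than adding the category to `q` keeps the category restriction out of the relevance scoring and lets Solr cache the filter separately.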

Page 25: The Use of Big Data Techniques for Digital Archiving

OLAP (Online Analytical Processing)

• Database archiving and re-use (SIARD2)

• Normalization - OLAP/Oracle Data Warehouse
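As a rough illustration of the kind of analytical query an OLAP setup serves, the sketch below aggregates a fact table along two dimensions (region and year). It uses an in-memory SQLite database as a stand-in, not Oracle or SIARD, and the table names, column names and figures are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A minimal star schema: one dimension table and one fact table (hypothetical data).
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_claims (region_id INTEGER, year INTEGER, amount REAL);
    INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_claims VALUES
        (1, 2014, 100.0), (1, 2015, 150.0),
        (2, 2014, 80.0), (2, 2015, 120.0), (2, 2015, 30.0);
""")

# Aggregate the facts by region and year.
rows = conn.execute("""
    SELECT r.name, f.year, SUM(f.amount)
    FROM fact_claims AS f
    JOIN dim_region AS r USING (region_id)
    GROUP BY r.name, f.year
    ORDER BY r.name, f.year
""").fetchall()

for name, year, total in rows:
    print(name, year, total)
conn.close()
```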

Page 26: The Use of Big Data Techniques for Digital Archiving

Thank you!

• http://www.eark-project.eu
• https://github.com/eark-project