Funded by:
© AHDS
Sherpa DP – a Technical Architecture for a Disaggregated Preservation Service
Mark Hedges
Arts and Humanities Data Service
King’s College London
Digital repositories: Dealing with the digital deluge, Manchester, 5 June 2007
SHERPA DP Project
Development Partners: AHDS at King’s College London (Lead), Nottingham, Glasgow, Edinburgh, White Rose Consortium, London Leap Consortium
Objective: To create a shared, distributed preservation environment for the SHERPA project framed around the OAIS Reference Model.
Notes:
• Participating repositories are all based on DSpace or EPrints.
• Relatively simple data objects (eprints).
Distributed OAIS Model
[Figure: two linked OAIS instances.]
• Institutional Repository (Content Provider): Producer, Consumer, Ingest, Data Management, Archival Storage, Access, Administration; packages shown: SIP, DIP
• Preservation Service (Service Provider): Ingest, Data Management, Archival Storage, Access, Administration, Preservation Planning; packages shown: SIP, AIP, DIP
Distributed Workflow
[Figure: flowchart in two swimlanes, Content Provider (Institutional Repository) and Service Provider (Preservation Service), with Yes/No branches at the decision points. Box labels:]
• Submit data & metadata
• Validation successful? (appears twice)
• Request resubmission (appears twice)
• Is metadata complete?
• Enhance metadata
• Copy SIP to repository store
• E-print in appropriate deposit format?
• Migrate to dissemination format
• Transfer DIP to storage
• Make available in catalogue
• Researcher (Consumer) accesses data
• Metadata transfer
• Data transfer
• Create technical metadata
• Generate AIP
• Risk assessment (appears twice)
• Issues identified / No problems identified
• Resolve issues
• Implement preservation strategy
• Transfer AIP to preservation store
• Schedule obsolescence monitoring
• Format considered at-risk / No obsolescence problems identified
• Format at-risk
• Record details of migration action
• Generate replacement DIP
System Architecture
[Figure: component diagram. Labels:]
• Web Interface
• Sherpa DP Services: Ingest Services, Post-Ingest Services, Enquiry Services
• Ingest Functions, Post-Ingest Functions, Enquiries
• Fedora Services
• Fedora Generic Search
• Fedora Core Repository Service
• Relational DB, local file system
• Externally Referenced Content
• External services, e.g. DROID
• Connections labelled HTTP, SOAP/HTTP and REST
Key preservation actions at ingest
• Integrity/fixity checks.
• File format identification.
• Preservation metadata creation.
• Implement preservation strategy.
• File format normalisation.
• Others …
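The fixity check listed above can be illustrated with a minimal sketch. It assumes the depositor supplies a SHA-256 digest alongside each file; the slides do not say which digest algorithm the project uses, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, expected_digest: str) -> bool:
    """Compare the recomputed digest with the one supplied at deposit (assumed SHA-256)."""
    return sha256_of(path) == expected_digest.lower()
```

Reading in chunks keeps memory use flat for large deposits, which matters given the scalability requirement.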
Requirements
• Scalability: need to handle increasingly large quantities of data
• Generation and management of an extensive set of preservation metadata
• Audit trail/provenance metadata: knowledge held in explicit machine-processable form
More Requirements
• Distributed architecture
• Integration of specialised tools
• Follow standards to allow flexible integration of future tools
• Automate workflow where possible, but also allow human interaction
Approach
• Web services encapsulating preservation actions
• Web interface for points in the process where human input is required
• Linked by workflow management tool
Workflow management
• Large number of tools available
  – Taverna
  – BPEL (Active BPEL)
  – jBPM
  – others …
• Settled on jBPM
jBPM
• Web services and UI functions chained together to form a workflow or “Business Process”
• Open source, flexible, extensible workflow management system
• Bridges the gap between users and developers by giving them a common language
• Packaged as a J2EE application: can run on any J2EE application server like JBoss, Tomcat, etc.
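The idea of chaining actions into a process can be sketched in a few lines. This is an illustration of the general pattern only, not jBPM's API or jPDL: the `Process` class, the node names and the outcome labels are all hypothetical, and the actions stand in for calls to web services.

```python
from typing import Callable, Dict, List, Optional

# An action inspects or updates the shared context and returns an outcome label.
Action = Callable[[dict], str]


class Process:
    """A tiny process definition: named nodes joined by labelled transitions."""

    def __init__(self, start: str) -> None:
        self.start = start
        self.actions: Dict[str, Action] = {}
        self.transitions: Dict[str, Dict[str, str]] = {}

    def add_node(self, name: str, action: Action,
                 transitions: Optional[Dict[str, str]] = None) -> None:
        self.actions[name] = action
        self.transitions[name] = transitions or {}

    def run(self, context: dict) -> List[str]:
        """Execute from the start node; stop when an outcome has no transition."""
        visited: List[str] = []
        current: Optional[str] = self.start
        while current is not None:
            visited.append(current)
            outcome = self.actions[current](context)
            current = self.transitions[current].get(outcome)
        return visited


def validate(context: dict) -> str:
    # Decision node: a submission with no files fails validation.
    return "valid" if context.get("files") else "invalid"


def request_resubmission(context: dict) -> str:
    context["status"] = "resubmission requested"
    return "end"


def create_technical_metadata(context: dict) -> str:
    # Placeholder for a call to a characterisation service.
    context["technical_metadata"] = {
        name: len(data) for name, data in context["files"].items()
    }
    return "done"


def generate_aip(context: dict) -> str:
    context["status"] = "AIP generated"
    return "end"


ingest = Process(start="validate")
ingest.add_node("validate", validate,
                {"valid": "create-technical-metadata",
                 "invalid": "request-resubmission"})
ingest.add_node("request-resubmission", request_resubmission)
ingest.add_node("create-technical-metadata", create_technical_metadata,
                {"done": "generate-aip"})
ingest.add_node("generate-aip", generate_aip)
```

Calling `ingest.run({"files": {"paper.pdf": b"..."}})` returns the list of nodes visited, which doubles as a simple trace of the path taken through the process.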
Preservation Metadata
• Approach based on PREMIS data dictionary
• PREMIS data model based on five categories: intellectual entities, objects, agents, events, rights
• Implementing a subset of this model
• … with some format-specific extensions (e.g. MIX for images)
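A subset of the PREMIS model described above might be held in memory roughly as follows. This is a simplified sketch: the field names are loosely borrowed from PREMIS semantic units, the identifiers in the usage note are made up, and which subset the project actually implements is not stated in the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PremisObject:
    identifier: str
    format_name: str
    message_digest: str


@dataclass
class PremisAgent:
    identifier: str
    name: str


@dataclass
class PremisEvent:
    identifier: str
    event_type: str          # e.g. "fixity check", "migration"
    date_time: str           # ISO 8601 timestamp
    outcome: str
    linking_object_ids: List[str] = field(default_factory=list)
    linking_agent_ids: List[str] = field(default_factory=list)


def events_for(object_id: str, events: List[PremisEvent]) -> List[PremisEvent]:
    """Return the audit trail for one object: every event linked to it."""
    return [event for event in events if object_id in event.linking_object_ids]
```

For example, `events_for("obj-1", all_events)` yields the provenance history of the object with the hypothetical identifier `obj-1`, which is the machine-processable audit trail the requirements call for.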
Available Tools
• Stand-alone specialised tools that perform preservation-related tasks
• File format identification, e.g. DROID-PRONOM
  – Developed by The National Archives
  – Identification of file formats based on their file signatures
• Technical metadata generation, e.g. JHOVE
  – Extensible framework for format validation
  – Perform format-specific identification, validation, and characterization of a digital object
• File format migration tools (e.g. XENA, Open Office)
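Signature-based identification, the principle behind the identification tools above, can be sketched as a lookup of leading "magic" bytes. This is not DROID's algorithm and does not use PRONOM signature files; the small table below is an assumption for illustration and contains only a few widely known signatures.

```python
from typing import Optional

# (leading bytes, format name): a tiny hand-written table, not PRONOM data.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify_format(data: bytes) -> Optional[str]:
    """Return the format whose signature matches the start of the data, if any."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None
```

Matching on content rather than file extension means a mislabelled deposit is still identified correctly, or reported as unknown (`None`) so that it can be routed to a human.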
Available tools and workflow
• Tools written in different languages
• Define generic interfaces for preservation actions
• Wrap the tools used as web services to promote:
  – Interoperability
  – Loose coupling, flexibility
  – Reusability
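A generic interface for preservation actions could look something like the sketch below. The class and field names are hypothetical, the two concrete classes are trivial stand-ins rather than wrappers for real tools, and the web-service layer itself is omitted.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ActionResult:
    """Common result shape, whatever tool produced it."""
    action: str
    tool: str
    outcome: str
    details: Dict[str, str]


class PreservationAction(ABC):
    action = "unspecified"
    tool = "unspecified"

    @abstractmethod
    def perform(self, name: str, content: bytes) -> ActionResult:
        """Run the wrapped tool on one file and normalise its output."""


class ExtensionIdentifier(PreservationAction):
    # Stand-in for an identification tool; real tools inspect the bytes.
    action = "format-identification"
    tool = "extension-lookup (hypothetical)"

    def perform(self, name: str, content: bytes) -> ActionResult:
        suffix = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        outcome = "success" if suffix else "unknown"
        return ActionResult(self.action, self.tool, outcome, {"format": suffix})


class SizeCharacteriser(PreservationAction):
    # Stand-in for a characterisation tool.
    action = "characterisation"
    tool = "byte-counter (hypothetical)"

    def perform(self, name: str, content: bytes) -> ActionResult:
        return ActionResult(self.action, self.tool, "success",
                            {"size_bytes": str(len(content))})


def run_actions(actions: List[PreservationAction],
                name: str, content: bytes) -> List[ActionResult]:
    """The caller depends only on the interface, not on any particular tool."""
    return [action.perform(name, content) for action in actions]
```

Because callers see only `PreservationAction` and `ActionResult`, a tool can be swapped for another without touching the workflow, which is the loose coupling the slide refers to.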
Workflow in jBPM
jBPM (jPDL)
Node and ActionHandler
Workflow Inputs & Outputs
[Figure: a Submission Information Package (SIP) enters the workflow, which produces an Archival Information Package (AIP) and Dissemination Information Packages (DIPs).]
Workflow Outputs
• Multiple METS packages (atomic model), each containing (some of):
  – Data
  – Descriptive metadata
  – PREMIS object metadata (technical)
  – PREMIS event metadata
  – PREMIS relationship metadata
  – Format-specific technical metadata (e.g. MIX)
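The shape of one such METS package can be sketched with the standard library. This builds an empty skeleton only: the section IDs, the choice of `DC` for descriptive metadata and the placement of PREMIS object and event metadata in `techMD` and `digiprovMD` are assumptions, and the output is not validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def md_section(parent: ET.Element, tag: str, section_id: str, mdtype: str) -> ET.Element:
    """Add a metadata section with an empty xmlData wrapper and return the wrapper."""
    section = ET.SubElement(parent, q(tag), {"ID": section_id})
    wrap = ET.SubElement(section, q("mdWrap"), {"MDTYPE": mdtype})
    return ET.SubElement(wrap, q("xmlData"))


def build_package(object_id: str) -> ET.Element:
    mets = ET.Element(q("mets"), {"OBJID": object_id})
    md_section(mets, "dmdSec", "DMD1", "DC")            # descriptive metadata
    amd = ET.SubElement(mets, q("amdSec"))
    md_section(amd, "techMD", "TECH1", "PREMIS")        # PREMIS object metadata
    md_section(amd, "digiprovMD", "PROV1", "PREMIS")    # PREMIS event metadata
    file_sec = ET.SubElement(mets, q("fileSec"))
    group = ET.SubElement(file_sec, q("fileGrp"))
    ET.SubElement(group, q("file"), {"ID": "FILE1", "ADMID": "TECH1 PROV1"})
    struct = ET.SubElement(mets, q("structMap"))
    div = ET.SubElement(struct, q("div"), {"DMDID": "DMD1"})
    ET.SubElement(div, q("fptr"), {"FILEID": "FILE1"})
    return mets
```

`ET.tostring(build_package("eprint-1"), encoding="unicode")` serialises the skeleton; the actual metadata would be inserted into the `xmlData` elements that `md_section` returns.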
Fedora object model
[Figure: object graph.]
• Nodes: e-print; updated version of e-print; original manifestation; normalised manifestation; original file 1; original file 2; migrated file 1
• Relationships: hasVersion, hasManifestation, hasPart, isDerivedFrom
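The relationships in the object model can be treated as subject-predicate-object triples. In the sketch below the identifiers are hypothetical and the direction of each relationship is an assumption (e-print to manifestation to file, migrated file to its source), since the figure's arrows do not survive in the text.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

TRIPLES: List[Triple] = [
    ("eprint:1", "hasManifestation", "eprint:1/original"),
    ("eprint:1", "hasManifestation", "eprint:1/normalised"),
    ("eprint:1/original", "hasPart", "file:original-1"),
    ("eprint:1/original", "hasPart", "file:original-2"),
    ("eprint:1/normalised", "hasPart", "file:migrated-1"),
    ("file:migrated-1", "isDerivedFrom", "file:original-1"),
    ("eprint:1", "hasVersion", "eprint:2"),
]


def objects_of(subject: str, predicate: str, triples: List[Triple]) -> List[str]:
    """Return every object linked from the subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def files_of(eprint: str, triples: List[Triple]) -> List[str]:
    """Collect the files of all manifestations of an e-print."""
    files: List[str] = []
    for manifestation in objects_of(eprint, "hasManifestation", triples):
        files.extend(objects_of(manifestation, "hasPart", triples))
    return files
```

Keeping derivation as an explicit `isDerivedFrom` link means a migrated file can always be traced back to the original it replaced.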
Issues with automation
• Preserving content – what do we actually want to preserve?
• Significant properties – soft concept, hard to quantify (INSPECT)
• Lack of suitable tools – expensive, outputs unreliable
Next Steps
• SHERPA DP 2 (2007-2008), looking at:
  – Additional repository types
  – More complex object types
  – Different methods of data transfer
• Generalise system
• Add post-ingest preservation actions
• Add semantics for dynamic service discovery
• Resource discovery metadata generation
Questions
Contact: [email protected]