Implementing Archivematica for research data preservation at York and Hull
Jenny Mitcham (Digital Archivist) - University of York
Jisc RDN event - 06 September 2016

Implementing Archivematica, research data network

What I’m going to cover

This is a presentation in 4 parts:

1. Background to our project
2. Implementing Archivematica
3. The challenges of preserving research data
4. Future plans

Part one: The Filling the Digital Preservation Gap project

Filling the digital preservation gap: Project aim

“…to investigate Archivematica and explore how it might be used to provide digital preservation functionality within a wider infrastructure for Research Data Management.”

Project structure
• Phase 1 – explore: testing, research, thinking - produce a report (3 months)
• Phase 2 – develop: make Archivematica better for RDM, plan implementation - report (4 months)
• Phase 3 – implement: set up proofs of concept at York and Hull and investigate the file format problem further (6 months)

The team

University of Hull:
• Chris Awre – Head of Information Services, Library and Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist

University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist

Artefactual Systems

Funded by Jisc (Research Data Spring)

Part two: Implementing Archivematica

What are we trying to achieve?
Demonstrate that it is possible to:
• pull metadata from PURE / pull content from Box
• capture further data to help us manage the dataset
• automatically initiate ingest by Archivematica
• set up Archivematica to package the data up for longer term preservation (automatically)
• provide a dissemination copy of the data for our Hydra repository

...basically what we said in our implementation plans

In addition…
• Keep an eye on the broader picture
  – How can preservation processes for research data be used for other materials, e.g. archives?
• Consider different use cases for research data organisation on deposit
  – Single file, multiple files, hierarchical files, etc.
  – With or without associated metadata
• Share experiences across two institutions with different environments

How did we approach it?
We wanted to work in a way that:
• was useful to others
• was open and accessible
• had the bigger picture in mind
So we are:
• sharing code on GitHub
• working in Google Docs
• engaging Hydra and Archivematica communities
• blogging and talking at events like this

What does it look like? York

What does it look like? Hull

What were the challenges?
• mostly time!
  – recruiting a suitably skilled developer at short notice
  – relying on Artefactual Systems, who have their own list of priorities and timescales
  – working with local IT department and different priorities
• outstanding tasks from phase 2 which needed further development
• integration/APIs (e.g. with PURE and Box)

What worked well?
• Re-using existing code (rather than re-inventing the wheel)
  – The puree gem from Lancaster University: this is a way of pulling metadata out of PURE and it saved us a huge amount of work
  – Automation tools from Artefactual Systems: a lightweight method of automating transfers within Archivematica. We funded a webinar about this in phase 2 of our project.
• Flexibility and capacity in house to do the work

Part three: The challenges of preserving files that we can’t identify

A quick look at file formats

Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new
This means they can be hard to preserve... The first hurdle is that we can’t identify them. If we can’t identify them, how can we carry out preservation activities?

Research data applications in use at York

The NDSA Levels of Digital Preservation:

Level 2 requires you to know what you’ve got ...and levels 3 and 4 build on this

Can we identify our research data?

We ran Droid* over the research data deposited with us over the past year. Out of 3752 individual files:
• only 37% (1382) of the files were identified (with varying degrees of accuracy)
• there were 34 different identified file formats in the sample

* Droid is a free tool from The National Archives that can be used to automatically identify file formats
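
For anyone wanting to repeat this kind of count on their own deposits, here is a minimal sketch in Python. It assumes the identification results have been exported to CSV with one row per file and a `PUID` column that is left empty when no format was matched; the column names, the function name and the sample rows are illustrative assumptions, not output from our project.

```python
import csv
import io


def identification_rate(csv_text, puid_column="PUID"):
    """Return (identified, total, percent) for a CSV export of identification results.

    Assumption: one row per file, and an empty PUID cell means "not identified".
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = len(rows)
    identified = sum(1 for row in rows if (row.get(puid_column) or "").strip())
    percent = 100.0 * identified / total if total else 0.0
    return identified, total, percent


# Made-up sample rows, for illustration only.
sample = (
    "NAME,EXT,PUID\n"
    "report.pdf,pdf,fmt/18\n"
    "run1.dat,dat,\n"
    "notes,,\n"
    "plot.png,png,fmt/11\n"
)

identified, total, percent = identification_rate(sample)
print(f"{identified} of {total} files identified ({percent:.0f}%)")
```

A real export may also contain rows for folders, which would need filtering out before counting. The same arithmetic applied to the figures on this slide gives 1382 / 3752, which rounds to 37%.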

Identified research data files
Files identified by Droid (listed by file type)
...note that native files of the software in the previous graph of research data applications are not represented

Unidentified research data files
• Files not identified by Droid (listed by file ext)
• 107 different file extensions not identified
  – huge number with no extension (help!)
  – how do we solve the .dat file problem?
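
A listing by file extension like this one can be produced with a few lines of Python. The sketch below is only an illustration under stated assumptions: the file names are invented, and `tally_extensions` is a hypothetical helper, not part of Droid or Archivematica.

```python
from collections import Counter
from pathlib import PurePath


def tally_extensions(filenames):
    """Count files per lower-cased extension, grouping extension-less files together."""
    counts = Counter()
    for name in filenames:
        ext = PurePath(name).suffix.lower()
        counts[ext or "(no extension)"] += 1
    return counts


# Invented names standing in for files that were not identified.
unidentified = ["run1.dat", "run2.DAT", "README", "scan.xyz", "data/output"]

for ext, count in tally_extensions(unidentified).most_common():
    print(ext, count)
```

A tally like this shows where new signatures would help most, but it cannot solve the .dat problem by itself: the same extension is used by many unrelated applications, so the extension alone says little about the format.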

What is the project doing to solve the file identification problem?

• We have sponsored the development of 8 new file format signature records in PRONOM for different types of research data
• We have created our own research data file signatures for inclusion in PRONOM (and blogged about it to encourage others to do the same)
• We have been talking to TNA about how to engage the community more
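
To show why signatures matter more than extensions, here is a deliberately simplified Python sketch of signature-based identification: it looks for a known byte sequence at a fixed offset in the file. The three byte sequences are the widely documented magic numbers for PNG, PDF and ZIP; the `SIGNATURES` table and `identify` function are hypothetical and are not PRONOM records. Real PRONOM signatures are richer than this (variable offsets, wildcards, sequences anchored to the end of the file).

```python
# (format name, byte offset, expected bytes) - a toy table, not PRONOM data.
SIGNATURES = [
    ("PNG image", 0, bytes.fromhex("89504e470d0a1a0a")),
    ("PDF document", 0, b"%PDF-"),
    ("ZIP container", 0, b"PK\x03\x04"),
]


def identify(data):
    """Return the first format whose byte sequence matches, or None if nothing matches."""
    for name, offset, magic in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return name
    return None


# The file name plays no part here: only the bytes are examined.
print(identify(b"%PDF-1.4\n% sample bytes"))
print(identify(b"\x00\x01\x02 unknown instrument output"))
```

A format with no entry in the table comes back as `None`, which is the situation for much research data until someone writes and submits a signature for it.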

Part four: Future plans

Future plans
• We have a week left to finish our active project work (eeeek!)
• ...and look out for our phase 3 report in mid October (and other dissemination outputs)
• We need to work out how to move from ‘proof of concept’ to production
  – York will be establishing how to move seamlessly from this project into the Jisc Shared Service
  – Hull will be using the work to inform a City of Culture digital archive

Where to find out more

Do talk to me if you are interested in finding out more about this project

Useful links:
Project website: http://www.york.ac.uk/borthwick/archivematica
Digital archiving blog: http://digital-archiving.blogspot.co.uk/
Archivematica: https://www.archivematica.org/en/
Phase 1 report: http://dx.doi.org/10.6084/m9.figshare.1481170
Phase 2 report: https://dx.doi.org/10.6084/m9.figshare.2073220