
Image BioInformatics Research Group

Department of Zoology

University of Oxford, UK

http://ibrg.zoo.ox.ac.uk

DataFlow VIDaaS Launch Event

Saïd Business School, Oxford University, 2 March 2012

The JISC UMF DataFlow Project

Introduction to DataStage

© David Shotton, 2012 Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence

e-mail: [email protected]

David Shotton (PI, JISC UMF DataFlow Project)

http://www.dataflow.ox.ac.uk

And the winning platform is . . .

At Queen Mary College, there is a JISC MRD Project entitled

Sustainable Management of Digital Music Research Data

http://rdm.c4dm.eecs.qmul.ac.uk/

After carefully reviewing several data management systems last December, including Fedora Commons, DataVerse and DSpace, they concluded:

“On paper, DataFlow is a winner: it meets (almost) all our requirements, especially because of DataStage, something other platforms don't offer.

“DataStage would be particularly appreciated, because it would make the integration of the system in the research workflow much less disruptive.

“Sadly, the availability of DataFlow software will come too late to be useful for our short project (October 2011 – March 2012).”

Well, now the DataFlow software systems, DataStage and DataBank, are available, and we hope they will meet the needs of many of you here

Why don’t researchers publish data?

Three pressures presently prevent researchers from publishing their data

Information overload and pressure of work

With twenty new papers each week, a researcher can never catch up – there is just too much new scientific information being produced now

Have to run to stand still - no time for ‘fringe’ activities like data curation

Departmental pressure for financial viability, determined by the REF

pressure to win grants and to publish in high impact journals

negligible incentives and academic reward in terms of peer esteem, tenure or promotion for data publication activities

Cognitive overhead and skill barriers to best-practice data management

metadata concepts are foreign to most biomedical researchers

large amount of effort involved in preparing data for publication

[From evidence submitted 5 August 2011 to the Royal Society’s Science as a Public Enterprise policy study]

Easing the pain of data archiving and publication

- the principle of ‘sheer curation’ (http://en.wikipedia.org/wiki/Sheer_curation)

Create a data management infrastructure that:

works with you rather than against you

accommodates the data management tools with which you are already familiar (e.g. spreadsheets)

provides services that are of immediate benefit in your day-to-day activities (e.g. shared file access)

makes data management, data publication and data archiving activities sufficiently lightweight, intuitive and ‘transparent’ that they are easily achieved, without imposing a significant cognitive overhead

By achieving this, we can bridge the gap between laboratory and repository

Making data management as simple as possible

Managing data using a two-tier infrastructure

Researchers can save files to a secure private DataStage file store

This is purely for their own benefit

‘Just a file store’ - imposes no cognitive overhead – “sheer curation”

Requires no software installation on the researchers’ computers

Designed for deployment at the research group level, locally or on a cloud

Primary access is as a mapped network drive, “Drive D:”, on each computer

You save files to DataStage just as you would to your local hard drive

No restrictions or limitations of file type – whatever you normally use

Web access allows users to browse files within DataStage

Advantages over a cheap hard drive from PC World under your desk:

Regular nightly automated backup – no need to remember to do so

Private, shared and collaborative areas, with controlled group access

Additional Web interface to DataStage, using the same user credentials

Can invite overseas colleagues to access your files, via password control

Tier One: DataStage


The special Web submission interface permits researchers to select and

package data files for publication and long-term repository archiving

Easy to do

When the researcher is ready

Minimal metadata requirement, to encourage usage

The selected files are put in a special directory, with optional sub-directories

The files are accompanied by simple metadata, stored as an RDF manifest

It is possible to represent data files stored elsewhere using URIs

useful for large data files that already have stable storage locations

Packaging uses the BagIt file packaging specification from the California Digital

Library (https://wiki.ucop.edu/display/Curation/BagIt)

The resulting files are then zipped into a single object for transmission to

DataBank, the institutional data repository
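The packaging steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DataStage implementation: the function name `make_bag`, the choice of SHA-256 for the payload manifest and the BagIt version string are all assumptions, and the RDF manifest is left out for brevity.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into a BagIt-style bag and return the path of its zip.

    Hypothetical helper for illustration only; not DataStage code.
    """
    # BagIt keeps the payload files under a "data" directory.
    payload = bag_dir / "data"
    shutil.copytree(source_dir, payload)

    # Bag declaration file required by the BagIt specification.
    # The version string is an assumption.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <relative path>" line per file.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )

    # Zip the whole bag so that it travels as a single object.
    archive = shutil.make_archive(
        str(bag_dir), "zip", root_dir=bag_dir.parent, base_dir=bag_dir.name
    )
    return Path(archive)
```

Calling `make_bag(Path("my_results"), Path("package"))` would leave a `package.zip` next to the `package` directory, holding the bag declaration, the checksum manifest and the copied files.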

Spanning the tiers: DataStage to DataBank


DataBank is a scalable data repository designed for institutional deployment

Developed by the Bodleian Library, with a track record in preservation

Cloud-deployable

Easy for a researcher to update a dataset with a revised version if required

Data packages normally published under a ‘CCZero’ Open Data Waiver

Confidential data packages can be kept in a separate ‘dark’ repository

Data packages assigned DOIs, making them citable (for academic credit)

Optional user-defined embargo period to permit journal article publication

Upon receipt of a DataStage data package, DataBank

unzips the data package to give access to the files,

mints a DOI for the data package, and registers it with DataCite

displays the RDF manifest metadata, and enriches it (e.g. with the DOI)

indexes the metadata, and provides a search and browse interface
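The first of these ingest steps, unzipping a received package, might look roughly like the sketch below. It assumes a zip containing one top-level bag directory with a `manifest-sha256.txt` file. The checksum verification is an added assumption rather than something DataBank is stated to do, and `unpack_and_check` is a hypothetical name. DOI minting, DataCite registration and indexing are not shown.

```python
import hashlib
import zipfile
from pathlib import Path


def unpack_and_check(zip_path: Path, dest: Path) -> list[str]:
    """Unzip a data package and list payload files whose checksum differs.

    Hypothetical helper for illustration only; not DataBank code.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

    # Assumption: the zip holds exactly one top-level bag directory.
    bag_dir = next(p for p in dest.iterdir() if p.is_dir())

    mismatches = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        # Each manifest line is "<checksum> <relative path>".
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            mismatches.append(rel_path)
    return mismatches
```

An empty list means every payload file matches the manifest it arrived with.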

DataBank is, in actuality, just an interface layer over a generic object store, as

Neil will explain later this morning

Tier Two: DataBank

DataFlow software services - summary

DataStage file system

Researchers

DataBank repository

Researchers, other users

Zipped BagIt Data Package with RDF metadata manifest

The DataStage / DataBank Beta Launch

The DataFlow Project has involved

taking our initial working DataStage and DataBank prototypes

undertaking a complete code review, rewriting where necessary

improving the user interfaces

preparing the software for deployment in two forms

as a Virtual Machine to run in a VMware environment

as a Debian Package to install on the Ubuntu operating system

writing documentation to describe the installation and functionality

Beta release v0.1 of the DataStage and DataBank services is now available

can be run locally or on a cloud

installation easy and customizable (e.g. your name & logo)

enable research groups and institutions to provide their members with

zero-cost data management solutions (apart from hosting costs)

cloud provision can expand and shrink with requirements

no need to build and staff your own local data centre

Acknowledgements . . . thanks to the JISC UMF for funding

and acknowledgement of the excellent work of my DataFlow colleagues:

Bhavana Ananda, Katherine Fletcher, Graham Klyne (IBRG)

Ian Chard, Neil Jefferies, Anusha Ranganathan (Bodleian Library)

Alex Dutton, Joseph Talbot (OU Computing Service)

Gabriel Hanganu, Sander van der Waal (OSS Watch)

Ross Gardler (Open Directive LLP)

Neil Caithness, Matteo Turilli, David Wallom (Oxford e-Research Centre)

Richard Jones, Ben O’Steen (Cottage Labs)

Stephanie Taylor (Critical Eye Communications)

Matthew Barker, Tom Ellis, Alex Hartwig (Canonical Ltd)

. . . time for a user endorsement

Graham Klyne, architect of the original DataStage prototype

Bhavana Ananda, current DataStage developer

. . . and a DataStage demo

Chris Holland, Department of Zoology

New for Beta Release v0.2, early April 2012

Integration of SWORD v2 repository submission protocol

DataStage data packages can be submitted to any SWORD-compliant

repository (e.g. the Dryad Data Repository, www.datadryad.org)

DataBank will be able to ingest data packages from any SWORD client
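A SWORD v2 deposit is an HTTP POST of the zipped package to a repository collection URL, with a few extra headers. The sketch below only builds such a request and does not send it. The function names are hypothetical, the collection URL in the usage note is a placeholder, and the `SimpleZip` packaging identifier is an assumption about what a given repository accepts.

```python
import urllib.request

# Packaging identifier defined by the SWORD v2 profile for plain zip files.
# Whether a particular repository expects this value is an assumption.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def deposit_headers(filename: str, in_progress: bool = False) -> dict:
    """Return the HTTP headers for a SWORD v2 binary deposit."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": SIMPLE_ZIP,
        "In-Progress": "true" if in_progress else "false",
    }


def deposit_request(
    collection_url: str, package: bytes, filename: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the zipped package."""
    return urllib.request.Request(
        collection_url,
        data=package,
        headers=deposit_headers(filename),
        method="POST",
    )
```

For example, `deposit_request("https://repository.example.org/sword/collection", zip_bytes, "package.zip")` returns a request that could be passed to `urllib.request.urlopen` once credentials for the target repository have been added.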

DataBank, as well as DataStage, will by then have Debian packaging for ease

of deployment onto Ubuntu Linux hosts

Re-inclusion of WebDAV, to permit users to read and write via Web access

Deployment will be tested on a wider range of cloud hosting environments

for both VMware virtual machine and Debian package installation

including the Eduserv academic cloud

User interface improvement and additional functionality on the basis of existing

plans and user feedback

Leading to a fully-featured release (Version 1.0) in May 2012

DataFlow services summary – adding SWORD

DataStage file system

Researchers

DataBank repository

Researchers, other users

SWORD deposit protocol

Zipped BagIt Data Package with RDF metadata manifest

The conventional research data lifecycle

Scholarly publications: conference papers and journal articles

Raw data in research notebooks and live PC files

Research results and conclusions

Data selection and interpretation

Publication activities

Research datasets abandoned on local hard drives or CD-ROMs

Hypothesis formulation and project design

Experimentation and data creation

Research plan

Institutional repositories

The DataFlow-enhanced research data lifecycle

Scholarly publications: conference papers and journal articles

Raw data in research notebooks and live PC files

Research results and conclusions

Hypothesis formulation and project design

Experimentation and data creation

Data selection and interpretation

Publication activities

Research plan

DataBank repository

Archived datasets

DataStage filestore

Private yet sharable

Open data on Web

Management

Dissemination

Preservation

So what have we got in DataStage?

‘Just a file store’, appearing as a mapped drive – easy to use

Customizable access controls to suit different types of groups

Does not require software installation on user’s computer

Uses standard software components found on every client machine

Cross-platform – Windows, Mac or Linux

DataStage server hosted on Ubuntu Linux system

Deployable locally, or on a cloud

FREE, apart from hosting costs

Has Web access, permitting Web apps to be built on top

For example, for data packaging and SWORD repository submission

Other Web apps possible . . .

Can be used for more than just storing datasets

Wider applications of DataStage

Escaping the Ivory Tower

Applications in commerce

Applications in education

Adding a security app

Time-stamp each data file using an irrevocable method

Encrypt each data file using, for example, the OpenPGP standard

Create a data package of time-stamped encrypted files

Compute the UNF (Universal Numeric Fingerprint) for the data package, so one can later ensure that it has not been altered

Applications:

Experimental data security for patent application – e.g. pharmaceuticals

Secure storage of financial data – many commercial companies
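The "fingerprint now, verify later" idea behind such a security app can be sketched as follows. Note the substitutions: a plain SHA-256 digest stands in for the UNF, which is a separate algorithm not implemented here, the time-stamp is an ordinary UTC clock reading rather than an irrevocable one, and OpenPGP encryption is omitted because it needs a third-party library. The function names are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seal(package: Path) -> dict:
    """Record a fingerprint and a time-stamp for a data package.

    Assumption: SHA-256 is used in place of the UNF, and the time-stamp
    is a local clock reading, not an irrevocable third-party one.
    """
    return {
        "file": package.name,
        "sha256": sha256_of(package),
        "sealed_at": datetime.now(timezone.utc).isoformat(),
    }


def is_unaltered(package: Path, record: dict) -> bool:
    """Check that the package still matches the fingerprint in record."""
    return hmac.compare_digest(sha256_of(package), record["sha256"])
```

The record returned by `seal` would need to be stored somewhere the package owner cannot quietly rewrite, otherwise a later `is_unaltered` check proves little.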

DataStage kernel

Data Packaging

DataBank or other SWORD repository

DataStage kernel

Data Packaging

Data Security

Security wrapper

SWORD deposit protocol

Raspberry Pi computer

Designed by David Braben of the Raspberry Pi Foundation in Cambridge

First released on 29 February 2012

Size of a credit card, and cost ~£25 for a configured system

Intended to stimulate the teaching of basic computer science in schools

Raspberry Pi computer – schematic

Ethernet port, two USB ports, HDMI monitor socket

700 MHz ARM processor running Linux

Programmable in Python, C, BBC Basic

256 MB RAM (about eight thousand times the 32 KB capacity of the BBC Micro Model B)

Storage on SD card (16 GB card costs about £10)

Samba file sharing permits connection to external drives

Pi Store (aka DataStage) for classroom data integration

Pi Store

One Pi Store for each class

A cloud-based data integration solution

Each pupil has a private directory to store stuff

Accessible from school or from home

The teacher has access to all pupils’ folders,

for example to permit marking homework

DataStage folders

Typically a researcher will use their private folder for daily work

The research group leader can read files in that folder

Files placed in the Shared folder can also be read by other group members, and those placed in the Collaborative folder can be written and read by all
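The folder rules above amount to a small access table, sketched here in Python. The names `Folder`, `Role` and `allowed` are hypothetical, and two points are assumptions rather than statements from the slides: that an owner can always read and write their own folders, and that the group leader can read Shared folders like any other member.

```python
from enum import Enum


class Folder(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    COLLABORATIVE = "collaborative"


class Role(Enum):
    OWNER = "owner"    # the researcher whose folder it is
    LEADER = "leader"  # the research group leader
    MEMBER = "member"  # any other group member


def allowed(role: Role, folder: Folder, action: str) -> bool:
    """Return True if role may perform action ("read" or "write")."""
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action}")
    if folder is Folder.COLLABORATIVE:
        return True  # written and read by all
    if role is Role.OWNER:
        return True  # assumption: owners have full access to their folders
    if action == "write":
        return False  # only the owner writes to Private and Shared
    if folder is Folder.SHARED:
        return True  # readable by other group members
    return role is Role.LEADER  # Private: the group leader can read
```

So `allowed(Role.MEMBER, Folder.PRIVATE, "read")` is refused, while `allowed(Role.LEADER, Folder.PRIVATE, "read")` is granted.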

DataStage metadata are limited

Intentionally, DataStage metadata are limited to author, title, identifier, date and

description

This is to encourage researchers to submit datasets to their repository, bearing

in mind Graham’s concept of ‘curation by addition’

Additional rich metadata can be included in a separate metadata file as part of

the entire data package, in XML or RDF format

DataBank can recognize such a file and index the metadata, extracting

elements for inclusion in the RDF manifest
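A manifest carrying just the five fields named above (author, title, identifier, date, description) could be serialised as RDF/XML along the following lines. The mapping onto Dublin Core terms is an assumption, since the slides do not say which vocabulary the DataStage manifest uses, and `build_manifest` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT_NS = "http://purl.org/dc/terms/"

# Assumed mapping from the five DataStage fields to Dublin Core terms.
FIELDS = {
    "author": "creator",
    "title": "title",
    "identifier": "identifier",
    "date": "date",
    "description": "description",
}


def build_manifest(package_uri: str, metadata: dict) -> str:
    """Return an RDF/XML string describing one data package."""
    unknown = set(metadata) - set(FIELDS)
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")

    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCT_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": package_uri}
    )
    # One property element per supplied field, in a fixed order.
    for field, term in FIELDS.items():
        if field in metadata:
            ET.SubElement(description, f"{{{DCT_NS}}}{term}").text = metadata[field]
    return ET.tostring(root, encoding="unicode")
```

Rejecting any field outside the five keeps the sketch in line with the intentionally minimal metadata requirement. Richer metadata would travel in its own file inside the package.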

Separately from the DataFlow Project, we have been developing a minimal

metadata information model for describing a research investigation and the

various research outputs (papers, datasets, protocols, workflows, etc.) that may

result from the investigation

Tanya Gray has encoded this as an XML model, and can dynamically create

from that model a Web form in which to enter such metadata

Such rich metadata can form part of a DataStage data package

MIIDI data model - Minimal information for an Infectious Disease Investigation

The MIIDI input form for Research Investigation information

The MIIDI input form for Journal Article information

MIIRO data model - Minimal information for Investigations and Research Outputs