Image BioInformatics Research Group
Department of Zoology
University of Oxford, UK
http://ibrg.zoo.ox.ac.uk
DataFlow VIDaaS Launch Event
Saïd Business School, Oxford University, 2 March 2012
The JISC UMF DataFlow Project
Introduction to DataStage
© David Shotton, 2012. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence
e-mail: [email protected]
David Shotton (PI, JISC UMF DataFlow Project)
http://www.dataflow.ox.ac.uk
And the winning platform is . . .
At Queen Mary College, there is a JISC MRD Project entitled
Sustainable Management of Digital Music Research Data
http://rdm.c4dm.eecs.qmul.ac.uk/
After carefully reviewing several data management systems last December,
including Fedora Commons, DataVerse and DSpace, they concluded:
“On paper, DataFlow is a winner: it meets (almost) all our requirements,
especially because of DataStage, something other platforms don't offer.
“DataStage would be particularly appreciated, because it would make the
integration of the system in the research workflow much less disruptive.
“Sadly, the availability of DataFlow software will come too late to be useful
for our short project (October 2011 – March 2012).”
Well, now the DataFlow software systems, DataStage and DataBank,
are available, and we hope they will meet the needs of many of you here
Why don’t researchers publish data?
Three pressures presently prevent researchers from publishing their data
Information overload and pressure of work
With twenty new papers each week, a researcher can never catch up – there is just too much new scientific information being produced now
Have to run to stand still - no time for ‘fringe’ activities like data curation
Departmental pressure for financial viability, determined by the REF
pressure to win grants and to publish in high impact journals
negligible incentives and academic reward in terms of peer esteem, tenure or promotion for data publication activities
Cognitive overhead and skill barriers to best-practice data management
metadata concepts are foreign to most biomedical researchers
large amount of effort involved in preparing data for publication
[From evidence submitted 5 August 2011 to the Royal Society’s Science as a Public Enterprise policy study]
Easing the pain of data archiving and publication
- the principle of ‘sheer curation’ (http://en.wikipedia.org/wiki/Sheer_curation)
Create a data management infrastructure that:
works with you rather than against you
accommodates the data management tools with
which you are already familiar (e.g. spreadsheets)
provides services that are of immediate benefit in
your day-to-day activities (e.g. shared file access)
makes data management, data publication and data
archiving activities sufficiently lightweight, intuitive
and ‘transparent’ that they are easily achieved,
without imposing a significant cognitive overhead
By achieving this, we can bridge the gap between
laboratory and repository
Making data management as simple as possible
Managing data using a two-tier infrastructure
Researchers can save files to a secure private DataStage file store
This is purely for their own benefit
‘Just a file store’ - does not pose a cognitive overhead – “sheer curation”
Requires no software installation on the researchers’ computers
Designed for deployment at the research group level, locally or on a cloud
Primary access is as a mapped network drive, “Drive D:”, on each computer
You save files to DataStage just as you would to your local hard drive
No restrictions or limitations of file type – whatever you normally use
Web access allows users to browse files within DataStage
Advantages over a cheap hard drive from PC World under your desk:
Regular nightly automated backup – no need to remember to do so
Private, shared and collaborative areas, with controlled group access
Additional Web interface to DataStage, using the same user credentials
Can invite overseas colleagues to access your files, via password control
Tier One: DataStage
Managing data using a two-tier infrastructure
The special Web submission interface permits researchers to select and
package data files for publication and long-term repository archiving
Easy to do
When the researcher is ready
Minimal metadata requirement, to encourage usage
The selected files are put in a special directory, with optional sub-directories
The files are accompanied by simple metadata, stored as an RDF manifest
It is possible to represent data files stored elsewhere using URIs
useful for large data files that already have stable storage locations
Packaging uses the BagIt file packaging specification from the California Digital
Library (https://wiki.ucop.edu/display/Curation/BagIt)
The resulting files are then zipped into a single object for transmission to
DataBank, the institutional data repository
Spanning the tiers: DataStage to DataBank
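As a rough illustration of the packaging step described on this slide, the Python sketch below copies selected files into a BagIt-style directory and zips it into a single object. It is not the DataStage implementation: the function names, the choice of SHA-256 for the payload manifest, and the use of bag-info.txt for descriptive fields are all assumptions made for the example (DataStage itself stores its metadata as an RDF manifest).

```python
import hashlib
import zipfile
from pathlib import Path


def make_bag(files, bag_dir, info):
    """Copy files into bag_dir/data and write minimal BagIt tag files.

    Hypothetical helper: 'info' is a dict of label/value pairs written
    to bag-info.txt (an assumption, not the DataStage RDF manifest).
    """
    bag_dir = Path(bag_dir)
    payload = bag_dir / "data"
    payload.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for src in map(Path, files):
        content = src.read_bytes()
        (payload / src.name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # One manifest line per payload file: checksum, then relative path.
        manifest_lines.append(f"{digest}  data/{src.name}\n")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{key}: {value}\n" for key, value in info.items()),
        encoding="utf-8",
    )
    return bag_dir


def zip_bag(bag_dir, zip_path):
    """Zip the whole bag directory into one file for transmission."""
    bag_dir = Path(bag_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(bag_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the bag's parent so the archive
                # unpacks into a single top-level bag directory.
                archive.write(path, path.relative_to(bag_dir.parent).as_posix())
    return Path(zip_path)
```

Calling make_bag() and then zip_bag() on a handful of files yields one zip file whose top-level directory is the bag; the zip file should be written outside the bag directory so that it does not end up inside itself.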
Managing data using a two-tier infrastructure
DataBank is a scalable data repository designed for institutional deployment
Developed by the Bodleian Library, with a track record in preservation
Cloud-deployable
Easy for a researcher to submit a revised dataset if required
Data packages normally published under a ‘CCZero’ Open Data Waiver
Confidential data packages can be kept in a separate ‘dark’ repository
Data packages assigned DOIs, making them citable (for academic credit)
Optional user-defined embargo period to permit journal article publication
Upon receipt of a DataStage data package, DataBank
unzips the data package to give access to the files,
mints a DOI for the data package, and registers it with DataCite
displays the RDF manifest metadata, and enriches it (e.g. with the DOI)
indexes the metadata, and provides a search and browse interface
DataBank is, in actuality, just an interface layer over a generic object store, as
Neil will explain later this morning
Tier Two: DataBank
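The ingest steps listed on this slide (unzip, assign an identifier, enrich the metadata, index it) can be sketched in a few lines of Python. This is a toy in-memory stand-in, not the DataBank API: the class name, the identifier pattern and the placeholder DOI prefix are all hypothetical, and no registration with DataCite takes place.

```python
import io
import re
import zipfile


class ToyRepository:
    """In-memory illustration of a repository ingest; not DataBank itself."""

    def __init__(self, doi_prefix="10.5072"):
        # Placeholder prefix (assumption); a real deployment would use the
        # prefix allocated to the institution.
        self.doi_prefix = doi_prefix
        self.packages = {}
        self.index = {}

    def ingest(self, zipped_package, metadata):
        # 1. Unzip the data package to give access to the files.
        with zipfile.ZipFile(io.BytesIO(zipped_package)) as archive:
            files = {name: archive.read(name) for name in archive.namelist()}

        # 2. Assign an identifier. A real repository would mint a DOI and
        #    register it with DataCite; here we only build a string.
        doi = f"{self.doi_prefix}/example.{len(self.packages) + 1}"

        # 3. Enrich the submitted metadata with the new identifier.
        record = dict(metadata, identifier=doi)
        self.packages[doi] = {"files": files, "metadata": record}

        # 4. Index every word of the metadata for simple searching.
        for value in record.values():
            for word in re.findall(r"\w+", str(value).lower()):
                self.index.setdefault(word, set()).add(doi)
        return doi

    def search(self, term):
        """Return the identifiers of packages whose metadata contain term."""
        return sorted(self.index.get(term.lower(), set()))
```

A search on any word from the title or description then returns the identifiers of the matching packages, which is the essence of the search and browse interface mentioned above.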
DataFlow software services - summary
[Diagram: researchers use the DataStage file system; a zipped BagIt data package with an RDF metadata manifest passes from DataStage to the DataBank repository, which is used by researchers and other users]
The DataStage / DataBank Beta Launch
The DataFlow Project has involved
taking our initial working DataStage and DataBank prototypes
undertaking a complete code review, rewriting where necessary
improving the user interfaces
preparing the software for deployment in two forms
as a Virtual Machine to run in a VMWare environment
as a Debian Package to install on the Ubuntu operating system
writing documentation to describe the installation and functionality
Beta releases v0.1 of these DataStage and DataBank services are now available
can be run locally or on a cloud
installation easy and customizable (e.g. your name & logo)
enable research groups and institutions to provide their members with
zero-cost data management solutions (apart from hosting costs)
cloud provision can expand and shrink with requirements
no need to build and staff your own local data centre
Acknowledgements . . . thanks to the JISC UMF for funding
and acknowledgement of the excellent work of my DataFlow colleagues:
Bhavana Ananda, Katherine Fletcher, Graham Klyne (IBRG)
Ian Chard, Neil Jefferies, Anusha Ranganathan (Bodleian Library)
Alex Dutton, Joseph Talbot (OU Computing Service)
Gabriel Hanganu, Sander van der Waal (OSS Watch)
Ross Gardler (Open Directive LLP)
Neil Caithness, Matteo Turilli, David Wallom (Oxford e-Research Centre)
Richard Jones, Ben O’Steen (Cottage Labs)
Stephanie Taylor (Critical Eye Communications)
Matthew Barker, Tom Ellis, Alex Hartwig (Canonical Ltd)
. . . time for a user endorsement
Graham Klyne, architect of the original DataStage prototype
Bhavana Ananda, current DataStage developer
. . . and a DataStage demo
Chris Holland, Department of Zoology
New for Beta Release v0.2, early April 2012
Integration of SWORD v2 repository submission protocol
DataStage data packages can be submitted to any SWORD-compliant
repository (e.g. the Dryad Data Repository, www.datadryad.org)
DataBank will be able to ingest data packages from any SWORD client
DataBank, as well as DataStage, will by then have Debian packaging for ease
of deployment onto Ubuntu Linux hosts
Re-inclusion of WebDAV, to permit users to read and write via Web access
Deployment will be tested on a wider range of cloud hosting environments
for both VMWare virtual machine and Debian package installation
including the Eduserv academic cloud
User interface improvement and additional functionality on the basis of existing
plans and user feedback
Leading to a fully-featured release (Version 1.0) in May 2012
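To give a flavour of what SWORD v2 submission of a zipped data package involves, the Python sketch below builds (but does not send) an HTTP POST request carrying the package. The header set reflects our reading of the SWORD v2 profile; the collection URL passed in, the function name and the choice of the SimpleZip packaging identifier are assumptions for illustration, and authentication is omitted.

```python
import urllib.request


def build_deposit_request(collection_url, package_bytes, filename, in_progress=False):
    """Build a SWORD-style binary deposit request for a zipped package.

    Hypothetical helper: collection_url must be supplied by the target
    repository; nothing here is specific to DataStage or DataBank.
    """
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        # Packaging identifier: SimpleZip is assumed here; a repository
        # may advertise other accepted packaging formats.
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "In-Progress": "true" if in_progress else "false",
    }
    return urllib.request.Request(
        collection_url, data=package_bytes, headers=headers, method="POST"
    )
```

The resulting request object could be passed to urllib.request.urlopen() once credentials have been added; the repository's response would then describe the deposit that was created.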
DataFlow services summary – adding SWORD
[Diagram: researchers use the DataStage file system; a zipped BagIt data package with an RDF metadata manifest is sent via the SWORD deposit protocol to the DataBank repository, which is used by researchers and other users]
The conventional research data lifecycle
[Diagram labels: Research plan; Hypothesis formulation and project design; Experimentation and data creation; Raw data in research notebooks and live PC files; Data selection and interpretation; Research results and conclusions; Publication activities; Scholarly publications: conference papers and journal articles; Institutional repositories; Research datasets abandoned on local hard drives or CD-ROMs]
The DataFlow-enhanced research data lifecycle
[Diagram labels: Research plan; Hypothesis formulation and project design; Experimentation and data creation; Raw data in research notebooks and live PC files; Data selection and interpretation; Research results and conclusions; Publication activities; Scholarly publications: conference papers and journal articles; DataStage filestore (private yet sharable); DataBank repository (archived datasets); Open data on Web; Management; Dissemination; Preservation]
So what have we got in DataStage?
‘Just a file store’, appearing as a mapped drive – easy to use
Customizable access controls to suit different types of groups
Does not require software installation on user’s computer
Uses standard software components found on every client machine
Cross-platform – Windows, Mac or Linux
DataStage server hosted on Ubuntu Linux system
Deployable locally, or on a cloud
FREE, apart from hosting costs
Has Web access, permitting Web apps to be built on top
For example, for data packaging and SWORD repository submission
Other Web apps possible . . .
Can be used for more than just storing datasets
Wider applications of DataStage
Escaping the Ivory Tower
Applications in commerce
Applications in education
Adding a security app
Time-stamp each data file using an irrevocable method
Encrypt each data file using, for example, the OpenPGP standard
Create a data package of time-stamped encrypted files
Compute the UNF (Universal Numeric Fingerprint) for the data package, so one
can later ensure that it has not been altered
Applications:
Experimental data security for patent application – e.g. pharmaceuticals
Secure storage of financial data – many commercial companies
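The time-stamping and fingerprinting steps of such a security app might look roughly like the Python sketch below. Several substitutions are made and should be read as assumptions: SHA-256 stands in for the UNF, the local clock stands in for an irrevocable time-stamping method, and the OpenPGP encryption step is left out entirely because it needs an external tool. The function names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data):
    """SHA-256 hex digest; a stand-in for a UNF, not a UNF itself."""
    return hashlib.sha256(data).hexdigest()


def seal_package(files, now=None):
    """Record a timestamp and fingerprint for each file, plus one for the package.

    'files' maps file names to their bytes. The timestamp comes from the
    local clock (or the 'now' argument), which is NOT irrevocable; a real
    system would obtain it from a trusted time-stamping service.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    entries = {
        name: {"sha256": fingerprint(content), "timestamp": stamp}
        for name, content in files.items()
    }
    # Fingerprint a canonical serialization of all per-file entries.
    canonical = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"files": entries, "package_sha256": fingerprint(canonical)}


def is_unaltered(files, seal):
    """Check that the files still match the seal made earlier."""
    if set(files) != set(seal["files"]):
        return False
    for name, content in files.items():
        if fingerprint(content) != seal["files"][name]["sha256"]:
            return False
    canonical = json.dumps(seal["files"], sort_keys=True).encode("utf-8")
    return fingerprint(canonical) == seal["package_sha256"]
```

Storing the seal separately from the data is what makes the later check meaningful: anyone able to alter both the files and the seal could otherwise hide a change.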
[Diagram labels: DataStage kernel; Data Packaging; Data Security; Security wrapper; SWORD deposit protocol; DataBank or other SWORD repository]
Raspberry Pi computer
Developed by the Raspberry Pi Foundation in Cambridge, of which David Braben is a co-founder
First released on 29 February 2012
Size of a credit card, and cost ~£25 for a configured system
Intended to stimulate the teaching of basic computer science in schools
Raspberry Pi computer – schematic
Ethernet port, two USB ports, HDMI monitor socket
700 MHz ARM processor running Linux
Programmable in Python, C, BBC BASIC
256 MB RAM (about eight thousand times the capacity of the BBC Micro Model B)
Storage on SD card (16 GB card costs about £10)
Samba file sharing permits connection to external drives
Pi Store (aka DataStage) for classroom data integration
One Pi Store for each class
A cloud-based data integration solution
Each pupil has a private directory to store stuff
Accessible from school or from home
The teacher has access to all pupils’ folders,
for example to permit marking homework
DataStage folders
Typically a researcher will use his private folder for daily work
The research group leader can read files in that folder
Files placed in the Shared folder can also be read by other group members,
and those placed in the Collaborative folder can be written and read by all
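The folder rules above can be written down as a small access check. The Python sketch below is only one reading of the slide: it assumes that owners have full access to their own files in every area, that the group leader can read but not write a researcher's private folder, and that "all" means all group members. The function and constant names are hypothetical.

```python
AREAS = ("private", "shared", "collaborative")
ACTIONS = ("read", "write")


def is_allowed(area, action, is_owner, is_group_leader=False):
    """Return True if a group member may perform action in the given area."""
    if area not in AREAS or action not in ACTIONS:
        raise ValueError(f"unknown area or action: {area!r}, {action!r}")
    if is_owner:
        # Assumption: owners can always read and write their own files.
        return True
    if area == "private":
        # Only the group leader may look in, and only to read.
        return is_group_leader and action == "read"
    if area == "shared":
        # Other group members may read but not write.
        return action == "read"
    # Collaborative: readable and writable by all group members.
    return True
```

For example, an ordinary group member is refused read access to a colleague's private folder, is allowed to read that colleague's shared folder, and can both read and write in the collaborative folder.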
DataStage metadata are limited
Intentionally, DataStage metadata are limited to author, title, identifier, date and
description
This is to encourage researchers to submit datasets to their repository, bearing
in mind Graham’s concept of ‘curation by addition’
Additional rich metadata can be included in a separate metadata file as part of
the entire data package, in XML or RDF format
DataBank can recognize such a file and index the metadata, extracting
elements for inclusion in the RDF manifest
Separately from the DataFlow Project, we have been developing a minimal
metadata information model for describing a research investigation and the
various research outputs (papers, datasets, protocols, workflows, etc.) that may
result from the investigation
Tanya Gray has encoded this as an XML model, and can dynamically create
from that model a Web form in which to enter such metadata
Such rich metadata can form part of a DataStage data package
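To make the minimal metadata set concrete, the Python sketch below serializes the five DataStage fields (author, title, identifier, date, description) as a small RDF/XML document. The mapping of these fields onto Dublin Core terms (with author as dcterms:creator), the function name and the package URI are assumptions for the example, not a description of the manifest that DataStage actually writes.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Assumed mapping of the five DataStage fields onto Dublin Core terms.
FIELDS = ("creator", "title", "identifier", "date", "description")


def build_manifest(package_uri, metadata):
    """Return an RDF/XML string describing one data package.

    Only the keys named in FIELDS are written; anything else in
    'metadata' is ignored.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": package_uri}
    )
    for field in FIELDS:
        if field in metadata:
            element = ET.SubElement(description, f"{{{DCTERMS_NS}}}{field}")
            element.text = metadata[field]
    return ET.tostring(root, encoding="unicode")
```

Richer, discipline-specific metadata would travel as a separate file alongside such a manifest rather than being forced into these five fields.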
MIIDI data model - Minimal information for an Infectious Disease Investigation
The MIIDI input form for Research Investigation information
The MIIDI input form for Journal Article information
MIIRO data model - Minimal information for Investigations and Research Outputs