Overview of DataFlow


  • 7/31/2019 Overview of DataFlow

    1/29

    Image BioInformatics Research Group, Department of Zoology

    University of Oxford, UK

    http://ibrg.zoo.ox.ac.uk

    DataFlow VIDaaS Launch Event

    Saïd Business School, Oxford University, 2 March 2012

    The JISC UMF DataFlow Project

    Introduction to DataStage

    David Shotton, 2012. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence

    e-mail: [email protected]

    David Shotton (PI, JISC UMF DataFlow Project)

    http://www.dataflow.ox.ac.uk

    And the winning platform is . . .

    At Queen Mary College, there is a JISC MRD Project entitled

    Sustainable Management of Digital Music Research Data

    http://rdm.c4dm.eecs.qmul.ac.uk/

    After carefully reviewing several data management systems last December,

    including Fedora Commons, DataVerse and DSpace, they concluded:

    On paper, DataFlow is a winner: it meets (almost) all our requirements, especially because of DataStage, something other platforms don't offer.

    DataStage would be particularly appreciated, because it would make the

    integration of the system in the research workflow much less disruptive.

    Sadly, the availability of DataFlow software will come too late to be useful

    for our short project (October 2011 to March 2012).

    Well, now the DataFlow software systems, DataStage and DataBank,

    are available, and we hope they will meet the needs of many of you here.

    Why don't researchers publish data?

    Three pressures presently prevent researchers from publishing their data

    Information overload and pressure of work

    With twenty new papers each week, a researcher can never catch up: there is just too much new scientific information being produced now

    Have to run to stand still - no time for fringe activities like data curation

    Departmental pressure for financial viability, determined by the REF

    pressure to win grants and to publish in high impact journals

    negligible incentives and academic reward in terms of peer esteem, tenure or promotion for data publication activities

    Cognitive overhead and skill barriers to best-practice data management

    metadata concepts are foreign to most biomedical researchers

    large amount of effort involved in preparing data for publication

    [From evidence submitted 5 August 2011 to the Royal Society's Science as a Public Enterprise policy study]

    Easing the pain of data archiving and publication

    - the principle of sheer curation

    (http://en.wikipedia.org/wiki/Sheer_curation)

    Create a data management infrastructure that:

    works with you rather than against you

    accommodates the data management tools with

    which you are already familiar (e.g. spreadsheets)

    provides services that are of immediate benefit in

    your day-to-day activities (e.g. shared file access)

    makes data management, data publication and data

    archiving activities sufficiently lightweight, intuitive

    and transparent that they are easily achieved,

    without imposing a significant cognitive overhead

    By achieving this, we can bridge the gap between

    laboratory and repository

    Making data management as simple as possible

    Managing data using a two-tier infrastructure

    Researchers can save files to a secure private DataStage file store

    This is purely for their own benefit

    Just a file store: it imposes no cognitive overhead (sheer curation)

    Requires no software installation on the researchers' computers

    Designed for deployment at the research group level, locally or on a cloud

    Primary access is as a mapped network drive, Drive D:, on each computer

    You save files to DataStage just as you would to your local hard drive

    No restrictions or limitations on file type: use whatever you normally use

    Web access allows users to browse files within DataStage

    Advantages over a cheap hard drive from PC World under your desk:

    Regular nightly automated backup: no need to remember to do so

    Private, shared and collaborative areas, with controlled group access

    Additional Web interface to DataStage, using the same user credentials

    Can invite overseas colleagues to access your files, via password control

    Tier One: DataStage

    Managing data using a two-tier infrastructure

    The special Web submission interface permits researchers to select and

    package data files for publication and long-term repository archiving

    Easy to do

    When the researcher is ready

    Minimal metadata requirement, to encourage usage

    The selected files are put in a special directory, with optional sub-directories

    The files are accompanied by simple metadata stored as an RDF manifest

    It is possible to represent data files stored elsewhere using URIs

    useful for large data files that already have stable storage locations

    Packaging uses the BagIt file packaging specification from the California Digital

    Library (https://wiki.ucop.edu/display/Curation/BagIt)

    The resulting files are then zipped into a single object for transmission to

    DataBank, the institutional data repository

    Spanning the tiers: DataStage to DataBank
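The packaging step described on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the DataStage implementation: the function name `make_bag`, the choice of SHA-256 for the payload manifest, and the BagIt version string are assumptions, and the RDF manifest is left out to keep the sketch short.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def make_bag(files: dict[str, bytes], bag_dir: Path) -> Path:
    """Write `files` into a minimal BagIt-style bag and zip it (hypothetical helper)."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for name, content in sorted(files.items()):
        # Payload files live under data/ and are listed with their checksum.
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration file required by the BagIt specification.
    (bag_dir / "bagit.txt").write_bytes(
        b"BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-sha256.txt").write_bytes(
        ("\n".join(manifest_lines) + "\n").encode("utf-8")
    )

    # Zip the whole bag directory into a single object for transmission.
    zip_path = bag_dir.parent / (bag_dir.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(bag_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(bag_dir.parent).as_posix())
    return zip_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        zipped = make_bag({"readings.csv": b"t,value\n0,1.5\n"}, Path(tmp) / "my-dataset")
        with zipfile.ZipFile(zipped) as zf:
            print(sorted(zf.namelist()))
```

Keeping the checksum manifest inside the package means the receiving repository can verify every payload file after unzipping, without trusting the transport.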

    Managing data using a two-tier infrastructure

    DataBank is a scalable data repository designed for institutional deployment

    Developed by the Bodleian Library, with a track record in preservation

    Cloud-deployable

    Easy for a researcher to submit a revised dataset if required

    Data packages normally published under a CCZero Open Data Waiver

    Confidential data packages can be kept in a separate dark repository

    Data packages assigned DOIs, making them citable (for academic credit)

    Optional user-defined embargo period to permit journal article publication

    Upon receipt of a DataStage data package, DataBank

    unzips the data package to give access to the files,

    mints a DOI for the data package, and registers it with DataCite

    displays the RDF manifest metadata, and enriches it (e.g. with the DOI)

    indexes the metadata, and provides a search and browse interface

    DataBank is, in actuality, just an interface layer over a generic object store, as

    Neil will explain later this morning

    Tier Two: DataBank
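The four steps DataBank performs on receipt of a package (unzip, mint an identifier, enrich the metadata, index it) can be illustrated with a toy in-memory model. Everything here is hypothetical: the class `ToyRepository`, the placeholder identifier prefix, and the word-level index are assumptions made for illustration. The sketch does not contact DataCite and is not DataBank's code.

```python
import io
import zipfile
from collections import defaultdict


class ToyRepository:
    """In-memory stand-in for a data repository (illustrative only)."""

    def __init__(self, doi_prefix="10.5072/example"):
        self.doi_prefix = doi_prefix           # placeholder prefix, an assumption
        self.packages = {}                     # identifier -> files and metadata
        self.index = defaultdict(set)          # lower-cased word -> identifiers

    def ingest(self, zip_bytes, metadata):
        # 1. Unzip the package to give access to the individual files.
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
            files = {name: zf.read(name) for name in zf.namelist()}
        # 2. "Mint" an identifier; a real service would register it with DataCite.
        doi = f"{self.doi_prefix}.{len(self.packages) + 1}"
        # 3. Enrich the submitted metadata with the new identifier.
        enriched = dict(metadata, identifier=doi)
        self.packages[doi] = {"files": files, "metadata": enriched}
        # 4. Index every word of every metadata value for search and browse.
        for value in enriched.values():
            for word in str(value).lower().split():
                self.index[word].add(doi)
        return doi

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("my-dataset/data/readings.csv", b"t,value\n0,1.5\n")
    repo = ToyRepository()
    doi = repo.ingest(buffer.getvalue(), {"title": "Silk tensile tests"})
    print(doi, repo.search("silk"))
```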

    DataFlow software services - summary

    [Diagram: DataStage file system (researchers) and DataBank repository (researchers, other users), linked by a zipped BagIt Data Package with RDF metadata manifest]

    The DataStage / DataBank Beta Launch

    The DataFlow Project has involved

    taking our initial working DataStage and DataBank prototypes

    undertaking a complete code review, rewriting where necessary

    improving the user interfaces

    preparing the software for deployment in two forms

    as a Virtual Machine to run in a VMware environment

    as a Debian Package to install on the Ubuntu operating system

    writing documentation to describe the installation and functionality

    Beta releases v0.1 of these DataStage and DataBank services are now available

    can be run locally or on a cloud

    installation easy and customizable (e.g. your name & logo)

    enable research groups and institutions to provide their members with

    zero-cost data management solutions (apart from hosting costs)

    cloud provision can expand and shrink with requirements

    no need to build and staff your own local data centre

    Acknowledgements . . . thanks to the JISC UMF for funding

    and acknowledgement of the excellent work of my DataFlow colleagues:

    Bhavana Ananda, Katherine Fletcher, Graham Klyne (IBRG)

    Ian Chard, Neil Jefferies, Anusha Ranganathan (Bodleian Library)

    Alex Dutton, Joseph Talbot (OU Computing Service)

    Gabriel Hanganu, Sander van der Waal (OSS Watch)

    Ross Gardler (Open Directive LLP)

    Neil Caithness, Matteo Turilli, David Wallom (Oxford e-Research Centre)

    Richard Jones, Ben O'Steen (Cottage Labs)

    Stephanie Taylor (Critical Eye Communications)

    Matthew Barker, Tom Ellis, Alex Hartwig (Canonical Ltd)

    . . . time for a user endorsement

    Graham Klyne, architect of the original DataStage prototype

    Bhavana Ananda, current DataStage developer

    . . . and a DataStage demo

    Chris Holland, Department of Zoology

    New for Beta Release v0.2, early April 2012

    Integration of the SWORD v2 repository submission protocol

    DataStage data packages can be submitted to any SWORD-compliant

    repository (e.g. the Dryad Data Repository, www.datadryad.org)

    DataBank will be able to ingest data packages from any SWORD client

    DataBank, as well as DataStage, will by then have Debian packaging for ease

    of deployment onto Ubuntu Linux hosts

    Re-inclusion of WebDAV, to permit users to read and write via Web access

    Deployment will be tested on a wider range of cloud hosting environments

    for both VMWare virtual machine and Debian package installation

    including the Eduserv academic cloud

    User interface improvement and additional functionality on the basis of existing

    plans and user feedback

    Leading to a fully-featured release (Version 1.0) in May 2012

    DataFlow services summary: adding SWORD

    [Diagram: DataStage file system (researchers) and DataBank repository (researchers, other users), linked by a zipped BagIt Data Package with RDF metadata manifest sent via the SWORD deposit protocol]
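As a rough illustration of what a SWORD v2 binary deposit of a zipped package involves, the sketch below builds the HTTP headers a client would send with the package body. The header names follow the SWORD v2 profile, but the function name, the choice of the SimpleZip packaging URI, and the hex encoding of the MD5 checksum are assumptions; the slides do not say which values DataStage actually sends. No network request is made.

```python
import hashlib


def sword_deposit_headers(package: bytes, filename: str, in_progress: bool = False) -> dict:
    """Build headers for a SWORD v2 binary deposit (hypothetical helper)."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        # Checksum lets the server detect a corrupted upload (hex form assumed).
        "Content-MD5": hashlib.md5(package).hexdigest(),
        # Packaging URI is an assumption; a server may expect a different one.
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        # "true" tells the server that more content will follow before completion.
        "In-Progress": "true" if in_progress else "false",
    }


if __name__ == "__main__":
    for name, value in sword_deposit_headers(b"example bytes", "my-dataset.zip").items():
        print(f"{name}: {value}")
```

These headers would accompany an HTTP POST of the zip file to the collection URI advertised by the repository's SWORD service document.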

    The conventional research data lifecycle

    [Diagram labels]

    Scholarly publications: conference papers and journal articles

    Raw data in research notebooks and live PC files

    Research results and conclusions

    Data selection and interpretation

    Publication activities

    Research datasets abandoned on local hard drives or CD-ROMs

    Hypothesis formulation and project design

    Experimentation and data creation

    Research plan

    Institutional repositories

    The DataFlow-enhanced research data lifecycle

    [Diagram labels]

    Scholarly publications: conference papers and journal articles

    Raw data in research notebooks and live PC files

    Research results and conclusions

    Hypothesis formulation and project design

    Experimentation and data creation

    Data selection and interpretation

    Publication activities

    Research plan

    DataBank repository

    Archived datasets

    DataStage filestore

    Private yet sharable

    Open data on Web

    Management

    Dissemination

    Preservation

    So what have we got in DataStage?

    Just a file store, appearing as a mapped drive: easy to use

    Customizable access controls to suit different types of groups

    Does not require software installation on the user's computer

    Uses standard software components found on every client machine

    Cross-platform: Windows, Mac or Linux

    DataStage server hosted on Ubuntu Linux system

    Deployable locally, or on a cloud

    FREE, apart from hosting costs

    Has Web access, permitting Web apps to be built on top

    For example, for data packaging and SWORD repository submission

    Other Web apps possible . . .

    Can be used for things other than just storing datasets

    Wider applications of DataStage

    Escaping the Ivory Tower

    Applications in commerce

    Applications in education

    Adding a security app

    Time-stamp each data file using an irrevocable method

    Encrypt each data file using, for example, the OpenPGP standard

    Create a data package of time-stamped encrypted files

    Compute the UNF (Universal Numeric Fingerprint) for the data package, so one

    can later ensure that it has not been altered

    Applications:

    Experimental data security for patent applications, e.g. pharmaceuticals

    Secure storage of financial data: many commercial companies

    [Diagram labels: DataStage kernel; Data Packaging; Data Security; Security wrapper; SWORD deposit protocol; DataBank or other SWORD repository]
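The last step of the proposed security app, fingerprinting a package so that later alteration can be detected, can be sketched as follows. This is a deliberately simplified stand-in: it uses SHA-256 over file names and contents rather than the UNF algorithm named on the slide, and it omits the time-stamping and OpenPGP encryption steps, which need services and libraries outside the Python standard library. The function name `package_fingerprint` is hypothetical.

```python
import hashlib


def package_fingerprint(files: dict[str, bytes]) -> str:
    """Return one hex digest covering every file name and its content.

    SHA-256 is swapped in here for illustration; it is not the UNF algorithm.
    """
    outer = hashlib.sha256()
    for name in sorted(files):             # sorted so the result is order-independent
        outer.update(name.encode("utf-8"))
        outer.update(b"\0")                # separator between name and content digest
        outer.update(hashlib.sha256(files[name]).digest())
    return outer.hexdigest()


if __name__ == "__main__":
    original = {"assay.csv": b"dose,response\n1,0.2\n", "notes.txt": b"batch 7"}
    recorded = package_fingerprint(original)

    tampered = dict(original, **{"assay.csv": b"dose,response\n1,0.9\n"})
    print("unchanged package verifies:", package_fingerprint(original) == recorded)
    print("tampered package verifies:", package_fingerprint(tampered) == recorded)
```

Recording the fingerprint somewhere outside the package at deposit time is what makes the later comparison meaningful.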

    Raspberry Pi computer

    Designed by David Braben of the Raspberry Pi Foundation in Cambridge

    First released on 29 February 2012

    Size of a credit card, and costs ~£25 for a configured system

    Intended to stimulate the teaching of basic computer science in schools

    Raspberry Pi computer schematic

    Ethernet port, two USB ports, HDMI monitor socket

    700 MHz ARM processor running Linux

    Programmable in Python, C, BBC Basic

    256 MB RAM (about eight thousand times the capacity of the BBC Micro Model B)

    Storage on SD card (16 GB card costs about £10)

    Samba file sharing permits connection to external drives

    Pi Store (aka DataStage) for classroom data integration

    Pi Store

    One Pi Store for each class

    A cloud-based data integration solution

    Each pupil has a private directory to store stuff

    Accessible from school or from home

    The teacher has access to all pupils' folders,

    for example to permit marking homework

    DataStage folders

    Typically a researcher will use his private folder for daily work

    The research group leader can read files in that folder

    Files placed in the Shared folder can also be read by other group members,

    and those placed in the Collaborative folder can be written and read by all
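The access rules for the three folder types can be written down as a small decision function. This is one reading of the slide, not DataStage's actual access-control code: the names `Group` and `can_access` are hypothetical, and the sketch assumes that owners can always read and write their own files, that "all" means all members of the research group, and that people outside the group have no access.

```python
from dataclasses import dataclass

AREAS = ("private", "shared", "collaborative")


@dataclass(frozen=True)
class Group:
    leader: str
    members: frozenset


def can_access(group: Group, owner: str, area: str, user: str, action: str) -> bool:
    """Decide whether `user` may "read" or "write" files in `owner`'s `area` folder."""
    if area not in AREAS or action not in ("read", "write"):
        raise ValueError("unknown area or action")
    if user != group.leader and user not in group.members:
        return False                      # outsiders get nothing in this sketch
    if user == owner:
        return True                       # owners control their own files
    if area == "collaborative":
        return True                       # readable and writable by the whole group
    if action == "read":
        # Shared files are readable by the group; private ones only by the leader.
        return area == "shared" or user == group.leader
    return False


if __name__ == "__main__":
    lab = Group(leader="pat", members=frozenset({"ana", "ben"}))
    print(can_access(lab, "ana", "private", "pat", "read"))        # leader reads private
    print(can_access(lab, "ana", "private", "ben", "read"))        # colleague cannot
    print(can_access(lab, "ana", "collaborative", "ben", "write")) # anyone in group writes
```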

    DataStage metadata are limited

    Intentionally, DataStage metadata are limited to author, title, identifier, date and

    description

    This is to encourage researchers to submit datasets to their repository, bearing

    in mind Graham's concept of curation by addition

    Additional rich metadata can be included in a separate metadata file as part of

    the entire data package, in XML or RDF format

    DataBank can recognize such a file and index the metadata, extracting

    elements for inclusion in the RDF manifest

    Separately from the DataFlow Project, we have been developing a minimal

    metadata information model for describing a research investigation and the

    various research outputs (papers, datasets, protocols, workflows, etc.) that may result from the investigation

    Tanya Gray has encoded this as an XML model, and can dynamically create

    from that model a Web form in which to enter such metadata

    Such rich metadata can form part of a DataStage data package
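To make the five core fields concrete, the sketch below serialises author, title, identifier, date and description as a small RDF document in Turtle syntax. The mapping onto Dublin Core terms, the function name `manifest_turtle`, and the example subject URI are assumptions made for illustration; the slides do not specify which vocabulary or serialisation the DataStage manifest uses.

```python
DCTERMS = "http://purl.org/dc/terms/"

# Assumed mapping from the five DataStage fields to Dublin Core properties.
FIELD_TO_PROPERTY = {
    "author": "creator",
    "title": "title",
    "identifier": "identifier",
    "date": "date",
    "description": "description",
}


def escape_literal(text: str) -> str:
    """Escape characters that are not allowed raw inside a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def manifest_turtle(subject_uri: str, metadata: dict) -> str:
    """Render the five core fields as Turtle triples about `subject_uri`."""
    missing = [field for field in FIELD_TO_PROPERTY if field not in metadata]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    triples = [
        f'    dcterms:{prop} "{escape_literal(metadata[field])}"'
        for field, prop in FIELD_TO_PROPERTY.items()
    ]
    lines = [f"@prefix dcterms: <{DCTERMS}> .", "", f"<{subject_uri}>"]
    lines.append(" ;\n".join(triples) + " .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(manifest_turtle(
        "urn:example:dataset-1",          # hypothetical identifier for the package
        {
            "author": "A. Researcher",
            "title": "Silk tensile tests",
            "identifier": "dataset-1",
            "date": "2012-03-02",
            "description": 'Raw force "curves" from the spring 2012 series',
        },
    ))
```

Keeping the required set this small is what makes submission cheap; richer descriptions can travel in a separate metadata file inside the same package.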

    MIIDI data model: Minimal Information for an Infectious Disease Investigation

    The MIIDI input form for Research Investigation information

    The MIIDI input form for Journal Article information

    MIIRO data model: Minimal Information for Investigations and Research Outputs