Data Sharing, Management and Reuse
Why CyVerse helps open science
Dr. Robert Davey
Head of Research e-Infrastructure
@froggleston
www.earlham.ac.uk
What is data stewardship?
“Processes, policies, guidelines and responsibilities for administering organisations' entire data in compliance with policy and/or regulatory obligations”
Wikipedia
Who
• Custodians, curators, data scientists, etc.
What
• Data and metadata
How
• Standards
• Controls
• Data Entry
• Data stewardship must have a well-defined purpose, or fitness
– data recording policy
– data access policy
– computer interoperability policy
• Sounds simple enough
Data stewardship matters
We couldn’t do the science we do without access to data and analysis
The data-driven process
[Diagram labels: data standards; assembly; variation; function; model integration; researchers & collaborators; experiments & hypotheses; interpretation; data resources]
Data is a mess
So what can we do about it?
• Put data where humans and computers can see it
• Describe data in a common way
• Be committed to reproducibility
• Connect up resources to build ecosystems
Managing your data
• Useful to think about project management
• What data are you going to collect?
• Where are you going to store it?
• Is anyone going to work on your data at the same time?
• How can you make sure your data hasn’t changed?
• Organise your project folders!
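One common way to answer "How can you make sure your data hasn't changed?" is to record a checksum for every file when the data is collected and compare against it later. The following is a minimal Python sketch of that idea, not a prescribed workflow: the function names (`sha256_of`, `build_manifest`, `changed_files`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (as a relative POSIX-style path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(root)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

A manifest built when the data arrives can be stored outside the data folder (for example alongside the project notes) and re-checked with `changed_files` before each analysis or upload.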
Managing your metadata
• https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424
• What metadata are you going to record?
• Where are you going to keep it?
• What format will it be in?
• Are you going to share this with anyone else?
• How can you make sure others can understand your work?
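The questions above (what to record, where to keep it, what format) could be answered in many ways. One simple option is a machine-readable "sidecar" file stored next to each data file. The Python sketch below assumes JSON as the format and a small set of required fields; the field names in `REQUIRED_FIELDS` and the `.meta.json` suffix are hypothetical choices for illustration, not a community standard.

```python
import json
from pathlib import Path

# Hypothetical minimal field set; a real project should follow its community's standard.
REQUIRED_FIELDS = ("title", "creator", "created", "description", "licence")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or blank in the record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]


def write_sidecar(data_file: Path, record: dict) -> Path:
    """Write the record as JSON next to data_file, refusing incomplete metadata."""
    missing = missing_fields(record)
    if missing:
        raise ValueError(f"metadata is missing: {', '.join(missing)}")
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```

Keeping the metadata in a plain, structured format beside the data means both people and programs can read it, and the check for missing fields catches gaps at the time of recording rather than at the time of sharing.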
What is metadata?
• Why is metadata useful?
• Where might you keep metadata?
• What are the possible problems with metadata?
• Are these problems to do with people or computers?
• Where might you find information about data you want?
• How might you start getting that data?
Data is a mess
So what can we do about it?
• Describe data in a common way using human terms
• Supply this data in computer-friendly formats
• Put data where humans and computers can see it
• Be FAIR!
FAIR data
• Findable, Accessible, Interoperable, Reusable
https://www.nature.com/articles/sdata201618
Research Data Lifecycles
Managing your data
http://www.pnas.org/content/115/11/2584.short
Managing your data
• Journals often require that data and code are available
• Reviewers may reject papers with no FAIR outputs
– especially in informatics / data science
• Specific scientific data repositories
– EMBL-EBI / NCBI, custom databases
• General repos
– Zenodo
– Figshare
– Data Dryad
– GitHub / Bitbucket
Earlham Institute, Norwich Research Park, Norwich, Norfolk, NR4 7UG, UK
Put data where humans and computers can see it
• CyVerse is a huge and versatile research infrastructure
• Helps thousands of users collaborate on data and analysis
• Extensive compute and storage resources
• Committed to data stewardship practices
– Data Store - iRODS data grid
– Data Commons - descriptions/identifiers for public/user-curated data
– Discovery Environment - GUI for end users
– Agave API - programmatic interfaces to all of the above
• FAIR data compliance
• Potential to power huge federated services
• EI is the hardware and middleware site for CyVerse UK
• First international node outside the US
• Opened for use by UK community in November 2016
• New collaborations in Australia and Austria
• http://cyverseuk.org/
• http://www.cyverse.org/learning-center/manage-data
• Data allocations
– 100GB per “standard” user (1TB at request)
– Need a login to start storing private / collaborative data
• Data transfer tools
– DE - web-based, slower transfers of 2GB max
– CyberDuck - graphical interface (GUI) that runs on the desktop for fast transfers
– iRODS icommands - specialised command line (CLI) tools for fast transfers
– FUSE - slow transfers, appears as a drive or folder on your computer
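The limits listed above can be turned into a quick pre-transfer check. This Python sketch is only an illustration built from the figures on this slide: it assumes "GB" means 10^9 bytes, and the helper names (`suggest_transfer_tools`, `fits_allocation`) are hypothetical and not part of any CyVerse tool or API.

```python
GB = 10**9  # Assumption: decimal gigabytes; the slide does not say which unit is meant.
DE_MAX_BYTES = 2 * GB                  # DE web transfers: 2GB max
STANDARD_ALLOCATION_BYTES = 100 * GB   # 100GB per "standard" user


def suggest_transfer_tools(size_bytes: int) -> list[str]:
    """Return the transfer options from the slide that suit a file of this size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    tools = []
    if size_bytes <= DE_MAX_BYTES:
        tools.append("Discovery Environment (web upload)")
    # The desktop GUI and command-line tools are described as the fast options.
    tools += ["CyberDuck (desktop GUI)", "iRODS icommands (CLI)"]
    return tools


def fits_allocation(size_bytes: int, used_bytes: int = 0,
                    allocation_bytes: int = STANDARD_ALLOCATION_BYTES) -> bool:
    """Check whether a new upload would stay within the storage allocation."""
    return used_bytes + size_bytes <= allocation_bytes
```

For example, a 5GB file falls outside the web upload limit, so the sketch suggests only CyberDuck or the icommands.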
Main purposes of the project are:
● give research groups access to life science cyberinfrastructure in the UK
● allow the analysis of data on UK-provisioned research HPC
● ensure reproducibility of analysis through application versioning / containerisation
● support data sharing
● distribute documentation and code as open source
CyVerse UK aims to give UK and EU users a geographical advantage, though it is available to users worldwide.
● Common authentication process
● Use of application through the Discovery Environment
BUT… applications need to be duplicated to be run both in the US and in the UK.
● Now providing webservice VMs to:
○ SignaLink
○ COPO
○ Grassroots
Expansion and Community Support
Won a £400k BBSRC 16ALERT grant to expand the compute capacity of CyVerse UK
● Larger core count and RAM
● Better networking and switching
● Project-initiated VM requests to provide specific groups with cloud access
● Planning to include flexible analysis frontends such as the Genomics Virtual Laboratory