Data Sharing, Management and Reuse
Why CyVerse helps open science

Dr. Robert Davey
Head of Research e-Infrastructure
@froggleston


www.earlham.ac.uk

What is data stewardship?

“Processes, policies, guidelines and responsibilities for administering organisations' entire data in compliance with policy and/or regulatory obligations”

Wikipedia

Who
• Custodians, curators, data scientists, etc.

What
• Data and metadata

How
• Standards
• Controls
• Data Entry

Data stewardship matters

We couldn’t do the science we do without access to data and analysis

• Data stewardship must have a well-defined purpose, or fitness
  – data recording policy
  – data access policy
  – computer interoperability policy

• Sounds simple enough

The data-driven process

[Diagram: data standards, assembly, variation, function, model integration, researchers & collaborators, experiments & hypotheses, interpretation, data resources]

Data is a mess

So what can we do about it?

• Put data where humans and computers can see it

• Describe data in a common way

• Be committed to reproducibility

• Connect up resources to build ecosystems

Managing your data

• Useful to think about project management

• What data are you going to collect?

• Where are you going to store it?

• Is anyone going to work on your data at the same time?

• How can you make sure your data hasn’t changed?

• Organise your project folders!
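
One common way to check that data hasn't changed is to record a checksum when a file is first stored and compare it again later. The following is a minimal sketch in Python, assuming SHA-256 is an acceptable digest for your project; the function names `sha256_of` and `is_unchanged` are illustrative and not taken from any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large data files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Compare a file's current digest with one recorded earlier."""
    return sha256_of(path) == recorded_digest
```

Recording the digest alongside the data (for example in a text file in the project folder) lets collaborators run the same check after every transfer.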

Managing your metadata

• https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424

• What metadata are you going to record?

• Where are you going to keep it?

• What format will it be in?

• Are you going to share this with anyone else?

• How can you make sure others can understand your work?
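
One possible answer to the format question is a small machine-readable file kept next to the data it describes. The sketch below writes and reads a JSON "sidecar" file in Python. The `.meta.json` suffix, the helper names and the example field names are assumptions made for illustration, not a community standard.

```python
import json
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # hypothetical naming convention


def sidecar_path(data_path):
    """Return the path of the metadata file that sits beside a data file."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + SIDECAR_SUFFIX)


def write_sidecar(data_path, metadata):
    """Write metadata as sorted, indented JSON next to the data file."""
    target = sidecar_path(data_path)
    target.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return target


def read_sidecar(data_path):
    """Load the metadata previously written for a data file."""
    return json.loads(sidecar_path(data_path).read_text(encoding="utf-8"))
```

For example, `write_sidecar("reads.fastq", {"title": "Pilot run", "creator": "A. Researcher"})` would create `reads.fastq.meta.json` in the same folder, which both people and programs can read.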

What is metadata?

• Why is metadata useful?

• Where might you keep metadata?

• What are the possible problems with metadata?

• Are these problems to do with people or computers?

• Where might you find information about data you want?

• How might you start getting that data?

Data is a mess

So what can we do about it?

• Describe data in a common way using human terms

• Supply this data in computer-friendly formats

• Put data where humans and computers can see it

• Be FAIR!

FAIR data

• Findable, Accessible, Interoperable, Reusable

https://www.nature.com/articles/sdata201618

Research Data Lifecycles

Managing your data

http://www.pnas.org/content/115/11/2584.short

• Journals often require that data and code are available

• Reviewers may reject papers with no FAIR outputs
  – especially in informatics / data science

• Specific scientific data repositories
  – EMBL-EBI / NCBI, custom databases

• General repos
  – Zenodo
  – Figshare
  – Dryad
  – GitHub / Bitbucket

Earlham Institute, Norwich Research Park, Norwich, Norfolk, NR4 7UG, UK
www.earlham.ac.uk

Put data where humans and computers can see it

• CyVerse is a huge and versatile research infrastructure

• Helps thousands of users collaborate on data and analysis

Put data where humans and computers can see it

• Extensive compute and storage resources

• Committed to data stewardship practices
  – Data Store - iRODS data grid
  – Data Commons - descriptions/identifiers for public/user-curated data
  – Discovery Environment - GUI for end users
  – Agave API - programmatic interfaces to all of the above

• FAIR data compliance

• Potential to power huge federated services

• EI is the hardware and middleware site for CyVerse UK

• First international node outside the US

• Opened for use by the UK community in November 2016

• New collaborations in Australia and Austria

• http://cyverseuk.org/

Put data where humans and computers can see it

• http://www.cyverse.org/learning-center/manage-data

• Data allocations
  – 100GB per “standard” user (1TB on request)
  – Need a login to start storing private / collaborative data

• Data transfer tools
  – DE - web-based, slower transfers of 2GB max
  – CyberDuck - graphical interface (GUI) that runs on the desktop for fast transfers
  – iRODS icommands - specialised command line (CLI) tools for fast transfers
  – FUSE - slow transfers, appears as a drive or folder on your computer
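
The allocation and transfer limits on this slide can be turned into a simple planning check. The sketch below is one possible rule of thumb in Python, not CyVerse tooling: the function names are hypothetical, the limits are copied from the slide, and treating "GB" as decimal gigabytes is an assumption.

```python
# Limits taken from the slide; decimal gigabytes (10**9 bytes) are assumed.
DE_MAX_BYTES = 2 * 10**9
STANDARD_ALLOCATION_BYTES = 100 * 10**9


def fits_standard_allocation(used_bytes, new_bytes):
    """Return True if new data fits within the standard 100GB allocation."""
    return used_bytes + new_bytes <= STANDARD_ALLOCATION_BYTES


def suggest_transfer_tool(size_bytes, prefers_command_line=False):
    """Suggest a transfer tool for one upload, based on its size.

    Illustrative rule only: small transfers go through the web-based
    Discovery Environment, larger ones through a desktop or CLI client.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    if size_bytes <= DE_MAX_BYTES:
        return "Discovery Environment"
    return "iRODS icommands" if prefers_command_line else "CyberDuck"
```

A user planning a 50GB upload would be pointed at CyberDuck or the icommands, and could check first whether a larger allocation needs to be requested.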

The main purposes of the project are:

● give research groups access to life science cyberinfrastructure in the UK
● allow the analysis of data on UK-provisioned research HPC
● ensure reproducibility of analysis through application versioning / containerisation
● support data sharing
● distribute documentation and code as open source

CyVerse UK aims to give UK and EU users a geographical advantage, though it is available to users worldwide.

● Common authentication process

● Use of applications through the Discovery Environment

BUT… applications need to be duplicated to be run both in the US and in the UK.

Expansion and Community Support

Won a £400k BBSRC 16ALERT grant to expand the compute capacity of CyVerse UK

● Now providing webservice VMs to:
  ○ SignaLink
  ○ COPO
  ○ Grassroots

● Larger core count and RAM

● Better networking and switching

● Project-initiated VM requests to provide specific groups with cloud access

● Planning to include flexible analysis frontends such as the Genomics Virtual Laboratory
