Supported by

Towards Reproducible Data Analysis

Using Container Technologies

Sergio Maffioletti

EnhanceR project director

UZH/S3IT

https://www.enhancer.ch

Disclaimer

What I’m presenting here is the result of personal experience plus the outcomes of various discussions within the EnhanceR project.

i.e.: if you like the talk, congratulate me… if you don’t, blame EnhanceR

What are we going to talk about ?

• Context
• What is the user story we have in mind ?
• Let’s build the infrastructure support
• Let’s not stop here: building containers for/with end-users
• One more step: what do we put inside the container ?
• Main challenges and open questions

Who is EnhanceR again ?

What problems are we facing ?

Reproducible data analysis

“Reproducibility is just collaboration with people you don’t know, including yourself next week”

- Philip Stark, UC Berkeley Statistics

Context

Repeatability (Same team, same experimental setup): The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.

Replicability (Different team, same experimental setup): The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts.

Reproducibility (Different team, different experimental setup): The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.

Let’s simplify...

Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226-1227.

What is the user story we have in mind ?

On average a researcher:
• develops on a personal server
• changes code and data as research progresses
• finally gets publishable results
  • sometimes running on a large-scale research IT infrastructure
• prepares slides / images / tables / manuscript
• publishes the manuscript
  • at the end of a review process

What is the user story we have in mind ?

Researcher’s side recommendations for Open Science:

● Share data, software, workflows and other digital artifacts.
● Persistent links should appear in the published article for data, code, and digital artifacts.
● Citation should be standard practice, to enable credit for shared digital scholarly objects.
● Document digital scholarly artifacts, to facilitate reuse.
● Use Open Licensing when publishing digital scholarly objects.

What does this mean for a service provider ?

● “Reproducible Data Analysis as a service” implies looking at the full stack of the service*
  ○ infrastructure + tools + competences + policies + best practices + support
● Why ?
  ○ understand the user side; anticipate issues; steer adoption and development; enforce policies; better plan resources.
● At the end ?
  ○ we become a valuable asset for a research group
  ○ we actually help them

* I know - I’m intentionally skipping the business aspect of this...

Let’s build the infrastructure

what container technology

orchestration

integration with resource management

storage for data and container images

deployment and management

monitoring

Let’s build the infrastructure

validation and verification

automated policies

scanning

signing

https://www.docker.com

https://www.enhancer.ch/pipeline

Let’s not stop here: building containers for/with end-users

What to consider:
● automated build / integration with CI/CD
● design strategies
● naming schema
● path binding
● documentation, metadata and runner script

Competences:
● version control - CI/CD
● container build process

Opportunities:
● development best practices
● embed policies
● standardise assumptions
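
As a sketch of how path binding and a runner script might fit together, the Python below only assembles a `docker run` command line with an explicit read-only input mount and a writable output mount. The image name, the `/data/in` and `/data/out` mount points and the `build_run_command` helper are assumptions made for illustration; they are not taken from the EnhanceR guidelines.

```python
from pathlib import Path


def build_run_command(image, input_dir, output_dir, args=()):
    """Assemble (but do not execute) a `docker run` command.

    Assumed convention (hypothetical): inputs are mounted read-only at
    /data/in and results are written to /data/out inside the container.
    """
    in_path = Path(input_dir).resolve()
    out_path = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{in_path}:/data/in:ro",  # read-only input binding
        "-v", f"{out_path}:/data/out",   # writable output binding
        image,
        *args,
    ]


if __name__ == "__main__":
    # Hypothetical image name, pinned to an explicit tag rather than "latest".
    cmd = build_run_command("example/rnaseq:1.0.0", "./input", "./results")
    print(" ".join(cmd))
```

Fixing the in-container paths in one place is one way to "standardise assumptions": the analysis code never needs to know where the data lives on the host.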

Container design strategies

https://www.enhancer.ch/pipeline

What do we put inside the container now ?

https://nbis-reproducible-research.readthedocs.io/en/course_1811/tutorial_intro/

What to consider:
● track software dependencies:
● in-container executions:

Competences:
● track requirements in software development
● software deployment - CI/CD

Opportunities:
● end-user best practices
● better handling of software dependencies
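
One possible way to track software dependencies from inside a Python environment is to record the exact versions that are installed, so that the list can be stored next to the container recipe. This is a minimal sketch using only the standard library; the `pinned_requirements` helper is a hypothetical name, and other ecosystems (R, conda, system packages) would need their own equivalent.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print a pinned list that could be committed alongside the analysis code.
    print("\n".join(pinned_requirements()))
```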

Open questions

Infrastructure / Pull
● what containers shall I allow on my infrastructure ?
● how do I make sure the cited container is exactly what I’m getting ?
● how do I verify and validate containers when we deploy them on our infrastructure ?
● how do I know what the container is doing ?
● how do I know whether the container has the latest security patch ?

Run
● how do I make sure a deployed container runs ‘as documented’ on my data ?
● “how do I find a container that I need for running RNAseq ?”

Build
● what assumptions can I make when building a container and what should I try to avoid ?
  ○ data mapping in and out / user privileges /
● where do I publish my container and how do I get a DOI for the publication ?
● how do I publish my container so that people can find it for their purposes ? (metadata)
● how do I describe/document my container’s behaviour ?
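
For the question of whether a cited container is exactly what one is getting, one partial answer is to compare a cryptographic checksum. The sketch below assumes the image is distributed as a single archive file and that the publication cites the SHA-256 of that file; both are assumptions, and the function names are hypothetical. Container registries identify images by a manifest digest, which is a related but different mechanism not covered here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_cited_digest(path, cited_digest):
    """Return True if the file's SHA-256 equals the cited digest.

    Accepts the digest with or without a leading 'sha256:' prefix.
    """
    expected = cited_digest.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_of_file(path) == expected
```

A matching checksum only shows that the bytes are the ones that were cited; it does not say what the container does or whether it is patched, which the other questions on this slide still leave open.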

Main challenges

● Social
  ○ adoption by end-users
  ○ how to address: “is it worth the investment ?”
● Technical
  ○ scale-out / orchestration
  ○ integration of specialised resources (e.g. GPU)
  ○ multi-tenancy - privileges
  ○ documented assumptions within the containers
  ○ maintenance
    ■ bugfix and security
  ○ portability vs performance

Acknowledgments

● Guidelines for pipeline interoperability using containers
  ○ https://www.enhancer.ch/pipeline
● Survey for Research IT Infrastructure providers
  ○ https://forms.gle/JBW78qDPWabd4GDR8