1
Scientific workflows Best practices for workflow design: how to prevent workflow decay Kristina Hettne 1 , Katy Wolstencroft 2 , Khalid Belhajjame 2 , Carole Goble 2 , Eleni Mina 1,3 , Harish Dharuri 1 , Lourdes Verdes- Montenegro 4 , Julian Garrido 4 , David de Roure 5 , Marco Roos 1,2 1 Leiden University Medical Centre, Leiden, The Netherlands, 2 University of Manchester, Manchester, United Kingdom, 3 Netherlands BioInformatics Centre, 4 Instituto de Astrofísica de Andalucía, Granada, Spain, 5 Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom Introduction Workflows are increasingly used in life sciences as a means to capture, share, and publish the steps of a computational analysis. Tools such as Taverna and Galaxy are widely adopted tools for creating and executing workflows, which can be published and shared using myExperiment, the largest public repository of scientific workflows. While myExperiment provides access to a large number of workflows from different domains such as genomics, biodiversity and astronomy, scientists find it often difficult (or impossible) to exploit those workflows by reusing or re-purposing them for their analyses. In a recent study [1] of Taverna workflows on myExperiment, the authors found that as much as 80% failed to be either executed, or to produce the same results. The main reasons for the workflows to break were the following: volatile third-party resources, missing example data that can help the scientist understand the main purpose of the workflow, missing execution environment, and insufficient metadata. We believe that all of these issues, with a possible exception of the first one, can be prevented by following a minimum set of guidelines at the workflow design stage. 10 Best Practices for workflow design Resulting workflows We followed the 10 best practices for workflow design when implementing a set of workflows in Taverna that perform a sophisticated text mining procedure referred to as concept profile matching [3] to functionally annotate single nucleotide polymorphisms. The set of workflows were published on myExperiment as a pack. Concept profile matching Fig. 1: Experiment sketch. Orange symbols are workflows, blue symbols are input data sets, yellow symbols are user- interactive decision points, green symbol is output comparison. Wf4Ever semantic technology supporting the Best Practices 1. The abstract workflow sketch was aggregated in the RO as a wf4ever:Document. A file description was added using the rdfs:comment property. 2. The subworkflows were described using the following properties: class: wfdesc:Workflow and relation: wfdesc:hasSubWorkflow. 3. The outputs after a workflow run were annotated with the wfprov:wasOutputFrom relation. Conclusions When following the 10 best practices for workflow design we were helped by the available Wf4Ever models and tooling We noticed several technology gaps where specific guidelines or tooling would be of great help. We suggest that something similar to unit testing would be useful for testing workflows and support workflow maintenance. Ongoing work on nanopublications might provide means to refer to different parts of a workflow or an RO in a more fine-grained way. The question on how to convince users that the best practices are beneficial in the long run remains. [1] Zhao, J., et al.: Why workflows break - Understanding and combating decay in Taverna workflows, accepted for the 8th IEEE International Conference on eScience 2012 [2] Belhajjame, K., et al.: Workflow-Centric Research Objects: First Class Citi-zens in Scholarly Discourse. In: SEPUBLICA: Future of scholarly commu-nication in the Semantic Web. 2012 [3] Jelier, R., et al.: Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinformatics. 8, 14 (2007) 1. Make an abstract workflow 2. Use modules 3. Think about the output 4. Provide input and output examples 5. Annotate 6. Make it executable from outside the local environment 7. Choose services carefully 8. Reuse existing workflows 9. Test and validate 10. Advertise and maintain Fig. 2: MyExperiment pack with references to the main workflows in the experimental sketch (figure 1). Fig. 3: Taverna workflow performing the “Match SNPs with concept profiles” task in the experimental sketch (figure 1). A (workflow-centric) Research Object (RO) is an artifact that bundles one or several workflows, the provenance of the results obtained by their enactments, other digital objects that are relevant to the experiment (papers, datasets, etc.), and annotations that semantically describe all these objects (fig 4). A model that can be used to describe these workflow-centric research objects is proposed in [2] (fig 5). This model is implemented as a suite of lightweight ontologies or vocabularies, building upon existing work. Fig. 4: RO high-level representation. We used the RO manager to create meta-data for the elements in a workflow-centric RO. Examples: Fig. 5: Overview of the RO model.

Best practices for workflow design: how to prevent workflow decay

Embed Size (px)

Citation preview

Page 1: Best practices for workflow design: how to prevent workflow decay

Scientific workflows

Best practices for workflow design: how to prevent

workflow decayKristina Hettne

1, Katy Wolstencroft

2, Khalid Belhajjame

2, Carole Goble

2, Eleni Mina

1,3, Harish Dharuri

1, Lourdes Verdes-

Montenegro4, Julian Garrido

4, David de Roure

5, Marco Roos

1,2

1Leiden University Medical Centre, Leiden, The Netherlands,

2University of Manchester, Manchester, United Kingdom,

3 Netherlands BioInformatics Centre,

4Instituto de Astrofísica de Andalucía, Granada, Spain,

5Oxford e-Research Centre, University

of Oxford, Oxford, United Kingdom

Introduction

Workflows are increasingly used in life sciences as a means to capture, share, and publish the steps of a

computational analysis. Tools such as Taverna and Galaxy are widely adopted tools for creating and

executing workflows, which can be published and shared using myExperiment, the largest public

repository of scientific workflows. While myExperiment provides access to a large number of workflows

from different domains such as genomics, biodiversity and astronomy, scientists find it often difficult (or

impossible) to exploit those workflows by reusing or re-purposing them for their analyses.

In a recent study [1] of Taverna workflows on myExperiment, the authors found that as much as 80%

failed to be either executed, or to produce the same results. The main reasons for the workflows to break

were the following: volatile third-party resources, missing example data that can help the scientist

understand the main purpose of the workflow, missing execution environment, and insufficient metadata.

We believe that all of these issues, with a possible exception of the first one, can be prevented by

following a minimum set of guidelines at the workflow design stage.

10 Best Practices for workflow designResulting workflowsWe followed the 10 best practices for workflow design when

implementing a set of workflows in Taverna that perform a

sophisticated text mining procedure referred to as concept

profile matching [3] to functionally annotate single nucleotide

polymorphisms. The set of workflows were published on

myExperiment as a pack.

Concept profile matching

Fig. 1: Experiment sketch. Orange symbols are workflows,

blue symbols are input data sets, yellow symbols are user-

interactive decision points, green symbol is output

comparison.

Wf4Ever semantic technology supporting the Best Practices

1. The abstract workflow sketch was aggregated in the RO

as a wf4ever:Document. A file description was added

using the rdfs:comment property.

2. The subworkflows were described using the following

properties: class: wfdesc:Workflow and relation:

wfdesc:hasSubWorkflow.

3. The outputs after a workflow run were annotated with

the wfprov:wasOutputFrom relation.

Conclusions

• When following the 10 best practices for workflow design we were helped by the available Wf4Ever models and tooling

• We noticed several technology gaps where specific guidelines or tooling would be of great help.

• We suggest that something similar to unit testing would be useful for testing workflows and support workflow maintenance.

• Ongoing work on nanopublications might provide means to refer to different parts of a workflow or an RO in a more fine-grained way.

• The question on how to convince users that the best practices are beneficial in the long run remains.

[1] Zhao, J., et al.: Why workflows break - Understanding and combating decay in Taverna workflows, accepted for the 8th IEEE International Conference on eScience 2012

[2] Belhajjame, K., et al.: Workflow-Centric Research Objects: First Class Citi-zens in Scholarly Discourse. In: SEPUBLICA: Future of scholarly commu-nication in the Semantic Web. 2012

[3] Jelier, R., et al.: Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinformatics. 8, 14 (2007)

1. Make an abstract workflow

2. Use modules

3. Think about the output

4. Provide input and output examples

5. Annotate

6. Make it executable from outside the local environment

7. Choose services carefully

8. Reuse existing workflows

9. Test and validate

10. Advertise and maintain

Fig. 2: MyExperiment pack with references to the main

workflows in the experimental sketch (figure 1).

Fig. 3: Taverna workflow performing the “Match SNPs with

concept profiles” task in the experimental sketch (figure 1).

A (workflow-centric) Research Object (RO) is an artifact that

bundles one or several workflows, the provenance of the results

obtained by their enactments, other digital objects that are

relevant to the experiment (papers, datasets, etc.), and

annotations that semantically describe all these objects (fig 4).

A model that can be used to describe these workflow-centric

research objects is proposed in [2] (fig 5). This model is

implemented as a suite of lightweight ontologies or

vocabularies, building upon existing work.

Fig. 4: RO high-level representation.

We used the RO manager to create meta-data for the elements in a

workflow-centric RO. Examples:

Fig. 5: Overview of the RO model.