Introduction
Biological Sequence Analysis
Scientific workflow management systems
The Galaxy framework
Conclusion
Bibliography
References
Assessing Galaxy’s ability to express scientific workflows in bioinformatics
Peter van Heusden and Alan Christoffels
South African National Bioinformatics Institute, University of the Western Cape
Bellville, South Africa
10th FASTAR/Espresso Workshop 2013 / 4-6 November 2013
What is bioinformatics?
Bioinformatics is the discipline of solving problems in biology and medicine using computational resources.
Within bioinformatics, biological sequence analysis (BSA) describes those analyses that “infer biological information from sequence alone” (Durbin, 1998).
The cost of biological sequence analysis has two parts:
1 Cost of acquiring sequence
2 Cost of analysing sequence
Cost of acquiring sequence
[Figure: DNA sequencing costs (Wetterstrand, 2013)]
Cost of analysing sequence
The “sudden reliance on computation has created an ‘informatics crisis’ for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging” (Goecks et al., 2010)
As the cost of sequencing plummets, analysis faces two challenges:
1 Growing data volume demands more sophisticated computational approaches
2 Translating biological questions into computational workflows remains difficult
How do we do bioinformatics?
Given a set of protein sequences from species A, which genes from species B produce similar proteins, and where are these genes located on the genome of B?
Analysis proceeds (Stevens et al., 2001) using:
1 Collections of data objects
2 Transformers that generate new collections (e.g. transform a collection of proteins into a collection of genome regions that they match)
3 Filters (e.g. discard low-quality matches to the genome)
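The three ingredients above can be sketched in a few lines of Python. This is only an illustration: the `Match` record, the `TOY_MATCHES` lookup table (standing in for a real sequence search tool) and the 0.8 score cut-off are all assumptions made for the example, not part of the original analysis.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Match:
    protein_id: str
    region: str   # hypothetical genome region label on species B
    score: float  # hypothetical match quality between 0 and 1


def transform(collection: Iterable, transformer: Callable) -> List:
    """Transformer: build a new collection from the results for every item."""
    out = []
    for item in collection:
        out.extend(transformer(item))
    return out


def keep(collection: Iterable, predicate: Callable) -> List:
    """Filter: keep only the items that satisfy the predicate."""
    return [item for item in collection if predicate(item)]


# Toy stand-in for a search tool: pre-computed matches per protein.
TOY_MATCHES = {
    "protA1": [Match("protA1", "chr1:100-400", 0.95),
               Match("protA1", "chr2:50-350", 0.40)],
    "protA2": [Match("protA2", "chr3:900-1200", 0.88)],
}


def find_matches(protein_id: str) -> List[Match]:
    return TOY_MATCHES.get(protein_id, [])


proteins = ["protA1", "protA2", "protA3"]           # collection of data objects
matches = transform(proteins, find_matches)         # proteins -> genome regions
good = keep(matches, lambda m: m.score >= 0.8)      # discard low-quality matches
regions = [m.region for m in good]
```

Each stage consumes one collection and produces another, which is the shape that the workflow systems discussed later try to capture explicitly.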
How we do bioinformatics (2)
Data collections typically exist as (compressed) files
Bioinformatics tools typically are command line executables that accept and generate files (often using ad-hoc formats)
Scripting languages (Perl, Python) are used to compose workflows; APIs are often used for reading/writing file formats
1 Workflow enactment often involves manual steps and is closely tied to the execution environment
2 Workflows are neither easily reproducible nor reusable
Scientific workflow management systems
Scientific workflow management systems (SciWMS) have been proposed as an alternative to current script-based approaches to analysis workflow.
SciWMSs “provide a high-level declarative way of specifying what a particular in silico experiment modelled by a workflow is set to achieve, not how it will be executed.” (Taverna project, 2009)
Workflow descriptions resemble dataflow languages (McPhillips et al., 2009)
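To make the "what, not how" distinction concrete, here is a minimal sketch of a declarative workflow description and a generic engine that derives the execution order from the dataflow. The `WORKFLOW` dictionary layout and the `run` function are invented for this illustration and do not correspond to the format of any particular SciWMS; `graphlib` is part of the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter

# Declarative description: each step names what it computes and which
# steps it consumes. Nothing here says in which order steps must run.
WORKFLOW = {
    "load":   {"func": lambda: ["ACGT", "AC", "GGGTTT"], "needs": []},
    "filter": {"func": lambda seqs: [s for s in seqs if len(s) >= 4],
               "needs": ["load"]},
    "count":  {"func": lambda seqs: len(seqs), "needs": ["filter"]},
    "total":  {"func": lambda seqs: sum(len(s) for s in seqs),
               "needs": ["filter"]},
}


def run(workflow):
    """Engine: derive an execution order from the dataflow graph and run it."""
    graph = {name: set(step["needs"]) for name, step in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        step = workflow[name]
        inputs = [results[dep] for dep in step["needs"]]
        results[name] = step["func"](*inputs)
    return results


results = run(WORKFLOW)
```

Because the engine only sees the graph, it is free to choose any order that respects the dependencies, for example running `count` and `total` side by side.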
The promise of SciWMSs
In addition to workflow specification, SciWMSs sometimes offer:
Types that model objects of scientific domain
Recording of provenance of data objects
Execution of scientific workflows on diverse computing environments (desktop, cluster, grid, cloud)
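Domain types and provenance recording can be sketched together. In this illustration every data object carries a domain type label and a link to the objects it was derived from; the `DataObject` class, the `apply_step` helper and the toy "transcribe" step are assumptions made for the example, not features of an existing system.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class DataObject:
    value: Any
    kind: str                               # domain type label, e.g. "nucleotide"
    produced_by: str = "input"              # name of the step that created it
    parents: Tuple["DataObject", ...] = ()  # objects it was derived from


def apply_step(name, func, out_kind, *inputs):
    """Run func on the input values and record where the result came from."""
    value = func(*(obj.value for obj in inputs))
    return DataObject(value, out_kind, name, tuple(inputs))


def lineage(obj):
    """Return the names of the steps that led to obj, oldest first."""
    steps = []
    for parent in obj.parents:
        steps.extend(lineage(parent))
    steps.append(obj.produced_by)
    return steps


dna = DataObject("ATGGCC", "nucleotide")
# Toy transcription: replace T with U (a deliberate simplification).
rna = apply_step("transcribe", lambda s: s.replace("T", "U"), "rna", dna)
length = apply_step("measure", len, "integer", rna)
history = lineage(length)
```

Here provenance is a property of each data object rather than of a file on disk, so it can be queried at whatever granularity the objects are modelled.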
SciWMSs for bioinformatics
Many workflow systems have been proposed for use in bioinformatics: Taverna, Kepler, Triana, Bioopera, Mobyle, BiosFlow, bpipe
Some workflow features are also available in Galaxy
Support for workflow patterns
Scientific Data Modelling
Workflow representation and use
What is Galaxy?
Galaxy emerged in 2004/5 as a web interface to bioinformatics tools and data
Galaxy is becoming a common platform through which to “publish” tools and data
More than 30 known public Galaxy servers
36 000 users on the main public Galaxy server, 0.8 Pb of data
Galaxy as an open-source project
Galaxy consists of c. 250 000 lines of (mostly Python) code
Core team includes 15 developers spread across 4 different institutes
Development is open source and “out in the open” with code hosted on BitBucket, development planning on Trello and mailing lists
Galaxy I
Galaxy II
Galaxy workflow management features
Galaxy allows composition of workflows defined as a series of tasks and related dataflow
Allows execution of workflows on the local machine or via various job schedulers
Data objects generated in Galaxy have associated provenance information
Limitations of Galaxy as a SciWMS
Limited support for scientific workflow patterns
Type refers to format of data items
Provenance is recorded as attribute of data files
Workflows are not first class objects
Analysis view focuses on individual datasets
Execution engine schedules tasks (with limited support for task collections)
Galaxy can be enriched by drawing on prior research on SciWMSs
Scientific workflow patterns
Analysis of scientific workflows has yielded a set of design patterns used in workflows (Yildiz et al., 2009)
Galaxy workflow language supports sequential dataflow, parallel split and synchronisation
Tool definition language has recently been extended to support multiple instances of task (not workflow) execution with a priori runtime knowledge
Tool authors can signal that input to tool can be split for parallel execution
No interface between workflow authors and multiple instance support
Support for cancel of individual task but not entire workflow
No support for triggering new thread of activity (restart)
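The three patterns named above as supported (sequential dataflow, parallel split, synchronisation) can be sketched with the standard library. This shows the patterns in plain Python, not Galaxy's workflow language; the function names and the toy GC-content calculation are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(seq):
    """Fraction of G and C characters in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def length(seq):
    return len(seq)


def summarise(gc, n):
    return {"gc": gc, "length": n}


def run(raw):
    # Sequential dataflow: cleaning must finish before anything else starts.
    cleaned = raw.strip().upper()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Parallel split: two independent branches consume the same input.
        gc_future = pool.submit(gc_content, cleaned)
        len_future = pool.submit(length, cleaned)
        # Synchronisation: the summary step waits for both branches.
        return summarise(gc_future.result(), len_future.result())


summary = run(" acgtgg\n")
```

Cancelling the whole run, or restarting a branch, would need the engine to track the workflow as a unit rather than as separate tasks.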
Scientific workflow patterns (2)
No support for exclusive choice (e.g. execute different dataflow path based on different input)
No support for sub-workflows
Galaxy workflow language is “abstraction hating” (Green and Petre, 1996)
Leads to workflow diagrams resembling a bowl of spaghetti for anything but the simplest cases
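For comparison, the two missing patterns are easy to express once workflows are ordinary values that can be named and nested. The sketch below is plain Python, not a proposal for Galaxy syntax; `compose`, `choose` and the crude `is_protein` test are hypothetical helpers invented for the illustration.

```python
def compose(*steps):
    """Sub-workflow: bundle steps into one reusable, nameable unit."""
    def workflow(data):
        for step in steps:
            data = step(data)
        return data
    return workflow


def choose(predicate, if_true, if_false):
    """Exclusive choice: route the data down exactly one branch."""
    def step(data):
        return if_true(data) if predicate(data) else if_false(data)
    return step


def is_protein(seq):
    # Crude toy heuristic: any letter outside the nucleotide alphabet.
    return any(ch not in "ACGTUN" for ch in seq)


clean = compose(str.strip, str.upper)  # sub-workflow reused below
describe = compose(
    clean,
    choose(is_protein,
           lambda s: ("protein", len(s)),
           lambda s: ("nucleotide", len(s))),
)

dna_result = describe(" acgt ")
protein_result = describe("mkvl")
```

Naming `clean` once and reusing it is the kind of abstraction that keeps a larger diagram from sprawling.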
The Galaxy type system
Galaxy types represent file types
File type does not map simply to semantics
Collection types are not supported, although some types are “splittable” to allow parallel task execution
Workflow parameters are not supported via type system
Cannot guarantee that workflow is well-formed
Provenance recording is coarse-grained
What will happen if we update a single element of an input data collection?
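The gap between file types and semantics can be shown with a small well-formedness check. In this sketch the same two-tool workflow is checked twice, once with ports typed by file format and once with ports typed by what the data means; the `Tool` class, the tool names and the type labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Port = Tuple[str, str]    # (tool name, port name)
Link = Tuple[Port, Port]  # (producing port, consuming port)


@dataclass
class Tool:
    name: str
    inputs: Dict[str, str]   # port name -> type label
    outputs: Dict[str, str]  # port name -> type label


def check(tools: List[Tool], links: List[Link]) -> List[str]:
    """Return one message per link whose output and input types differ."""
    by_name = {tool.name: tool for tool in tools}
    errors = []
    for (src, out_port), (dst, in_port) in links:
        produced = by_name[src].outputs[out_port]
        expected = by_name[dst].inputs[in_port]
        if produced != expected:
            errors.append(
                f"{src}.{out_port} ({produced}) -> {dst}.{in_port} ({expected})"
            )
    return errors


links = [(("translate", "proteins"), ("align_dna", "reads"))]

# Ports typed by file format: both sides are "fasta", so the link passes.
format_tools = [
    Tool("translate", {"dna": "fasta"}, {"proteins": "fasta"}),
    Tool("align_dna", {"reads": "fasta"}, {"hits": "tabular"}),
]
# Ports typed by meaning: protein output feeding a nucleotide input is caught.
semantic_tools = [
    Tool("translate", {"dna": "nucleotide"}, {"proteins": "protein"}),
    Tool("align_dna", {"reads": "nucleotide"}, {"hits": "alignment"}),
]

format_errors = check(format_tools, links)
semantic_errors = check(semantic_tools, links)
```

With format-only types the checker accepts a workflow that feeds proteins to a DNA aligner; with domain types the same check rejects it before anything runs.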
Science questions vs execution plans
Type system could model scientific domain objects (e.g. protein and nucleotide sequences) but ...
Bioinformatics tools do not support standard formats or support standard formats with quirks
Not clear what information to save from tool output
Experienced bioinformaticists want opportunity to review “raw output” to explore factors that underpin confidence in analysis
Need to support both recording and reporting of workflow output
Both recording “raw” output trace and reporting provenance of scientific domain objects are necessary features for SciWMS
Workflow execution in Galaxy
Internally workflows are expanded into collections of tasks at execution time
Tasks are executed by backend classes: either local or via scheduler
Execution parameters can be set by “dynamic job runners”
Allows e.g. resource requirements of job to be signalled to scheduler
Configured using a combination of XML and Python code maintained by Galaxy administrator
Workflow execution leaves no visible trace in the user interface
At runtime execution shows individual jobs running
Data objects are grouped by “history”, not associated with a workflow
No support for re-execution of part of workflow
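The idea behind setting execution parameters dynamically can be sketched as a rule function that inspects a job and returns hints for the scheduler. This is a generic illustration only: the `Job` class, the queue names and the size thresholds are hypothetical, and the function does not follow the interface that Galaxy itself expects for such rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job description, not Galaxy's internal job object."""
    tool: str
    input_bytes: int


GIB = 1024 ** 3
MIB = 1024 ** 2


def choose_resources(job: Job) -> dict:
    """Map a job to scheduler hints based on the tool and its input size."""
    if job.tool == "assembler" or job.input_bytes > 10 * GIB:
        return {"queue": "bigmem", "cores": 16, "memory_gb": 128}
    if job.input_bytes > 100 * MIB:
        return {"queue": "cluster", "cores": 4, "memory_gb": 16}
    return {"queue": "local", "cores": 1, "memory_gb": 2}


small = choose_resources(Job("fastq_filter", 5 * MIB))
medium = choose_resources(Job("aligner", 2 * GIB))
large = choose_resources(Job("assembler", 1 * MIB))
```

A rule like this sees one job at a time, which matches the point above that the engine schedules tasks rather than whole workflows.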
Scope for workflow optimisation
Workflows are dataflow graphs (Johnston et al., 2004)
Knowledge of inputs and types can be used to plan execution efficiently, e.g. pipeline tasks and exploit opportunities for streaming
Collections of data objects and parameter sets can be exploited for automatic parallel enactment of tasks and sub-workflows
Data collections and workflows provide structures for nesting of provenance information
Knowledge of data provenance could facilitate lifecycle of data products: kept for re-use or discarded as “intermediate products”
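Two of these optimisations, streaming between tasks and automatic parallel enactment over a collection, can be sketched as follows. The step functions and the `enact_over_collection` helper are invented for the illustration and assume each step can process records one at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def read_records(lines):
    """Yield non-empty records one at a time instead of building a list."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def to_upper(records):
    for record in records:
        yield record.upper()


def longer_than(records, n):
    for record in records:
        if len(record) > n:
            yield record


def stream(lines):
    """Pipelined workflow: each record flows through all steps in turn,
    so no intermediate collection is ever materialised."""
    return longer_than(to_upper(read_records(lines)), 3)


def enact_over_collection(workflow, collection):
    """Run the same workflow on every element of a collection in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda item: list(workflow(item)), collection))


inputs = [["acgt", "", "ac"], ["ggggg", "ttt", "ccccc"]]
outputs = enact_over_collection(stream, inputs)
```

An engine that knows the input is a collection can apply this fan-out itself, without the workflow author writing any of the parallel code.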
Conclusion
Bioinformatics faces an “informatics crisis” as cost to generate sequence has decreased while cost to compose or reproduce analysis has remained high
Galaxy has emerged as a popular interface to bioinformatics tools and data with workflow management features
Insight from prior research on SciWMSs suggests areas for enhancement:
Support for additional workflow patterns
Extension of type system with support for biological types, collections and parameter sets
Improvement of workflow execution through treating workflows as first class objects with associated optimisation of execution and provenance storage
Currently being pursued as a research agenda at SANBI
Thanks
Workflows for biological sequence analysis are discussed by the “Pipelines collaboration”
Research on SciWMS supported by the MRC and Prof Christoffels
Professor Alan Christoffels
Bibliography I
R. Durbin. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Apr. 1998. ISBN 9780521629713.
J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8), 2010.
T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a ‘cognitive dimensions’ framework. Journal of Visual Languages and Computing, 7:131–174, 1996.
W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages. ACM Computing Surveys, 36(1):1–34, Mar. 2004.
T. McPhillips, S. Bowers, D. Zinn, and B. Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems, 25(5):541–551, May 2009.
R. Stevens, C. Goble, P. Baker, and A. Brass. A classification of tasks in bioinformatics. Bioinformatics, 17(2):180–188, Feb. 2001.
Taverna project. Why use workflows?, 2009. URL http://www.taverna.org.uk/introduction/why-use-workflows/.
Bibliography II
K. Wetterstrand. DNA sequencing costs: Data from the NHGRI genome sequencing program (GSP), 2013. URL http://www.genome.gov/sequencingcosts/.
U. Yildiz, A. Guabtni, and A. H. H. Ngu. Towards scientific workflow patterns. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS ’09, pages 13:1–13:10, New York, NY, USA, 2009. ACM.