DESCRIPTION
Presentation at BOSC2012 by Wolstencroft K - Workflows on the Cloud: scaling for national service
Workflows on the Cloud: Scaling for National Service
Katy Wolstencroft, Robert Haines, Helen Hulme, Mike Cornell, Shoaib Sufi, Andy Brass, Carole Goble
University of Manchester, UK
Madhu Donepudi, Nick James
Eagle Genomics Ltd, UK
Motivation: Workflows for Diagnostics
NHS genetic testing, e.g. colon disease
Annotation of SNPs in patient data, ready for interpretation by a clinician.
Diagnostic Testing Today
Purify DNA.
PCR exons of relevant genes (MLH1, MSH2, MSH6).
Sequence, identify variants, classify (pathogenic, not pathogenic, unknown significance, etc.).
Write report to clinician.
Diagnostic Testing Tomorrow (or later today) uses whole genome sequencing
Next Gen Seq data
Variation data
ANNOTATE, FILTER, DISPLAY
New problem: How do we classify all the variants that we discover?
Taverna Workflows
Sophisticated analysis pipelines
A set of services to analyse or manage data (either local or remote)
Workflows run through the workbench or via a server
Automation of data flow through services
Control of service invocation
Iteration over data sets
Provenance collection
Extensible and open source
Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W729-32.
Freely available, open source
Current version 2.4
80,000+ downloads across versions
Part of the myGrid Toolkit
Taverna: http://www.taverna.org.uk/
Windows/Mac OS X/Linux/Unix
SNP annotation
Annotation task
Location, Gene, Transcript
Present in public databases, dbSNP etc.
Frequency in e.g. 1000 Genomes data
Conservation data (cross species)
Workflows are good for collecting and integrating data from a variety of sources, into one place
Variant Classification
[Diagram labels, order reconstructed from flattened text]
SNP
Synonymous
Non-synonymous: Missense
Nonsense: premature stop codon
Base insertion, causing a frameshift
Effects on splicing
Effects on function or splicing
SNP Filtering / Triage
Which SNPs are the most important?
Reduction of 80K data points to those with potential clinical significance.
Criteria
Reduce to (disease)-specific gene list
Sense < Missense < Stop codon etc.
Based on prediction tool scores
Frequency in population (based on 1000 Genomes data etc.) (high frequency implies non-deleterious)
Conservation across species (implies that change is deleterious)
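The triage criteria above can be sketched in a few lines of Python. This is only an illustration, not the presenters' implementation: the `Variant` fields, the `SEVERITY` ranking, the function name `triage`, and the 1% frequency cut-off are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical ranking that follows the slide's ordering:
# sense (synonymous) < missense < stop codon.
SEVERITY = {"synonymous": 0, "missense": 1, "stop_gained": 2}


@dataclass
class Variant:
    gene: str
    consequence: str
    population_frequency: float  # e.g. an allele frequency from 1000 Genomes
    conserved: bool              # True if the position is conserved across species


def triage(variants, gene_list, max_frequency=0.01, min_severity=1):
    """Keep rare, sufficiently severe variants in the gene list, most severe first.

    The thresholds are illustrative defaults, not clinical recommendations.
    """
    kept = [
        v for v in variants
        if v.gene in gene_list
        and v.population_frequency <= max_frequency
        and SEVERITY.get(v.consequence, 0) >= min_severity
    ]
    # Order by severity (descending), then conserved positions, then rarity.
    return sorted(
        kept,
        key=lambda v: (-SEVERITY.get(v.consequence, 0), not v.conserved, v.population_frequency),
    )


if __name__ == "__main__":
    candidates = [
        Variant("MLH1", "missense", 0.001, True),
        Variant("MLH1", "synonymous", 0.001, True),
        Variant("MSH2", "stop_gained", 0.0005, False),
        Variant("MSH6", "missense", 0.2, True),
    ]
    for v in triage(candidates, {"MLH1", "MSH2", "MSH6"}):
        print(v.gene, v.consequence)
```

Prediction tool scores are left out of the sketch; they would enter as one more field on `Variant` and one more term in the sort key.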
Workflow Provenance
Record inferences in clinical decisions
What were the parameters used to build the dataset?
What versions of databases, genome assembly, machine?
Where does each piece of evidence for/against pathogenicity originate?
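One way to picture the provenance questions above is as a record stored alongside each result. The sketch below is an assumption-laden illustration, not Taverna's provenance model: the class names, field names, and example values are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Evidence:
    source: str                # where the evidence came from, e.g. a database name
    version: str               # version or build of that source
    supports_pathogenic: bool  # evidence for (True) or against (False) pathogenicity
    detail: str


@dataclass
class ProvenanceRecord:
    workflow: str
    parameters: dict           # parameters used to build the dataset
    genome_assembly: str
    evidence: list = field(default_factory=list)

    def add_evidence(self, source, version, supports_pathogenic, detail):
        self.evidence.append(Evidence(source, version, supports_pathogenic, detail))

    def to_json(self):
        # asdict() recurses into the nested Evidence dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    record = ProvenanceRecord("snp-annotation", {"max_frequency": 0.01}, "GRCh37")
    record.add_evidence("dbSNP", "example-build", False, "variant already catalogued")
    print(record.to_json())
```

Serialising the record to JSON keeps it readable by a clinician reviewing how a classification was reached.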
Infrastructure Requirements
Execute analysis workflows
Accessible to clinicians and genetic testers
Cope with expanding demands on compute
Provide a secure environment
Collect provenance
Architecture overview
[Architecture diagram labels: Web interface (Input SNPs, Results); Storage (S3); Ensembl (MySQL); Cache (S3); Workflow engine orchestrator (e-Hive, Taverna, other); several Taverna Server instances, each with application-specific tools and Web Services]
All user interaction via web interface
User data stored in the Cloud
Data for all tools and Web Services stored in the Cloud
Unified access to different workflow engines with our common REST API
Tools and Web Services for each workflow are installed together for easy replication
Workflow engine orchestration
Orchestrator is workflow executor agnostic
Uses common API to:
List workflows
Configure runs
Start runs
Manage current runs (status, progress)
Delete runs
[Diagram labels: Workflow engine orchestrator; Common REST API; Taverna Interface; e-Hive Interface; Engine-specific APIs; Taverna; e-Hive; Cache]
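The idea of an executor-agnostic orchestrator sitting above engine-specific interfaces can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `EngineInterface`, `InMemoryEngine`, and `Orchestrator` are hypothetical names, the in-memory engine stands in for real adapters that would call Taverna Server or e-Hive, and run configuration is folded into `start_run` for brevity.

```python
import itertools
from abc import ABC, abstractmethod


class EngineInterface(ABC):
    """Hypothetical per-engine adapter hiding an engine-specific API."""

    @abstractmethod
    def list_workflows(self): ...

    @abstractmethod
    def start(self, workflow, inputs): ...

    @abstractmethod
    def status(self, engine_run_id): ...

    @abstractmethod
    def delete(self, engine_run_id): ...


class InMemoryEngine(EngineInterface):
    """Stand-in engine used only to make the sketch runnable."""

    def __init__(self, workflows):
        self._workflows = list(workflows)
        self._runs = {}
        self._ids = itertools.count(1)

    def list_workflows(self):
        return list(self._workflows)

    def start(self, workflow, inputs):
        if workflow not in self._workflows:
            raise KeyError(workflow)
        run_id = next(self._ids)
        self._runs[run_id] = "running"
        return run_id

    def status(self, engine_run_id):
        return self._runs[engine_run_id]

    def delete(self, engine_run_id):
        del self._runs[engine_run_id]


class Orchestrator:
    """Common API: callers never see which engine executes a run."""

    def __init__(self, engines):
        self._engines = dict(engines)
        self._runs = {}  # orchestrator run id -> (engine name, engine run id)

    def list_workflows(self):
        return {name: engine.list_workflows() for name, engine in self._engines.items()}

    def start_run(self, engine, workflow, inputs):
        engine_run_id = self._engines[engine].start(workflow, inputs)
        run_id = f"{engine}-{engine_run_id}"
        self._runs[run_id] = (engine, engine_run_id)
        return run_id

    def status(self, run_id):
        engine, engine_run_id = self._runs[run_id]
        return self._engines[engine].status(engine_run_id)

    def delete_run(self, run_id):
        engine, engine_run_id = self._runs.pop(run_id)
        self._engines[engine].delete(engine_run_id)
```

In a deployment along the lines of the slides, each `Orchestrator` method would presumably be exposed as a REST endpoint, and adding another engine would mean writing one more adapter rather than changing the web interface.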
Additional Taverna Functionality
Integration with Cloud infrastructure
AWS first
Read/write files securely to S3
Start and stop Cloud instances if required
Tool and Web Service scaling
Self-scaling
Released as part of Taverna 3
The user's view
Curated set of workflows
Designed, built and tested by domain experts
Quality assurance tested (if appropriate)
Workflows are presented as applications
The workflows themselves are hidden
Configured and run via a web interface
All user data stored securely in the Cloud
User separation
Workflows as a Service
Web interface: Overview
Upload input data
Configure workflow runs with input parameters, uploaded data, and reused output data
Start workflow runs
Monitor workflow runs
View results preview
Download complete results
Web interface: Getting started
Web interface: Creating a Run
Web interface: Checking run progress
A Typical Workflow
Parse files from SNP calling machines
Annotate SNPs
Predict effects (BioMart, VEP, PolyPhen)
Workflow as a Service
The workflow IS the service
Run restricted sets of Taverna workflows in the cloud
Connects to other cloud-based resources: storage, tools, etc.
Users can tweak parameters, but not design their own workflows
Web portal access for scientists
Data passed by reference instead of file
Pay as you go: cheap at the point of use
Elastic and available now
Acknowledgements/Partners
University of Manchester
Eagle Genomics
Technology Strategy Board 100932 - Cloud Analytics for Life Sciences
National Health Service
Amazon Web Services