Creating and Sharing Re-usable Workflows in Cardiovascular Research: Lessons learned using Taverna
Ravi Madduri
University of Chicago, Argonne National Laboratory
www.ci.anl.gov, www.ci.uchicago.edu


About me
  • Research Fellow at the Computation Institute, University of Chicago
  • Lead architect for Workflow technologies in the caBIG project
  • Workflow Working Group Chair and a key person in the BIRN project
  • Interested in informatics, applications of high-throughput data transfer, and computing in biomedical informatics

And..

Agenda
  • Introduction to Service Oriented Science (SoS)
  • Introduction to caBIG as an example of SoS
  • Introduce caGrid as an enabler of the SoS vision
  • Introduce Workflow concepts
  • Talk about our implementation using Taverna
  • Show a few Taverna workflows, including the AutoQRS workflow from CVRG
  • Lessons learned and future directions

Service-Oriented Science
People create services (data, code, instr.) which I discover (& decide whether to use) & compose to create a new function ... & then publish as a new service.

I find someone else to host services, so I don't have to become an expert in operating services & computers! I hope that this someone else can manage security, reliability, scalability, ... !!

Service-Oriented Science, Science, 2005

caBIG Goal and Vision
caBIG is a virtual web of interconnected data, individuals and organizations that redefines how research is conducted, care is provided, and patients/participants interact with the biomedical enterprise.
  • Connect the cancer research community through a shareable, interoperable infrastructure
  • Deploy and extend standard rules and a common language to more easily share information
  • Build or adapt tools for collecting, analyzing, integrating and disseminating information associated with cancer research and care

caGrid: caBIG function dimensions
  • Clinical Data and Trials Management
  • Biospecimen Management
  • In Vivo Imaging
  • Molecular Characterization

Clinical Data and Trials Management:
  • Can track clinical trial registrations
  • Facilitate automatic capture of clinical laboratory data
  • Manage reports describing adverse events during clinical trials
  • Facilitate data sharing within an institution or across a multi-site trial
  • Rapid integration of evolving science, such as patient reported outcomes, or evolving research areas such as epigenetics

Biospecimens can include a range of biological materials: tumor biopsies, bone marrow, blood, and others.

While biospecimens may be used primarily for diagnostic purposes or as a part of a treatment intervention, in some cases, patients may donate these resources for further research as well.

The consent process when donating biospecimens is critical, as it drives how these resources may or may not be used for future research.

In Vivo Imaging supports both diagnosis and the monitoring of the effectiveness of treatment, using less invasive methods than other techniques. Imaging is a vital tool to support a variety of different kinds of research studies. The transmittal of imaging data among specialists and institutions remains cumbersome, for both patients and health care providers. Shared standards across imaging tools allow data to be more easily compared across imaging events and between researchers.

High throughput methods and sophisticated analysis methods allow for the combination of proteomics, gene expression, and other basic research data. Researchers need the ability to submit and annotate microarray data, integrate data from multiple providers, and permit analysis and visualization of data. Sophisticated analyses involving interdisciplinary teams of investigators require interoperable data exchange and analysis.

What is caGrid?
  • Biomedical applications that share data all have common needs for syntactic and semantic interoperability
  • caGrid is a software toolkit aimed at software developers creating Grid applications

Because biomedical applications that share data all have common needs for syntactic and semantic interoperability, caGrid provides common services that support these needs. caGrid is a toolkit aimed at software developers creating Grid applications. When creating Grid applications, software developers use standard caGrid components to add strong security, enable syntactic interoperability, and achieve semantic interoperability to provide comprehensive access to data and analytical resources.

caGrid provides the GAARDS toolkit as a standard security platform. Security features provided by GAARDS allow Grid application developers to add secure login capabilities to their applications. Grid application developers also use GAARDS components to securely share applications and data.

Grid application developers use caGrid metadata services to add semantic information to all services. Providing semantic information is a vital part of ensuring that collaborators accessing your data can correctly interpret the meaning of your data. An example use of semantic information is to specify the units of measure for a data value.

The Grid is a trusted network that supports collaborative biomedical research. Getting on the Grid involves joining that trusted network. Each service joins the Grid by applying for and utilizing a credential issued by a trusted authority.

caGrid provides
  • Metadata services that add semantic information to all Grid services
  • The GAARDS toolkit, a standard security platform
  • Introduce: the "Eclipse" for service development
  • Index Service: a service registry for advertisement and discovery of capabilities

caGrid: nuts and bolts

This diagram depicts the categories of services that can be found on the Grid. Most of the services shown in this diagram are included in the caGrid software distribution. Examples include security services that enable secure data sharing and metadata services that provide semantic information. Community-provided services share data and analytical resources with collaborators over the Grid. Most often, these services are added to the Grid by adopting caBIG applications that include Grid services. After installing and configuring a caBIG application, collaborators use the provided Grid service to securely consume data and analytical resources provided by the application.

A scientific workflow precisely defines a multi-step procedure to seamlessly integrate and streamline local and remote heterogeneous computational and data resources to perform in silico scientific exploration.

Workflow Requirements

Service discovery

Data access

Service interaction

Security enforcement

Knowledge sharing


Overview of caGrid Workflow

[Diagram labels: caGrid (data, instruments, computation resource); Virtualization; Security; Connectivity; Discovery, Composition, Orchestration, Analysis; Community (reuse, generate)]

  • Workflow as consumer: easily reuse services for complex experiments.
  • Workflow as contributor: workflow as best practice, wrapped as services.
  • Workflow providing RoI for SOA


caGrid Workflow Suite
  • Service discovery
  • Data access
  • Service interaction
  • Security enforcement
  • Knowledge sharing

The caBIG Workflow System

[Diagram labels: caGrid; Discovery, Composition, Execution, Reuse; Community (reuse, generate)]

  • Service discovery based on cancer research metadata
  • Data-flow modeling flavor
  • caGrid activity: state management (WSRF), security (GSI)
  • Implicit iteration: handle parallel execution
  • WSRF and GSI enforcement
  • A Facebook for caGrid workflows
  • Workflow Execution Service
  • Workflows in caGrid Portal
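
The "implicit iteration" item above can be illustrated with a small sketch. The idea, as commonly described for Taverna, is that a step expecting a single value is invoked once per element when it receives a list, and those invocations may run in parallel. The Python below only imitates that behavior with a thread pool; `invoke_service` and `run_with_implicit_iteration` are hypothetical names and are not part of Taverna or caGrid.

```python
from concurrent.futures import ThreadPoolExecutor


def invoke_service(sample_id):
    """Hypothetical stand-in for a service call that takes one value."""
    return f"result-for-{sample_id}"


def run_with_implicit_iteration(service, value, max_workers=4):
    """Call `service` once for a scalar, or once per element for a list.

    List elements are processed in parallel; results keep the input order.
    """
    if isinstance(value, list):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(service, value))
    return service(value)


if __name__ == "__main__":
    print(run_with_implicit_iteration(invoke_service, "A1"))
    print(run_with_implicit_iteration(invoke_service, ["A1", "B2", "C3"]))
```

The workflow author writes the step once for a single input; the iteration over list inputs is supplied by the engine rather than by an explicit loop in the workflow.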

Semantic Service Discovery
Semantic search queries the Index Service for registered caGrid services matching various search criteria:
  • Service name, inputs, outputs, research center, class names, concept codes, etc.

Service metadata
Types of query:
  • String based
  • Property based
  • Semantic based

caBIG services palette
  • As a result of semantic search or direct adding, caBIG services appear in Taverna's Service Panel
  • Ready to be dragged and dropped into caGrid workflows
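
The three query types named on the Semantic Service Discovery slide can be contrasted with a small in-memory sketch. This is not the caGrid Index Service API: the `ServiceEntry` class, the registry contents, and the concept code strings are all hypothetical placeholders chosen only to show how the query styles differ.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """Hypothetical, simplified view of one registered service's metadata."""
    name: str
    research_center: str
    concept_codes: set = field(default_factory=set)


# Placeholder registry; names, centers, and codes are invented for illustration.
REGISTRY = [
    ServiceEntry("MicroarrayDataService", "Center A", {"CODE-ARRAY"}),
    ServiceEntry("LymphomaPredictionService", "Center B", {"CODE-LYMPHOMA"}),
]


def search_by_string(registry, text):
    """String-based query: case-insensitive substring match on the name."""
    return [s for s in registry if text.lower() in s.name.lower()]


def search_by_property(registry, prop, value):
    """Property-based query: exact match on one metadata field."""
    return [s for s in registry if getattr(s, prop) == value]


def search_by_concept(registry, code):
    """Semantic query: match on a concept code attached to the service."""
    return [s for s in registry if code in s.concept_codes]


if __name__ == "__main__":
    print([s.name for s in search_by_string(REGISTRY, "lymphoma")])
    print([s.name for s in search_by_property(REGISTRY, "research_center", "Center A")])
    print([s.name for s in search_by_concept(REGISTRY, "CODE-ARRAY")])
```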

Data access: CQL Builder

Service interaction: managing state

Security enforcement
  • Authentication
    Ability to invoke services secured by Grid Security Infrastructure (GSI)
    Integrated caGrid security framework (GAARDS) with Taverna's Credential Manager
    Transport Level Security
  • Authorization
    This is done on the service side, based on the user's credentials
  • Credential Delegation Service integration

Secure Grid services
  • Taverna can invoke secure Grid services that require the user to log in to caGrid
  • Taverna interacts with caGrid's GAARDS infrastructure to obtain the user's proxy:
    Authenticate the user with the user's affiliated Authentication Service
    Obtain the user's proxy from the Dorian Service
    Default proxy lifetime: 12 hours

Using secure caGrid services
Involves:
  • Discovering a secure caGrid service from Taverna
  • Logging onto the selected caGrid to obtain a proxy certificate
  • Saving and managing caGrid proxies, usernames, and passwords

Configuring secure services (1/2)
  • Authentication Service and Dorian Service URLs are required in order to obtain the user's proxy
  • Can be configured globally for all services from the same caGrid (in preferences)
  • Can be configured individually for a particular caGrid service (overrides the configuration from preferences)

Configuring secure services (2/2)
  • View the secure service's details
  • Configure the service's security properties
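
The precedence rule described under "Configuring secure services" (a per-service setting overrides the per-grid preference) can be sketched as a simple dictionary merge. This is only an illustration of the rule, not Taverna's configuration code; the key names, grid name, and `example.org` URLs are assumptions.

```python
REQUIRED_KEYS = ("authentication_service_url", "dorian_service_url")

# Hypothetical global preferences, keyed by grid name.
GLOBAL_PREFERENCES = {
    "training-grid": {
        "authentication_service_url": "https://auth.example.org/AuthenticationService",
        "dorian_service_url": "https://dorian.example.org/Dorian",
    }
}

# Hypothetical per-service overrides, keyed by service URL.
SERVICE_OVERRIDES = {
    "https://svc.example.org/SecureService": {
        "dorian_service_url": "https://dorian-alt.example.org/Dorian",
    }
}


def resolve_security_config(grid_name, service_url, preferences, overrides):
    """Start from the grid-wide preferences, then apply per-service overrides."""
    config = dict(preferences.get(grid_name, {}))
    config.update(overrides.get(service_url, {}))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing security settings: {', '.join(missing)}")
    return config


if __name__ == "__main__":
    print(resolve_security_config(
        "training-grid",
        "https://svc.example.org/SecureService",
        GLOBAL_PREFERENCES,
        SERVICE_OVERRIDES,
    ))
```

Failing early when either URL is missing mirrors the slide's point that both are required before a proxy can be obtained.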

Logging onto caGrid
  • The user is prompted for their caGrid username and password the first time any secure service is invoked from a workflow

Credential management
  • Taverna obtains a proxy for the user from the Dorian Service using the user's caGrid username and password
  • Proxies are saved and managed by the Credential Manager
  • The caGrid username and password can also be remembered
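
A minimal sketch of the credential-management behavior described above: a proxy is fetched once, cached, and fetched again only after it expires. The 12-hour default comes from the slides; everything else (the `ProxyCache` class, the `fetch_proxy` callback standing in for the Dorian login, the in-memory storage) is a hypothetical illustration and not Taverna's Credential Manager.

```python
from datetime import datetime, timedelta, timezone

# Default proxy lifetime stated in the slides.
DEFAULT_PROXY_LIFETIME = timedelta(hours=12)


class ProxyCache:
    """Caches one proxy per user and refreshes it once it has expired."""

    def __init__(self, fetch_proxy, lifetime=DEFAULT_PROXY_LIFETIME):
        # fetch_proxy(username, password) is a hypothetical login callback.
        self._fetch_proxy = fetch_proxy
        self._lifetime = lifetime
        self._entries = {}  # username -> (proxy, expiry time)

    def get(self, username, password, now=None):
        if now is None:
            now = datetime.now(timezone.utc)
        entry = self._entries.get(username)
        if entry is not None and now < entry[1]:
            return entry[0]  # cached proxy is still valid
        proxy = self._fetch_proxy(username, password)
        self._entries[username] = (proxy, now + self._lifetime)
        return proxy


if __name__ == "__main__":
    cache = ProxyCache(lambda user, _password: f"proxy-for-{user}")
    print(cache.get("alice", "secret"))
```

Passing `now` explicitly keeps the expiry logic easy to exercise without waiting for real time to pass.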

Workflow execution service
Taverna Workflow Service wraps the Taverna execution engine into a WS-Resource and exposes operations such as createResource, startWorkflow, getStatus, and getOutput for user-submitted workflows.

[Diagram labels: Client API, Taverna Workbench, Workflow Portlet; Workflow Service operations (createResource, startWorkflow, getStatus, getOutput); EPR; Stateful Resources (Resource Properties); Taverna Engine; Data Services, Analytical Services, caGrid & Other Services]

Workflow execution service
Taverna Workflow Service:
  • Provides stateful resources that execute the workflows
  • Supports the caGrid security architecture (GSI Security)
  • Allows programmatic submission of workflows
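
To show how a client might drive the four operations named above, here is a sketch that uses an in-memory stand-in for the service. Only the operation names (createResource, startWorkflow, getStatus, getOutput) come from the slides; the argument lists, the status strings, the resource identifiers, and the polling loop are assumptions, and the real service is a WSRF Grid service rather than a Python class.

```python
import itertools
import time


class InMemoryWorkflowService:
    """Toy stand-in for a stateful workflow service (not the caGrid API)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._resources = {}

    def createResource(self, workflow_definition):
        # Returns a handle playing the role of an endpoint reference (EPR).
        epr = f"resource-{next(self._ids)}"
        self._resources[epr] = {
            "definition": workflow_definition,
            "status": "Created",
            "polls": 0,
            "inputs": None,
            "output": None,
        }
        return epr

    def startWorkflow(self, epr, inputs):
        resource = self._resources[epr]
        resource["inputs"] = inputs
        resource["status"] = "Active"

    def getStatus(self, epr):
        resource = self._resources[epr]
        if resource["status"] == "Active":
            resource["polls"] += 1
            if resource["polls"] >= 2:  # pretend the run finishes on the second poll
                resource["status"] = "Completed"
                resource["output"] = {"echo": resource["inputs"]}
        return resource["status"]

    def getOutput(self, epr):
        resource = self._resources[epr]
        if resource["status"] != "Completed":
            raise RuntimeError("workflow has not completed yet")
        return resource["output"]


def run_workflow(service, definition, inputs, poll_seconds=0.0, max_polls=10):
    """Create a resource, start the workflow, then poll until it completes."""
    epr = service.createResource(definition)
    service.startWorkflow(epr, inputs)
    for _ in range(max_polls):
        if service.getStatus(epr) == "Completed":
            return service.getOutput(epr)
        time.sleep(poll_seconds)
    raise TimeoutError(f"{epr} did not complete after {max_polls} polls")


if __name__ == "__main__":
    service = InMemoryWorkflowService()
    print(run_workflow(service, "<workflow/>", {"experimentId": "EXP-1"}))
```

Keeping the run state on the service side is what lets a client disconnect after startWorkflow and come back later with the same resource handle.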

Access Taverna workflow via caGrid portal
The Taverna Workflow Portlet is deployed in the caGrid Portal on the training Grid:
URL: http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow

The Portlet currently lists a few workflows with their descriptions, which can be browsed from the above URL.

Users can select a workflow they are interested in running.

View: 1

Access Taverna workflow via caGrid portal
URL: http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow

Based on the number of input ports in the workflow, the portlet prompts the user to enter the input values in text boxes.

For example, the Lymphoma workflow takes only one input, in the form of an Experiment ID that identifies the experiment that caArray uses for data collection.

Click Submit after entering the data.

View: 2

Access Taverna workflow via caGrid portal
URL: http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow

The portlet stores the user-submitted workflows in the current session of the portal.

Users can view all the Active and Completed workflows in the session.

Clicking the Output button shows the output of the workflow.

The portlet provides workflow-specific view resolvers to render the outputs. For example, the Lymphoma workflow currently displays its output in an HTML table.

Views: 3, 4, & 5

Knowledge Sharing
  • Search "cabig" in myExperiment, or type http://www.myexperiment.org/search?type=workflows&query=cabig
  • Or type http://tinyurl.com/cabig-workflow

Discovery using myExperiment
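
The myExperiment search link on the Knowledge Sharing slide follows a simple query-string pattern, so the same search can be built for other keywords. The sketch below only constructs the URL shown on the slide; the helper name `workflow_search_url` is hypothetical, and the code does not contact the site or assume anything about the response.

```python
from urllib.parse import urlencode

# Base address taken from the search link shown on the slide.
MYEXPERIMENT_SEARCH = "http://www.myexperiment.org/search"


def workflow_search_url(query):
    """Build a myExperiment workflow search URL for the given keyword."""
    params = urlencode({"type": "workflows", "query": query})
    return f"{MYEXPERIMENT_SEARCH}?{params}"


if __name__ == "__main__":
    print(workflow_search_url("cabig"))
```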


Lymphoma Prediction Workflow

[Diagram labels: Microarray from tumor tissue; Microarray preprocessing; Lymphoma prediction]

Lymphoma type prediction
Acknowledgement: Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI); Jared Nedzel (MIT)

AutoQRS Analysis Workflow

[Diagram labels: WFDB binary and Patient ID; WFDB data service; Store WFDB; Retrieve WFDB Patient Record; JSDL service; Invoke Processing; AutoQRS Analytical Service; Analysis Execution Record; AutoQRS XML Results; AutoQRS Output Data Service]

The Taverna workflow

The result in MS Excel

Accomplishments
  • The Lymphoma workflow is among the top 20 most viewed/downloaded workflows in myExperiment
    This is more impressive given that this workflow was uploaded much later than the other workflows
  • Our BMC Bioinformatics article, "caGrid Workflow Toolkit: A Taverna based workflow tool for cancer Grid", achieved "Highly Accessed" status relative to its age
  • We are part of the CVRG Project, which was recently renewed

Lessons Learned
  • Lower the barriers to entry for sharing data and analytics
  • Software is surprisingly hard for end users to use, more so if the benefit is not clear
  • The Return on Investment of a SOA is in creating reusable workflows (LEGO blocks)
  • Workflows are only as good as the services we create
  • Traditional SDLC does not always work in favor of the end users
  • 80-20 and KISS

Goals of Workflow Project in CVRG
  • Deploy existing technology on the CVRG that can be used to store and execute workflows generated locally using the Taverna workbench
  • Develop new technology that allows non-expert users to graphically compose and execute workflows via a web interface
  • Extend the Taverna Engine and add support for invocation of REST-style services so that users can annotate workflow inputs and outputs using ontology terms from NCBO BioPortal and other ontology repositories
  • Develop specifications describing how workflows should be designed, validated, and documented, and support user development of workflows
  • Extend the technology so that workflows can be executed in a cloud-computing environment

Suggested Direction
  • Hosted Workflow Solution
  • SaaS workflow tools
  • Globus Online
  • Galaxy

Acknowledgements
  • Univ. Chicago / ANL: Ian Foster, Dinanath Sulakhe, Bo Liu
  • Univ. Manchester, UK: Carole Goble, Stian Soiland-Reyes, Alexandra Nenadic
  • Inventrio: Shannon Hastings, Stephen Langella, Scott Oster
  • Other colleagues from Ohio State University, National Cancer Institute, JHU

Journal papers & book chapters
  • Composition as a Service. IEEE Internet Computing, 2010.
  • A Comparison of Using Taverna and BPEL in Building Scientific Workflows: the case of caGrid. CCPE, 2010.
  • Data-driven Service Composition in Building SOA Solutions: A Petri Net Approach. IEEE T-ASE, 2010.
  • Scientific workflows that enable Web-scale collaboration: combining the power of Taverna and caGrid. IEEE Internet Computing, 2008.
  • Workflow in a Service Oriented Cyberinfrastructure Environment. In: Junwei Cao (Ed.), Cyberinfrastructure Technologies and Applications. Nova Science Publishers, 2008. (book chapter)

Conference papers
  • Scientific workflows as services in caGrid: a Taverna and gRAVI approach. ICWS 2009.
  • Wrap Scientific Applications as WSRF Grid Services using gRAVI. ICWS 2009.
  • Orchestrating caGrid Services in Taverna. ICWS 2008.
  • Building Scientific Workflow with Taverna and BPEL: a Comparative Study in caGrid. WESOA 2008.
  • Build Grid Enabled Scientific Workflows using gRAVI and Taverna. SWBES 2008.

Contact information
Ravi [email protected]
Computation Institute, Univ. Chicago
http://www.ci.uchicago.edu/

[Architecture diagram labels: Taverna workbench; caFlow; Index; Metadata; Cancer Data Standards Repository; Security services (authen., credential delegation, ...); Analytical services; Data services]

Numbered steps in the diagram:
(1) Service discovery
(2) Data access
(3) Service invocation
(4) Security enforcement
(5) Knowledge sharing