Improving the Reuse of Scientific Workflows and their By-products
Xiaorong Xiang
National Evolutionary Synthesis Center (NESCent)
Duke University, University of North Carolina - Chapel Hill, and North Carolina State University
Gregory Madey
Department of Computer Science and Engineering
University of Notre Dame
2007 IEEE International Conference on Web Services (ICWS 2007)
Salt Lake City, Utah, July 2007
Supported in part by the Indiana Center for Insect Genomics (ICIG) & the Indiana 21st Century Fund
3
Outline: two parts
• Production system (MoGServ) for bioinformatics workflow
  • Bioinformatics application
  • Productivity improvement
• Prototype system exploring ideas for end-user composition
  • Workflow reuse
  • Knowledge management/discovery
4
From the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer, CTWatch Quarterly, volume 2, number 3, August 2006
Bioinformatics today
• Rapidly accumulating data: DNA sequences, contigs, expression data, annotations, etc.
• Non-standard, independently developed, heterogeneous data sources
• Data sharing and security
• Productivity problem!
5
SOA in Bioinformatics
• MORE community efforts needed to provide more shared and reliable services
• More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
• Recent exposure of data & analysis tools as services
  • Large public databases and bioinformatics tools
• Middleware projects
  • Provide infrastructure to compose, manage, execute, and connect the distributed services
6
Mother of Green (MoG) project
• Biological science
  • In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
  • Study the deep phylogeny of plastids
• Computer science
  • Provide an environment to support scientists’ investigations
  • A case study of using SOA for data and application integration
  • A prototype for future research in the service-oriented architecture domain
7
Mother of Green
• Malaria causes 1.5 - 2.7 million deaths every year
• 3,000 children under age five die of malaria every day
• Plasmodium falciparum (a protozoan parasite) causes human malaria
• Drug resistance a world-wide problem
• Targeted drug design through phylogenomics
[Figure: P. falciparum]
8
Mother of Green
• P. falciparum has three genomes: nuclear, mitochondrial, plastid
• Animals and insects have only two
• Target the third genome
• No harm to animals
• New antimalarial drug
• High risk, high tech, high payoff
J. Romero-Severson, Department of Biological Sciences
Greg Madey & Xiaorong Xiang, Department of Computer Science & Engineering
9
Mother of Green
• Plastids are the third genome
• Intracellular organelles
• Terrestrial plants, algae, apicomplexans
• Functions in plants and algae
  • Photosynthesis
  • Oxidation of water
  • Reduction of NADP
  • Synthesis of ATP
  • Fatty acid biosynthesis
  • Aromatic amino acid biosynthesis
• Functions in apicomplexans?
[Figure panels: chloroplast in plant cell; plastid in Toxoplasma sp.; apicoplast in P. falciparum]
10
Mother of Green
• The apicoplast appears to code for <30 proteins.
• Repair, replication and transcription proteins
• Why is the apicoplast essential?
11
Mother of Green: Phylogenomics
• Find the ancestors of the apicoplast
• Identify genes in the ancestors
• Determine gene function
• Look for these genes in the P. falciparum nucleus
• Then study regulatory mechanisms in candidate genes
12
Phylogenomics of plastids
• Very old lineage (> 2.5 billion years)
• Cyanobacterial ancestor
• Three main plastid lineages
  • Glaucophytes
    • Group of freshwater algae
    • Chloroplast resembles intact cyanobacteria
  • Chlorophytes
    • Green plant lineage
    • Chloroplast genome reduced
    • Many chloroplast genes now in nuclear genome
  • Rhodophytes
    • Red algal lineage
    • Chloroplast genome bigger than in green plants
    • Oomycetes
    • Apicomplexans
13
Phylogenomics of plastids
• One cyanobacterial ancestor?
• Many?
• Lineages are not linear
[Figure panels: one plastid origin; multiple plastid origins]
14
The process of endosymbiosis.
Horizontal Gene Transfer (arrows) from the plastid to the nucleus.
The nucleomorph is a remnant of the original endosymbiont nucleus.
[Figure labels: cyanobacteria; primitive eukaryote; nucleus; endosymbiont plastid; second eukaryote; secondary endosymbionts; nucleomorph; secondary nonphotosynthetic endosymbiont (plastid disappears)]
15
Tertiary endosymbiosis. Horizontal Gene Transfer.
[Figure labels: secondary endosymbiont; third eukaryote; tertiary endosymbionts; tertiary nonphotosynthetic endosymbiont (plastid disappears); P. falciparum]
16
The information gathering problem
• Rapid accumulation of raw sequence information
  • ~100 sequenced chloroplast genomes
  • ~57 sequenced cyanobacterial genomes
  • Rate of accumulation is increasing
  • Information accumulates faster than analyses finish
  • Information in forms not readily accessible
• Solution
  • Semi-automated web services
  • “Smart” web services
  • Semantic web
17
A typical in-silico investigation – Data driven research
A: Query complete genome sequences given a taxon
B: Query protein-coding genes for each genome sequence
C: Eliminate vector sequences
D: Sequence alignment
E: Phylogenetic analysis
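The five steps A to E form a linear pipeline, which is what makes them a candidate for automation. The following is a minimal sketch of that chaining, assuming each step is available as a callable; the function names and the toy stand-ins are hypothetical and are not part of MoGServ.

```python
from typing import Callable, List


def run_investigation(
    taxon: str,
    query_genomes: Callable[[str], List[str]],
    query_genes: Callable[[str], List[str]],
    remove_vector: Callable[[str], str],
    align: Callable[[List[str]], List[str]],
    build_tree: Callable[[List[str]], str],
) -> str:
    """Chain steps A-E so that no manual copy & paste is needed between them."""
    genomes = query_genomes(taxon)                                   # step A
    genes = [g for genome in genomes for g in query_genes(genome)]   # step B
    cleaned = [remove_vector(g) for g in genes]                      # step C
    alignment = align(cleaned)                                       # step D
    return build_tree(alignment)                                     # step E


# Toy stand-ins (hypothetical) so the sketch runs without any remote service.
def toy_query_genomes(taxon: str) -> List[str]:
    return [f"{taxon}-genome1", f"{taxon}-genome2"]


def toy_query_genes(genome: str) -> List[str]:
    return [f"{genome}:atpA", f"{genome}:atpB"]


def toy_remove_vector(sequence: str) -> str:
    return sequence.replace("VECTOR", "")


def toy_align(sequences: List[str]) -> List[str]:
    # Pad to equal length; a real step would call an alignment tool.
    width = max(len(s) for s in sequences)
    return [s.ljust(width, "-") for s in sequences]


def toy_build_tree(alignment: List[str]) -> str:
    # Emit a flat Newick-like string; a real step would run a phylogenetic analysis.
    return "(" + ",".join(alignment) + ");"


result = run_investigation(
    "cyano", toy_query_genomes, toy_query_genes,
    toy_remove_vector, toy_align, toy_build_tree,
)
```

Passing the steps in as callables keeps the pipeline independent of where each step runs, whether that is a local tool or a remote service.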
18
Time consuming manual web-based operations
Data collection Copy & paste!
Analysis tool usage Copy & paste!
Experiment data recording Copy & paste!
Repetitive experiments for scientific discovery Copy & paste!
Repeat as new data becomes available Copy & paste!
19
MoGServ system architecture
• MoGServ interface
  • Web interface
  • Application interface
• MoGServ middle layer
  • Data access storage
  • Data and analysis services
  • Service and workflow registry
  • Indexing and querying metadata
  • Service and workflow enactment
• Acting in two roles: service requester and service provider
[Figure: MoGServ system architecture. Clients: web interface, applications, services access client. MoGServ middle layer: application server, data access services, data analysis services, job manager, job launcher, service/workflow registry, metadata search, local data storage, workflow/SOAP engines. Data/services providers: NCBI, DDBJ, EMBL, others.]
21
Data storage and access services
Local database
• Integrating data from multiple data sources according to scientists’ interests
• Supporting repetitive investigations against several subsets of sequences
• Avoiding network traffic and service failures when retrieving data on the fly from public data sources
• Accessing the data in the local database through services
22
Service and workflow registry
• A table-based description with necessary properties
  • Text description
  • Service location
  • Input/output
  • Provider
  • Version
  • Algorithm
  • Invocation method
• Not intended for supporting service discovery or composition
• To answer end users’ questions about their results
  • Provenance: “Which algorithm was used to generate the data and what is the source of the input data?”
• A repository of services and workflows for local application developers
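A table-based registry entry with the properties listed above can be pictured as a simple record. The sketch below is an illustration only: the class names, field names, and the `provenance` helper are assumptions, not the MoGServ schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RegistryEntry:
    """One row of a table-based service/workflow registry (field names are illustrative)."""
    name: str
    description: str
    location: str            # where the service can be reached, e.g. a WSDL URL
    inputs: List[str]
    outputs: List[str]
    provider: str
    version: str
    algorithm: str
    invocation_method: str


class Registry:
    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}

    def register(self, entry: RegistryEntry) -> None:
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> Optional[RegistryEntry]:
        return self._entries.get(name)

    def provenance(self, name: str) -> str:
        """Answer 'which algorithm generated the data?' for a registered entry."""
        entry = self._entries[name]
        return f"{entry.algorithm} (version {entry.version}), provided by {entry.provider}"
```

Because every result can be traced to a registry row, the provenance question from the slide reduces to a lookup by service or workflow name.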
23
Indexing and querying metadata
• Metadata
  • Service and workflow descriptions
  • Description of sequence data in order to track the origin of data
  • Experimental data: output, input, and intermediate data
• Indexing and querying with keywords
  • Lucene
  • Implemented as services
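Keyword indexing of metadata rests on an inverted index, which maps each term to the records that contain it. The sketch below shows that idea in plain Python under simplifying assumptions; it is not the Lucene API and not the MoGServ implementation, and the class and method names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class KeywordIndex:
    """A tiny inverted index: term -> ids of metadata records containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def _tokens(text: str) -> List[str]:
        # Lowercase and split on anything that is not a letter, digit, or underscore.
        return re.findall(r"[a-z0-9_]+", text.lower())

    def add(self, record_id: str, text: str) -> None:
        for token in self._tokens(text):
            self._postings[token].add(record_id)

    def search(self, query: str) -> List[str]:
        """Return sorted ids of records that contain every keyword in the query."""
        terms = self._tokens(query)
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)
```

A production index such as Lucene adds analysis, ranking, and persistence on top of this basic structure.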
24
Service and workflow enactment
[Figure: service and workflow enactment]
• Input: parameters, task name, timer
• Job Manager: finds the service/workflow definition in the Service/Workflow Registry using the task name and forms a job description
• Output: job ID
• Job Launcher: passes the job information to instances of workflow/service engines
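The enactment flow can be sketched as a job manager that looks up a definition by task name, forms a job description, hands it to a launcher, and returns a job ID. All class, field, and method names below are hypothetical, and the registry is reduced to a plain dictionary for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class JobDescription:
    job_id: str
    task_name: str
    definition: str                      # service/workflow definition found in the registry
    parameters: Dict[str, str] = field(default_factory=dict)
    timer_seconds: Optional[int] = None  # assumed encoding of the "timer" input


class JobManager:
    def __init__(
        self,
        registry: Dict[str, str],
        launcher: Callable[[JobDescription], None],
    ) -> None:
        self._registry = registry        # task name -> definition
        self._launcher = launcher        # stands in for the job launcher
        self._ids = itertools.count(1)
        self.jobs: Dict[str, JobDescription] = {}

    def submit(
        self,
        task_name: str,
        parameters: Optional[Dict[str, str]] = None,
        timer_seconds: Optional[int] = None,
    ) -> str:
        definition = self._registry.get(task_name)
        if definition is None:
            raise KeyError(f"no service or workflow registered for task {task_name!r}")
        job = JobDescription(
            job_id=f"job-{next(self._ids)}",
            task_name=task_name,
            definition=definition,
            parameters=dict(parameters or {}),
            timer_seconds=timer_seconds,
        )
        self.jobs[job.job_id] = job      # keep job information for later queries
        self._launcher(job)              # hand off to a workflow/service engine
        return job.job_id
```

Returning only the job ID lets the caller poll for results later, which suits long-running analyses.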
25
Implementation
• Development and deployment
  • J2EE, JSP, XSLT
  • Tomcat 5.0.18 / Axis 1.2
• Database
  • PostgreSQL 8.1
• Index and search of metadata
  • Apache Lucene library
• Service implementation
  • Java2WSDL
  • Wrap command-line applications with the JLaunch library
• Workflow
  • Taverna workbench, part of the myGrid project
  • Freefluo workflow engine
26
Data and services
• Services, workflows
  • Data collection from remote databases
  • Query local database
  • Data analysis tools: blast, clustalw
  • Data format conversion: readseq
  • Management of data sets and jobs
  • Download and upload
• Data
  • Complete genome sequences
  • ATP gene sequences
  • Sequence sets
  • Saved jobs
29
Improvement opportunities
• Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
• Integrate semantic web technology to support end users’ workflow creation based on their knowledge of the scientific domain
• Support users with limited knowledge of scientific processes
• Record various workflow representations
• Facilitate the discovery and reuse of prior workflows
  • Knowledge management
  • Knowledge discovery
30
Service Composition and workflows
• Service composition
  • Ad hoc
  • Semi-automated: semantic annotation + reasoning
  • Automated: semantic annotation + planning
• Scientific workflows
  • Workflows composed on a service-oriented architecture to assist scientists in accessing and analyzing data
31
Current workflow management systems
• Existing workflow management systems and bioinformatics middleware
  • Taverna, Kepler, Triana, Pegasus
  • Design, execute, monitor, re-run
• Support ad hoc, semi-automated, and automated service discovery and composition from scratch
32
Our approach
• Reuse the verified knowledge and workflows in the community
  • Increase the correctness of composed workflows over time
  • Provide more accurate guidelines for users
• A four-level hierarchical workflow structure
• An enhanced workflow system
33
Three user-defined workflows from different views
Question: “Are gene genealogies for ATP subunits α, β, and γ different?”
• Workflow A, defined by a less experienced user using the functional definition of services: Retrieving, Aligning
• Workflow B, defined by an intermediate user with executable services: queryGene, clustalW
• Workflow C, defined by an expert user with two extra executable services to ensure the accurate output of the biological process: queryGene, setIds, setFilter, clustalW
34
Enhanced workflow system (MoGServ)
[Figure components:]
• User: create abstract workflow using ontology
• Service annotator: annotate services using ontology
• Ontology and OWL DL reasoner
• Semantics-enabled service registry
• Semantics-enabled service discovery and service matchmaking
• Workflow composer (software agent/experienced users): find appropriate service; produces the concrete workflow
• Workflow execution engine
• Data provenance management: collect and manage information about data origin
• Knowledge base management
• Knowledge discovery
35
Our hierarchical workflow structure
• Abstract workflow (Task A, Task B): the high-level definition
• Concrete workflow (Service A, Service B, Service C, Service D): encode and convert the high-level definition to a low-level executable
• Optimal workflow (Service A, Service B, Service C’, Service D): replace individual services with their optimal alternatives
• Workflow instance (input, output): invoke a workflow with specific input data and record the data provenance and the performance of services and workflows

Pegasus workflow structure
• Abstract workflow: FFT filea
• Concrete workflow: data transfer (move filea from host1://home/filea to host2://home/file1), /usr/local/bin/fft /home/file1, data registration
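The step from abstract to concrete to optimal workflow can be sketched as two small transformations. The mapping tables, the quality scores, and the ordering of services within a task are assumptions made for illustration; the function names are hypothetical.

```python
from typing import Dict, List


def to_concrete(abstract: List[str], task_to_services: Dict[str, List[str]]) -> List[str]:
    """Expand each abstract task into the executable services that implement it."""
    return [service for task in abstract for service in task_to_services[task]]


def to_optimal(
    concrete: List[str],
    alternatives: Dict[str, List[str]],
    quality: Dict[str, float],
) -> List[str]:
    """Replace each service with its best-scoring alternative, if one scores higher."""
    optimal = []
    for service in concrete:
        candidates = [service] + alternatives.get(service, [])
        # max() keeps the first candidate on ties, so the original service wins a tie.
        optimal.append(max(candidates, key=lambda s: quality.get(s, 0.0)))
    return optimal


abstract_workflow = ["Task A", "Task B"]
concrete_workflow = to_concrete(
    abstract_workflow,
    {"Task A": ["Service A", "Service B"], "Task B": ["Service C", "Service D"]},
)
optimal_workflow = to_optimal(
    concrete_workflow,
    alternatives={"Service C": ["Service C'"]},
    quality={"Service C": 0.6, "Service C'": 0.9},   # hypothetical quality profile
)
```

A workflow instance would then pair the optimal workflow with specific input data and record provenance as it runs.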
36
Reusable knowledge
• Connectivity
  • Helps to convert from abstract workflow to concrete workflow
• Alternative services
  • Helps to convert from concrete workflow to optimal workflow
• Quality profile of services
  • Helps discover optimal workflows
• Mapping of abstract workflow and concrete workflow
  • Helps to choose reusable workflows
37
Connectivity identification (match detection)

Service: QueryLocal
Operation: createSet
performTask: mygrid:retrieving
inputPara: Settype(String, mog:gene), Queryterm(String, null)
outputPara: Setid(string, mog:geneset)
useResource: MoG

Service: ClustalW
Operation: runClustalWdf
performTask: mygrid:aligning
inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence)
outputPara: filen(string, mygrid:sequence_alignment_report)
useResource: EBI

Service: FormatConversion
Operation: convert
performTask: mygrid:translating
inputPara: filen(String, mygrid:sequence_alignment_report)
outputPara: Out(String, mygrid:nexus_paup_format)
useResource: MoG

Parameter notation: name(data type, semantic type)

Matching rule: operation_ij → operation_mn if there exists a parameter_k that is an output parameter of operation_ij and a parameter_o that is an input parameter of operation_mn such that data type(parameter_o) = data type(parameter_k) and semantic type(parameter_o) = semantic type(parameter_k).
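The matching rule can be read as a pairwise check between the outputs of one operation and the inputs of another. The sketch below applies it to the three annotated operations, assuming that data types are compared case-insensitively (the annotations mix "string" and "String") and that semantic types must be strictly equal; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: str


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def matched_pairs(producer: Operation, consumer: Operation) -> List[Tuple[Parameter, Parameter]]:
    """Return (output, input) pairs whose data type and semantic type both agree."""
    return [
        (out, inp)
        for out in producer.outputs
        for inp in consumer.inputs
        # Assumption: "string" and "String" denote the same data type.
        if out.data_type.lower() == inp.data_type.lower()
        and out.semantic_type == inp.semantic_type
    ]


def can_connect(producer: Operation, consumer: Operation) -> bool:
    return bool(matched_pairs(producer, consumer))


create_set = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", "null")),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
run_clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)
```

Under strict equality, `run_clustalw` connects to `convert`, while `create_set` does not connect to `run_clustalw` because `mog:geneset` and `mog:set` differ. Treating one as a subtype of the other would need ontology reasoning, which this sketch leaves out.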
38
Need for verified service connectivity: the mismatching problem

Match detection output vs. real match:
                       Real match: Yes   Real match: No
Match detected: Yes    TP                FP
Match detected: No     FN                TN

• TP, TN: accurate annotation
• FP, FN: inaccurate annotation, lack of semantic annotation, inaccurate reasoning
• TN example, can be detected automatically: GenBankService (out: GenBank record) does not match Blastp (in: protein sequence)
• FP example, may be detected by experts at design time or after a run: DDBJ-XML (out: sequence data record) and NCBI blast (in: sequence data record) appear to match, but the formats differ (fasta format vs. self-defined format), so a mediator, adaptor, or shim is needed
39
Connectivity graph implementation
• Registration process
  • Automatically identify the connectivity against the registry
  • Store the connectivity in the knowledge base
• Workflow translation / service composition process
  • Refine, update, decompose the workflow
• connect(service_a, operation_ai, parameter_c, service_b, operation_bi, parameter_d)
• identifyConnect(single service, RDF repository)
• Search at the syntactic level
  • Search for a path between two nodes
  • Search for the next available service
  • Automatic composition based on input and output
• Implementation: Dijkstra's shortest-path algorithm
• Finding connectivity between services is converted to finding a path between two nodes in a graph
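Once connectivity is stored as a graph, composing services between a starting operation and a target operation becomes a path search. The following is a minimal sketch of Dijkstra's algorithm over such a graph, assuming unit edge weights and node names of the form `Service.operation`; both are illustrative choices, not details given in the slides.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Adjacency list: node -> list of (neighbor, edge weight).
Graph = Dict[str, List[Tuple[str, float]]]


def shortest_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Return the lowest-cost path from source to target, or None if unreachable."""
    dist: Dict[str, float] = {source: 0.0}
    prev: Dict[str, str] = {}
    queue: List[Tuple[float, str]] = [(0.0, source)]
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue                      # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path


# Hypothetical connectivity graph; each arc means "an output can feed an input".
connectivity: Graph = {
    "QueryLocal.createSet": [("ClustalW.runClustalWdf", 1.0)],
    "ClustalW.runClustalWdf": [("FormatConversion.convert", 1.0)],
    "FormatConversion.convert": [],
}
path = shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert")
```

With unit weights the result is the composition with the fewest services; weights drawn from a quality profile would favor better-performing services instead.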
40
Ontological modules used for semantic description of data, services & workflows
• Generic service description ontology (myGrid/Feta model)
• Service domain ontology (myGrid)
• MoGServ application domain ontology (MoGServ)
• Described resources: data, services, workflows
• Software components for annotation
• RDF store
41
MoGServ application domain ontology
• To better track the origin of data
• To support the automation of workflow creation
• To better share the data on the web in the future

Example concepts and properties defined in MoGServ
Property       Domain   Range
invokedby      Job      User
isParentOf     Set      Set
isInstanceOf   Job      Service
hasSetName     Set      XML:String

Ontological modules: number of concepts, number of properties (object, datatype)
MoGServ 12 9 7
myGrid 419 8
myGrid/Feta model 26 11 17
42
Sample service/workflow annotation
Question: Which service has an operation that accepts nucleotide_sequence as a parameter?
Answer:
Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi
OperationName: Run
Displayed by RDF-Gravity
43
Implementation of annotation and query components for data, services & workflows
• Sesame 1.2.6 library
  • Supports files, RDBMS, SeRQL
• Components: annotation templates (data), annotation templates (service), annotation components, query templates, query components, Sesame RDF store

SeRQL query:
  Select Y, W, X
  from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
  using namespace
    rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
    mg = <http://www.mygrid.org.uk/ontology#>,
    mog = <http://almond.cse.nd.edu:10000/mog#>

Result:
  Service: http:host.cse.nd.edu/axis/services/ClustalW?wsdl
  Operation: runClustalWdf
  inputParameter: setid
44
Experiment
• Used 418 concepts from the domain ontology for semantic types; defined 10 concepts for data types
• Randomly generated service annotations: 1 input, 1 output
• Intel Pentium mobile, 1.5 GHz

Services   Matched pairs   Load RDF repository (ms)   Average match detection time per service (ms)
200        10              1547                       12.02
400        34              2346                       13.01
600        84              2600                       12.31
800        138             3015                       12.35
1000       225             3325                       12.51

Connectivity graph for 1000 services
• Number of nodes: 724
• Number of arcs: 587
• Average path search time: less than 1 ms
• Connectivity graph load time: 220 ms
• Paths by length: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2

Conclusion: feasible solution.
45
Reuse of workflows
• Reuse of abstract workflows
• Reuse of concrete workflows
• Compare the structural similarity of two workflows
• Implementation: SUBDUE algorithm
  • SUBDUE has a graph match utility that is part of its data mining system
  • A given workflow is converted to a graph and fed to the SUBDUE match algorithm

Abstract example
[Graph view: nodes input, output, task, task, query_term, retrieving, aligning, multiple_alignment_report; edges hasInput, hasOutput, hasNext, performTask, hasParameter]

SUBDUE input format:
v 1 input
v 2 output
v 3 task
v 4 task
v 5 query_term
v 6 retrieving
v 7 aligning
v 8 multiple_alignment_report
e 3 4 hasNext
e 3 1 hasInput
e 4 2 hasOutput
e 3 6 performTask
e 4 7 performTask
e 1 5 hasParameter
e 2 8 hasParameter
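Converting a workflow to the vertex and edge listing shown above is a mechanical step. The sketch below serializes a labeled graph into the same `v`/`e` line layout as the slide's example; the function name and the in-memory representation are assumptions, and only the line layout shown in the slide is reproduced.

```python
from typing import List, Tuple


def to_subdue(vertices: List[str], edges: List[Tuple[int, int, str]]) -> str:
    """Serialize a labeled graph as 'v <id> <label>' and 'e <src> <dst> <label>' lines.

    Vertex ids are 1-based positions in the vertices list.
    """
    lines = [f"v {i} {label}" for i, label in enumerate(vertices, start=1)]
    for src, dst, label in edges:
        if not (1 <= src <= len(vertices) and 1 <= dst <= len(vertices)):
            raise ValueError(f"edge ({src}, {dst}) refers to an unknown vertex")
        lines.append(f"e {src} {dst} {label}")
    return "\n".join(lines)


# The abstract example from the slide: a retrieving task followed by an aligning task.
vertices = [
    "input", "output", "task", "task",
    "query_term", "retrieving", "aligning", "multiple_alignment_report",
]
edges = [
    (3, 4, "hasNext"),
    (3, 1, "hasInput"),
    (4, 2, "hasOutput"),
    (3, 6, "performTask"),
    (4, 7, "performTask"),
    (1, 5, "hasParameter"),
    (2, 8, "hasParameter"),
]
subdue_text = to_subdue(vertices, edges)
```

Two workflows serialized this way can then be compared by a graph matcher for structural similarity.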
46
Conclusion
• Pro
  • Increase the correctness of the formed workflow over time
    • Avoid incorrect, inaccurate semantic annotations
    • Take advantage of verified knowledge
    • Avoid the ontological reasoning process
  • Better support for semi-automated and automated service composition over time
  • Provide more accurate guidelines to users over time
• Con
  • The connectivity graph can be big
    • Number of parameters
    • Number of services
  • Searching the connectivity of a service when it is registered in the system may take a relatively long time
    • More complex matching rule
    • Number of parameters
  • May not have high accuracy at the beginning
47
Future work
• Integrate GridSam into MoGServ for execution and monitoring
• Integrate Grid computing technology for resource allocation
• Refine the MoGServ application domain ontology
• Create an interface for end-user workflow creation
• Create an interface for individual workspaces
• Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services