Middleware for Bioinformaticians: Lessons from the myGrid Project


  • Middleware for Bioinformaticians: Lessons from the myGrid Project. Carole Goble and the myGrid consortium, University of Manchester, UK. http://www.mygrid.org.uk

  • EPSRC funded UK e-Science Programme Pilot Project. Particular thanks to the other members of the Taverna project, http://taverna.sf.net

  • e-Science is about global collaboration in key areas of science and the next generation of [computing] infrastructure that will enable it.

    Sir John Taylor, Director, Office of Science and Technology, UK

  • Science = Science + e-Science. Discovery is increasingly done in silico, on results obtained from experiments, using computational analysis and data repositories. A new era of collection-based and simulation-based science, in addition to hypothesis-driven and experimental science. [Diagram: hypothesis, prediction, experiment, results, then analysis, mining and integration feeding back into new hypotheses]

  • Bioinformatics The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics. http://www.informatics.jax.org/mgihome/other/glossary.shtml

  • What does a bioinformatician do all day?

  • Williams-Beuren Syndrome (WBS). A contiguous, sporadic gene-deletion disorder: 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis. Haploinsufficiency of the region results in the phenotype. Multisystem phenotype: muscular, nervous and circulatory systems. Characteristic facial features. Unique cognitive profile. Mental retardation (IQ 40-100, mean ~60; normal mean ~100). Outgoing personality, friendly nature, charming.

  • Williams-Beuren Syndrome: microdeletion on chromosome 7q11.23. [Diagram: Chr 7 (~155 Mb) and the ~1.5 Mb 7q11.23 region, showing patient deletion extents and genes including GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14; SVAS]

  • [Diagram: the workflow predicts candidate genes in the WBS critical region; the lab scientist verifies the predictions]

  • Filling a genomic gap in silico. Services published on the web, many without programmatic interfaces.

  • Filling a genomic gap in silico. Services published on the web, many without programmatic interfaces: public and local databases and data sets; sequence alignment algorithms; stochastic models for clustering gene expression data; gene prediction algorithms; protein-protein interaction algorithms; protein folding simulations; visualisation tools; literature searches; ontology services.

  • Filling a genomic gap in silico

  • Obstacles to linking up resources. Access to and understanding of distributed, heterogeneous information resources and applications: 1000s of relevant information sources and tools; an explosion in availability of experimental data, scientists' annotations and text documents (abstracts, eJournal articles, monthly reports, patents, ...); rapidly changing domain concepts, terminology and analysis approaches; constantly evolving data structures and data; continuous creation of new data sources; highly heterogeneous sources and applications; different policies for access and security; data and results of uneven quality, depth and scope.

  • The tedium of bioinformatics. Analyses are frequently repeated as new information is rapidly added to public databases. Time consuming, mundane and error prone, and you don't always get results. A huge amount of interrelated data is produced, handled in notebooks and files saved to a local hard drive, its provenance often lost. Much knowledge and know-how remains undocumented.

    The Bioinformatician does the analysis. The Bioinformatician is the middleware.

  • Reuse: adapting and sharing best practice and know-how across a community. Graves' Disease: Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK. Williams-Beuren Syndrome: Hannah Tipney, May Tassabehji, Andy Brass, St Mary's Hospital, Manchester, UK.

  • No single application: clinical records, proteomics, small molecules, computational steerage of heart simulation codes.

  • Cardiac Vulnerability to Acute Ischemia, http://www.bioeng.auckland.ac.nz

  • Cardiac Vulnerability to Acute Ischemia, Simulation Step

    Finite element bidomain solver. Simulation protocol: pace at 250 ms. Initial conditions: K+ 5.4 mmol/l. Parameters: shock strength 50 A. Mechanical model, electrophysiology models, blood perfusion bath model. A result file is produced for every 1 ms (7.3 MB) of a 200 ms simulation. Monitor, stop, checkpoint, discard; restart with different parameters; perturb initial conditions (stage 1 and stage 2 hypoxia). 1 week to run per simulation. Data analysis. Blanca Rodriguez, Oxford.

  • Integrative Biology: true in silico science. Hypothesis driven and predictive: validate and predict, generating new data. Computationally demanding, stateful, lengthy. The workflow is an experimental protocol, measurement based. Blanca's simulation codes are wrapped as Soaplab services.

  • WBS and Graves': in silico support for experiment. Information integration; stateless workflows. Assistive e-Science: more semantics; discover data sets and services.

  • [Workflow diagram] GenBank accession no -> GenBank entry -> seqret -> nucleotide sequence (FASTA), which feeds: GenScan (coding sequence, ORFs); prettyseq (translation/sequence file, good for records and publications); restrict (restriction enzyme map); cpgreport (CpG island locations and %); RepeatMasker (repetitive elements); an ncbiBlast wrapper (blastn vs nr and est databases); sixpack/transeq (6 ORFs, amino acid translation). The amino acid translation feeds: epestfind (identifies PEST sequences); pepcoil (predicts coiled-coil regions); pepstats (MW, length, charge, pI, etc.); pscan (identifies FingerPRINTS); SignalP, TargetP and PSORTII (hydrophobic regions, predicted cellular location); InterPro (identifies functional and structural domains/motifs); Pepwindow? Octanol?; a BlastWrapper (tblastn vs nr, est, est_mouse and est_human; blastp vs nr; URL inc. GB identifier). RepeatMasker output is sorted for appropriate sequences only. Regulatory analysis: TF binding prediction, promoter prediction and regulation element prediction identify regulatory elements in genomic sequence. Key: pink, outputs/inputs of a service; purple, tailor-made services; green, EMBOSS Soaplab services; yellow, Manchester Soaplab services.
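The pipeline above chains EMBOSS-style services from a GenBank accession through to sequence reports. A toy sketch of that chaining in Python (every function body here is a hypothetical stand-in; the real workflow invokes remote EMBOSS/Soaplab services via a Scufl workflow):

```python
def fetch_genbank(accession: str) -> str:
    """Stand-in for the GenBank retrieval step; returns a tiny FASTA record."""
    return f">{accession}\nATGGCCATTGTAATGGGCCGC"

def seqret(entry: str) -> str:
    """Stand-in for EMBOSS seqret: strip the FASTA header, keep the sequence."""
    return "".join(line for line in entry.splitlines() if not line.startswith(">"))

def sixpack(seq: str) -> list:
    """Stand-in for EMBOSS sixpack: here, just the three forward frame offsets."""
    return [seq[i:] for i in range(3)]

def run_pipeline(accession: str) -> dict:
    """Chain the steps, as the Scufl workflow chains the real services."""
    entry = fetch_genbank(accession)
    seq = seqret(entry)
    return {"sequence": seq, "frames": sixpack(seq)}

result = run_pipeline("AB012345")
print(len(result["frames"]))  # three forward frames in this toy sketch
```

The point of the sketch is the shape, not the biology: each service consumes the previous service's output, which is exactly what the workflow engine automates.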

  • Williams-Beuren workflows: characterisation of nucleotide sequence; identification of overlapping sequence; characterisation of protein sequence.

  • Experiment life cycle: discovering and reusing experiments and resources; managing the lifecycle, provenance and results of experiments; sharing services and experiments.

    Personalisation. Forming experiments. Executing and monitoring experiments.

  • Middleware for bioinformaticians: construct, manage and publish in silico experiments, chiefly as workflows, to link up your own and others' resources. Data intensive, upstream analysis. Workflow reuse: foundations for sharing and adapting workflows and resources, and their outcomes, based on semantic descriptions. Whole experiment lifecycle, including provenance.

  • Middleware for bioinformaticians: open domain services and resources; open community; open application; open model and open data; open architecture. Service-oriented architecture: loosely coupled, web services based; assemble your own components, designed to work together.

  • [Architecture diagram] Applications: third-party tools (Utopia, Haystack, LSID Launchpad), web portals, the Taverna e-Science workbench, Java applications, executable codes with an IDL. Core services: Feta semantic discovery and the GRIMOIRES federated UDDI+ registry (service and workflow discovery); the Freefluo workflow engine (workflow enactment); the KAVE metadata store and provenance capture with the myGrid ontology (metadata management); the mIR myGrid information repository (data management); e-Science coordination via an e-Science mediator, e-Science process patterns, e-Science events and LSID support; a notification service; Pedro semantic publication. External services: Soaplab, Gowlab, the AMBIT text extraction service, legacy applications, web services, OGSA-DAI databases, web sites, the OGSA-DAI DQP service. All of this sits on the myGrid information model over a web service (grid service) communication fabric, producing science outcomes.

  • Making, wrapping, publishing and discovering services

  • Workflow components: Scufl, the Simple Conceptual Unified Flow Language; Taverna, for writing and running workflows and examining results; Soaplab, which makes applications available; Freefluo, the workflow engine that runs workflows.

  • Data and metadata management: Life Science Identifiers; OWL and RDFS ontologies, to annotate and classify entities with a common vocabulary based on a common understanding; the RDF-based KAVE (Knowledge Added Value to Experiment) information repository; and a common information model for e-Science.

  • Layering models. Operation: name, description, task, method, resource, application. Service: name, description, author, organisation. Parameter: name, description, semantic type, format, transport type, collection type, collection format. [Diagram: an operation hasInput/hasOutput parameters; service subclasses include WSDL-based web services, Soaplab services, bioMoby services, workflows and local Java code.]

  • [Diagram] The enactor takes a workflow script, a failure policy, an alternates list and a metadata template. It discovers services via the service registry (with semantic annotations), invokes services (invocation + data), stores LSID-identified data in external data stores, writes LSIDs + metadata to the KAVE, publishes events to the event notification service, and records LSIDs + data in the information repository.

  • What are the outcomes for science, e-Science and computer science?

  • Biological outcomes: four workflow cycles, totalling ~10 hours. The gap was correctly closed and all known features identified. A pseudogene missed when working by hand was discovered. [Diagram: clones CTA-315H11 and CTB-51J22; genes ELN and WBSCR14]

  • [Diagram: the workflow predicts candidate genes in the WBS critical region; the lab scientist verifies] Robert Stevens, Hannah J Tipney, Chris Wroe, Tom Oinn, Martin Senger, Phillip Lord, Carole A Goble, Andy Brass and May Tassabehji, Exploring Williams-Beuren Syndrome Using myGrid, Bioinformatics 20:i303-i310, Proc 12th Intelligent Systems in Molecular Biology (ISMB), 31 Jul-4 Aug 2004, Glasgow, UK.

  • Bioinformatics e-Science outcomes: elapsed time to perform one pipeline cut from 2 weeks to 2 hours; data collection improved; other people have used and want to develop the workflows, which means describing them so they can be understood.

    Changed work practices: analysis all at once; service interoperability -> results integration.

  • Bioinformatics e-Science outcomes: cuts the time taken to perform one pipeline from 2 weeks to 2 hours. Much more systematic collection and analysis; more regularly undertaken; less boring; less prone to mistakes. Notification means you don't even have to initiate it. Reuse happens between teams. Other people have used and want to develop the workflows, which means describing and annotating them.

    Extra results found. Changed work practices. Which means we have a different problem to solve.

  • [Diagram: workflow reuse roles] Biologists search existing work, edit workflows, try out workflows, and register and annotate workflows and new services for reuse. Bioinformaticians create or wrap services (especially shim services), adapt workflow structure, parameterise services, fragment workflows, annotate (with free text and an ontology), deploy workflows and maintain the reuse/repurpose history. Workflow providers and third-party annotation providers supply services, workflows and workflow fragments.

  • Results Integration

  • Keeping track: a web of science. The relationships a BLAST report has with other classes of information. Jun Zhao, Chris Wroe, Carole Goble, Robert Stevens, Dennis Quan, Mark Greenwood, Using Semantic Web Technologies for Representing e-Science Provenance, in Proc 3rd International Semantic Web Conference, Hiroshima, Japan, Nov 2004.

  • Building a data model and viewing resultsLeaky pipes with prior process path dependencies and state

  • [Diagram] The scientist designs, initiates and steers a simulation from the Taverna workbench. The workflow definition is sent to the enactor, which runs the simulation processes; steering is done by manipulation of service state via a steering control. Process and data provenance are captured and stored by the myGrid metadata stores. Integrative Biology project, http://www.integrativebiology.ac.uk

  • Some lessons

  • The problem is (now) not connecting up and running the services. It's managing and visualising all the data results, and the metadata and the provenance records and stuff.

  • Activation energy: important for take-up and community building, and take-up leads to much better understanding. 1 hour to learn how to use the workflow environment. Service scavenge-and-go. Deal with legacy.

  • Services suck. The workflows are only as good as the services they link together. myGrid ships with access to > 1000. Bootstrapping services. Reliability. Stability. Alternates. Service provider partners.

  • Sharing takes effort. Unanticipated reuse by people you don't know, in automated workflows. The metadata needed pays off, but it is challenging and costly to obtain. Automated, service providers, network effects. Quality control. Misuse. Inappropriate use. Competitive advantage; intellectual property. Workflow design: local or licensed services.

  • An NCBI-BLAST description. Service name: Blast. Operation: execute; task: pairwise_local_aligning; resource: EMBL; application: blastn. Input parameter: name: accession; semantic type: EMBL Nucleotide sequence id; transport data type: string. Output parameter: name: Result; semantic type: sequence alignment report; transport data type: string.
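A service description like this can be rendered as a small data structure. A hypothetical sketch in Python (the field names follow the slide, not any published myGrid schema):

```python
from dataclasses import dataclass, field

@dataclass
class Parameter:
    """One input or output, with its semantic and transport types."""
    name: str
    semantic_type: str
    transport_type: str

@dataclass
class Operation:
    """A service operation annotated with task, resource and application."""
    name: str
    task: str
    resource: str
    application: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

# The NCBI-BLAST description from the slide, encoded directly.
blast = Operation(
    name="execute",
    task="pairwise_local_aligning",
    resource="EMBL",
    application="blastn",
    inputs=[Parameter("accession", "EMBL Nucleotide sequence id", "string")],
    outputs=[Parameter("Result", "sequence alignment report", "string")],
)
```

Encoding the description this way is what lets a discovery service like Feta match an operation's semantic types against a user's query rather than against brittle WSDL strings.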

  • Tiered specifications. Classes of services: domain semantic, unexecutable, potentials.

    Instances of services: business operational, executable, actuals. Wroe C, Goble CA, Greenwood M, Lord P, Miles S, Papay J, Payne T, Moreau L, Automating Experiments Using Semantic Data on a Bioinformatics Grid, in IEEE Intelligent Systems, Jan/Feb 2004.

  • Disposable software. Plan to throw it away. Separate e-Science research from e-Science development. Support your e-Science pioneers. [Diagram: a user-driven, early-adopter development track; a user-driven pioneer, technology-driven research track (lash-up, prototype 1 internal, prototype 2 external); and a migration track between them.]

  • Reusable software. Design for extensibility and reuse: open systems. Design for the generic but build from the specific. Separate CS research and development tracks. When you are interoperating, standards aren't boring, they are necessary. Standards mean you can use everyone else's stuff.

  • Interoperability and execution complexity. Layers of detail. A science/computer complexity mismatch.

  • Shim Services
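Shim services are small adapters that reconcile the output of one service with the input format the next service expects. A minimal hypothetical example: wrapping a bare nucleotide string in the FASTA record a downstream alignment service wants (the function name and formats are illustrative):

```python
def fasta_shim(sequence: str, identifier: str = "query") -> str:
    """Wrap a bare nucleotide string in a minimal FASTA record,
    folding the sequence at 60 columns per line."""
    lines = [sequence[i:i + 60] for i in range(0, len(sequence), 60)]
    return ">" + identifier + "\n" + "\n".join(lines)

print(fasta_shim("ATGC" * 20, "wbs_contig"))
```

Shims like this carry no science of their own, which is why they are cheap to write but, as the WBS notes later admit, still make up a large fraction of a real workflow's services.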

  • The devil is in the detail: experiment provenance; simple workflows; descriptions in biological language; workflows for automagical execution (implicit iteration, generous typing); debugging and rerunning; provenance logs; simple classifications of services; expressive ontologies to match up services automatically; descriptions for automatic service execution and fault management.

  • Freefluo workflow enactor core: a Scufl language parser plus pluggable processors (PlainWebService, Soaplab, LocalApp, BioMOBY, SeqHound and BioMART processors), driven from the Taverna workbench via the enactor.
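The arrangement above, a service-agnostic core dispatching to per-protocol processor plugins, can be sketched as follows (class names and the registry are illustrative, not the Freefluo API):

```python
class Processor:
    """Base class: each processor knows how to invoke one kind of service."""
    def invoke(self, inputs: dict) -> dict:
        raise NotImplementedError

class LocalAppProcessor(Processor):
    """Runs a local callable in place of a remote service."""
    def __init__(self, func):
        self.func = func

    def invoke(self, inputs: dict) -> dict:
        return {"value": self.func(**inputs)}

# Registry mapping processor kinds to classes; a real enactor would also
# register web-service, Soaplab, BioMOBY, ... processors here.
PROCESSOR_TYPES = {"local": LocalAppProcessor}

def build_processor(kind: str, *args) -> Processor:
    """Factory used by the workflow parser when it meets a processor node."""
    return PROCESSOR_TYPES[kind](*args)

p = build_processor("local", lambda x: x.upper())
print(p.invoke({"x": "atgc"}))
```

The design point is that the core only ever sees the `invoke` contract, so adding support for a new service protocol means adding one entry to the registry, not changing the engine.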

  • [Diagram legend: yellow = Soaplab service; green = WSDL web service]

  • Scientists are from Venus and Computer Scientists are from Mars

    They have different needs and motivations.

  • Mars vs Venus. Not my problem: let's solve this other problem, which isn't your problem but is fun and leads to interesting software. Over-complication: let's solve this harder problem rather than take the easier route that solves your problem (the Hendler principle: a little semantics goes a long way). Size matters: well, it works for my toy test set that I synthesised. Mother knows best: tell us what you want, then go away and we will build it for you. Fin: you can't use it until it's finished. Suits me: I can understand it; I just need to train you to be just like me.

  • Venus vs Mars. The parent principle: repeating the same old mistakes despite our experiences; simplifications, hackery and monoliths now store up trouble down the road. It works 'cos I say so: it works in my application/hack, thus it is good. Short-termism: it just about holds together to get the results for my paper; let's hope the PhD student doesn't leave... You have to invest now for the future. Isolationism: it doesn't matter if only I can understand what I am doing, no one else will want to know. Oh yeah?

  • [Diagram: biology (biologists, science) meets computer science (bioinformaticians, services, tools and middleware) through e-Science and bioinformatics: conversation, respect, understanding, compromise, collaboration.]

  • "You have been working with us too long - I understood you perfectly." Mike Sternberg, Head of Structural Bioinformatics Group and Director of Imperial College Centre for Bioinformatics.

  • Thanks to: myGrid (Chris Wroe, Katy Wolstencroft, Tom Oinn, Antoon Goderis, Peter Li, Anil Wipat); WBS (Hannah Tipney, May Tassabehji); Graves' Disease (Claire Jennings); Integrative Biology (David Gavaghan).

    45 minutes. How we learnt to love our scientists.

    Summary: An in silico experiment is a procedure that uses computer-based information repositories and computational analysis to test a hypothesis, derive a summary, search for patterns, or demonstrate a known fact. The myGrid project has developed open service-based middleware to support the construction, management and sharing of data-intensive in silico experiments in biology. Biology domain services are coordinated by workflows; middleware services are coordinated by e-Science patterns and a common information model. The e-scientific method is supported by provenance management and change notification. myGrid has been successfully used by biologists in the field. We present myGrid through its real use in gene alert and characterisation in a Williams-Beuren Syndrome investigation.

    We are on our THIRD version. The project has been running since Oct 2001 and will conclude June 2005. We started pretty well from scratch, building on standards and others' tools where possible.

    Bioinformatics is a science that seeks to understand biology using information technology and computing. Bioinformaticians develop software and use it to store, organize, search, and manipulate data gathered by experimental biologists. They create simulations, models and predictions to help understand the mechanisms of living organisms. An example of a computer model is the amino acid sequence of a protein, which can be modeled as a sequence (string) of letters (characters), with each character corresponding to an amino acid. Sequences in the form of strings can easily be stored in computers, and software can be written to compare and manipulate these strings. A complex data model is the Biomolecular Interaction Network Database (BIND), a database that stores records of interactions between molecules. A BIND record stores the properties of each interacting molecule and evidence that the interaction occurs.
Elfin-like face, broad forehead, wide-set eyes, button nose, flat nasal bridge, long philtrum, low-set cheeks, wide mouth, malocclusion of teeth. Mental retardation (IQ 40-100, mean ~60; normal mean ~100). Outgoing personality, friendly nature, charming. Unique cognitive profile. Short stature. Hoarse voice. Infantile hypercalcemia (raised blood calcium levels). Sensitivity to noise.

    Williams-Beuren Syndrome microdeletions reside on chromosome 7q11.23. Patients with deletions fall into two categories: those with classic WBS (* indicates the common deletion) and those with SVAS but not WBS, caused by hemizygous deletion of the elastin gene. A physical map of the region composed of genomic clones is shown, with a gap in the critical region. The myGrid software was used to continue the contig and identify more genes at this locus.

    Identify new, overlapping sequence of interest. Characterise the new sequence at nucleotide and amino acid level.

    The computational process is predicting both the physical map and candidate genes in the WBS critical region. Both of these will later be verified in the lab, to check that they make sense. Genetic maps will then be used against the mutant cell lines, and expression analysis will be done of candidate genes. Identify new, overlapping sequence of interest; characterise the new sequence at nucleotide and amino acid level.

    The computational process is predicting both the physical map and candidate genes in the WBS critical region. Both of these will later be verified in the lab, to check that they make sense. Genetic maps will then be used against the mutant cell lines, and expression analysis will be done of candidate genes. Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc.

    Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc. If you can execute it then it's a service.

    Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc.

    Complex, time consuming process, because ...

    Advantages: specialist human intervention at every step; quick and easy access to distributed services. Disadvantages: a labour-intensive, time-consuming, highly repetitive and error-prone process; a tacit procedure, so it is difficult to share both protocol and results; it does not scale. To automate means hacking up lots of Perl scripts. We are not in the business of building a particular application or virtual database, or of writing specific workflows.

    Reuse requires description. Business driven. 2D gels separate a sample according to mass and pH, so that (hopefully) each spot contains one protein. Blanca Rodriguez, Oxford: Cardiac Vulnerability to Acute Ischemia. Lesson one: the grid is only part of the overall... Construct in silico experiments (a.k.a. workflows or protocols); find and adapt others'. Workflows to build pipelines linking public services; web services based. Manage the experiment lifecycle. Open community: ad hoc exploratory, investigative workflows for individuals from no particular a priori community. Data intensive, upstream analysis. Pipelines: experiments as workflows.

    Open services: the services are not ours. Open model: there is no standardised model of biology.

    Distributed computing, personalisation, provenance and data management, workflow enactment, DQP, event notification: a virtual lab workbench, a toolkit which serves life science communities.

    Construct in silico experiments (a.k.a. workflows or protocols); find and adapt others'. Workflows to build pipelines linking public services; web services based. Manage the experiment lifecycle. Open community: ad hoc exploratory, investigative workflows for individuals from no particular a priori community. Data intensive, upstream analysis. Pipelines: experiments as workflows.

    Open services: the services are not ours. Open model: there is no standardised model of biology.

    Distributed computing, personalisation, provenance and data management, workflow enactment, DQP, event notification: a virtual lab workbench, a toolkit which serves life science communities.

    Service-oriented architectural consolidation. LSIDs for identity throughout the architecture. The information model defines domain-generic experimental entities and metatypes. An e-Science mediator coordinates services; e-Science events and an event bus communicate with the mediator (and other services). So far, we have discussed only the static data model (static in the sense of a class diagram) and described only long-lived, persistent objects. We've not mentioned how the applications relate to the middleware services, nor how the services relate to each other. As an example, workflow enactment creates result data and provenance metadata, which need to be preserved in the repositories. We have chosen to implement this by embedding observer objects for each service (GoF Observer pattern) into the enactor, which respond to workflow lifecycle events. We are now generalising these workflow events into more general e-Science events that can be generated by different services. We will coordinate all these events via an e-Science mediator object (GoF Mediator pattern). We anticipate representing our e-Science process explicitly, by modifiable e-Science process patterns (representation not yet defined). We plan to use our existing notification service as an event bus for replicated services and mediators.

    Over 1000 services. Wrap up services. Registration: the ability to register services and workflows within Taverna so that others in the organisation know they exist. Practically all the services are remote and third party. Services are changeable and unreliable, so redundant services are essential. WSDL in the wild is poor. Automated annotation is the ideal, and tricky. Say something about standard techniques and semantic technologies, and about using everyone else's stuff.
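The Observer arrangement described above, where the enactor fires workflow lifecycle events and embedded observers such as a provenance recorder react to them, can be sketched like this (names are illustrative, not the myGrid API):

```python
class Enactor:
    """Emits workflow lifecycle events to any subscribed observers."""
    def __init__(self):
        self._observers = []

    def subscribe(self, observer):
        self._observers.append(observer)

    def fire(self, event: str, payload: dict):
        # GoF Observer: the enactor does not know what observers do,
        # it only delivers the event to each of them.
        for obs in self._observers:
            obs.notify(event, payload)

class ProvenanceRecorder:
    """Observer that records every event it sees, provenance-log style."""
    def __init__(self):
        self.log = []

    def notify(self, event: str, payload: dict):
        self.log.append((event, payload))

enactor = Enactor()
recorder = ProvenanceRecorder()
enactor.subscribe(recorder)
enactor.fire("workflow_started", {"workflow": "wbs_characterisation"})
enactor.fire("workflow_completed", {"outputs": 3})
print(len(recorder.log))  # both lifecycle events were recorded
```

Generalising this is exactly the step the notes describe: once the events are decoupled from the enactor, the same stream can feed a metadata store, a monitor view, or a mediator on an event bus.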

    Say anything about the SOA aspects?

    Metadata driven. I3C/OMG standard -> W3C. Each database on the web has different policies for assigning and maintaining identifiers, dealing with versioning etc., and a different mechanism for retrieving an item given an ID. LSIDs are designed to harmonise the retrieval of data and guarantee its immutability. urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2. Authorities, resolvers and clients. Everything in myGrid has an LSID. We have an mIR authority and data and metadata resolvers; we can use third-party clients. It's how we can link stuff together with RDF!!
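The LSID URN quoted above has the shape urn:lsid:authority:namespace:object, with an optional trailing revision. A minimal sketch of parsing it (illustration only; a real resolver would validate against the full LSID specification):

```python
def parse_lsid(lsid: str) -> dict:
    """Split an LSID URN into its authority, namespace, object and
    optional revision components."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + lsid)
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) > 5 else None,
    }

print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2"))
```

The authority component is what a client uses to locate a resolution service; the immutability guarantee means that the same LSID must always resolve to byte-identical data, which is why new revisions get new identifiers.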

    Identify the type and format of data so that it can (only) be input to type-compatible viewers, services and workflows. Do not restructure legacy data types, as BioMOBY does. The service then becomes an organisational layer. We want to share descriptions, but not much else.

    It's a unit of publishing rather than a unit of functionality. The fields in red are those with a controlled vocabulary/ontology.

    Plug-ins and LSID assignment and resolution not shown for clarity.

    Taverna ingests a Scufl doc, which can include: explicit retry information (number of retries, wait time between retries), explicit alternate services, and templates for metadata to be generated during workflow execution. In the future these are likely to be split up. The data objects that enter the workflow, i.e. the input data, can either come from the data (or metadata?) store, in which case they will already have LSIDs, or be supplied by the user, in which case they are allocated LSIDs and stored in the data store. A "knowledge template" for metadata generation: in the initial prototype (using Ouzo) these are part of the Scufl doc, but this could have changed with the new plug-in architecture. There is probably some knowledge that is specific to the context of using a service, but it might also be useful to get knowledge templates from a semantic registry (though possibly not at the moment). A user-idiosyncratic service-equivalence list for service substitution, and user configuration of how many retries Taverna should make on which parts of the workflow, are currently part of the Scufl doc; possibly adding user choices to override the defaults in the workflow could be a useful addition. "Organisational" provenance: a user identification plug-in is a means by which users can identify themselves, their project, and the data and metadata stores they wish to use, so that all this can be added to the generated data and metadata. Taverna, the workflow design workbench, also ingests a list of available services. This can be a simple scavenged list, or a view on a semantic registry. The emphasis is that Taverna supports users in browsing the external world of autonomous services; users can personalise Taverna by pointing at the semantic registry of their choice, or scavenging from their chosen locations.

    Taverna has access to: 1. an LSID resolution service; 2. a semantic registry of some sort (??); 3. a metadata store; 4. a data store. The concepts associated with the LSIDs can be found through the metadata interface, so Taverna can find out what they are if it needs to. The services executed, as described in the Scufl doc, also have their inputs, outputs and purpose exposed as concepts through the registry. (Aside: are services also given LSIDs? Are WSDL docs just another kind of data?) It is possible for services to have multiple descriptions: the WSDL doc auto-generated from the service by Axis, a WSDL doc stored in a registry, or a WSDL doc that I have stored locally. If we want to be able to identify that the same service instance was used, even though it was found in a different way, then there will need to be distinct identifiers for the WSDL docs and the services.

    Taverna spits out: 1. Results (intermediate and ultimate) into an XML data store as XML (implemented as a relational database with an OGSA-DAI interface on top of MySQL? Could it be another form?). Taverna works with the LSID authority to allocate LSIDs to intermediate and ultimate results. This has to be done incrementally to cope with long-running workflows; as LSID data values are immutable, I assumed that allocation and storage would have to be done at the same time. The final results could be collected into an XML structure, but I am not sure of the implementation details. 2. Metadata that covers (a) the data and process provenance graph, (b) the knowledge graph, which is grounded to the data/process graph, and (c) some link to the overall experiment provenance??? During workflow execution, Taverna/Freefluo generates a series of events; the provenance plugin listens to these events and writes RDF statements to the metadata store (there will also be some provenance in the data store). The statements generated cover (a), (b) and (c), based on what is ingested. 3. Taverna also spits out monitoring information about the current progress. This could be seen as just a different view on the generated series of events: the monitoring tries to provide a view of the current status from the events, while the provenance provides an audit trail of what has happened and higher-level information of particular interest to users.

    I also notice tabs on the Graves results, so Taverna also has different viewers for the different results? Taverna creates a sophisticated viewer for its results. The window contains a tab for each workflow output. The MIME types are used to identify appropriate viewers (users can select from the list of possible viewers, e.g. raw data, or use an XML viewer for text/xml values). Lists of values are handled, and can be viewed individually or in tabular form. Matthew Pocock did much of the work on this; I think the initial motivation was viewing graphical results and coping with list structures. At the moment it does not really support Hannah's problem, where she has 35 workflow outputs which are different information about the same biological sequence, and she wants an integrated view of what the results as a whole mean. NEED TO CALCULATE NUMBER OF EXONS IDENTIFIED. Four workflow cycles totalling ~10 hours.

    Identify new, overlapping sequence of interest. Characterise the new sequence at nucleotide and amino acid level.

    The computational process is predicting both the physical map and candidate genes in the WBS critical region. Both of these will later be verified in the lab, to check that they make sense. Genetic maps will then be used against the mutant cell lines, and expression analysis will be done of candidate genes. Saved time, increased productivity.

    Benchmark: first run through of two iterations of workflows. Reduced the gap by 267,693 bp at its centromeric end. Correctly located all seven known genes in this region. Identified 33 of the 36 known exons residing in this location.

How many domain services do we have? Numbers vary, but between 200 and 300 depending on how you count EMBOSS.
How many of these are "shim" services? About 20 for GD and WBS.
How many services are in the WBS workflows and the Graves disease workflows? 19 domain and 11 shim in WBS.
How long does a WBS workflow run for? Completely dependent on the size of the sequence, and on whether the prediction tools predict anything; from 10 minutes to 2 hours.
How long does an average service run for, and how often do they fail? No such thing as an average service! NIX can take 3-6 hours if it's busy; EMBOSS tools can take a matter of seconds. NCBI BLAST is a pain.
How often do the wrappers change? Too often! Had to redo a couple already (obviously only a problem if we don't have the licence and are screen-scraping), e.g. BLAST broke because something had changed and the BioPython module we were using hadn't kept up!
How much time does it really save Hannah? Up to 2 weeks is a good bet. May reckons longer, but the user still has to find the relevance in the data.

Saved time, increased productivity


Results of WBS workflows

Related to control flow too. There is some control flow in Scufl. A specific data model using a generic tool: where do you do the specifics, and how? Do I need a block diagram of myGrid?

Previously we displayed the results as a list. Now our results display is similar to the Scavenger Pane of Taverna; the only difference is that we do not group services in directories with respect to their types (Soaplab, WSDL etc.), we only do the colour coding with respect to types. There is a tree view of service descriptions in Pedro, though. When the user wants to see details of one particular resulting service, he/she presses the Annotator button and Pedro launches with the tree view. Do you think we should also have a tree view in the Feta Result Panel? I'll do some experiments on that. Regards, Pinar

Multiple stakeholders
Developers of workflows
Taverna Workbench
Single user
Building and running

Multiple stakeholders
Developers of workflows (Workbench)
Users of workflows (Portal)
Service providers (Debug/Monitor).

Multiple stakeholders
Users of workflows
Collaborating
Sharing
GridSphere portal framework
ImageBLAST

Multiple stakeholders
Service providers
Debug/Monitor/Repair, Publish tools
Software engineers!
Web services in the wild suck


Showstoppers for practical adoption are not technical showstoppers:
Can I incorporate my favourite service?
Can I manage the results?

URLs and Soaplab endpoints
Introspection
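The "can I incorporate my favourite service?" point comes down to classifying what kind of endpoint a user pastes in (a WSDL URL to introspect, or a Soaplab endpoint). The sketch below is illustrative only, assuming simple URL heuristics; it is not how Taverna actually classifies endpoints, and the example URLs are hypothetical.

```python
def classify_endpoint(url):
    """Guess the processor type for a user-supplied service endpoint."""
    lowered = url.lower()
    if lowered.endswith(".wsdl") or "?wsdl" in lowered:
        return "wsdl"      # introspect operations from the WSDL document
    if "/soaplab/" in lowered:
        return "soaplab"   # Soaplab exposes a uniform derived interface
    return "unknown"       # fall back to asking the user

print(classify_endpoint("http://example.org/services/Blast?WSDL"))    # wsdl
print(classify_endpoint("http://example.org/soaplab/services/blast")) # soaplab
```

In practice the harder part is what the text calls introspection: reading the operation names, inputs, and outputs out of the endpoint's description once its type is known.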

These have major consequences for myGrid, and are maybe different for business workflows??

Context of myGrid
Not so much downstream data capture
Collaboration for people
Legacy of the openness tenet.

The WBS workflows and the Graves disease workflows took over a year. Putting the workflows together is easy, but they are only as good as the services they link together: no replicas, unreliable, poor interfaces, inaccessible. This is the rate-limiting step.

Activation energy versus reusability trade-off: a lack of available services means levels of redundancy can be limited, but once a service is available it can be reused for the greater good of the community.

Licensing of bioinformatics applications means they can't be used outside of the licensing body; no licence means access via third-party websites.

Instability of external services: they are research-level, and reliant on other people's servers. Taverna can retry or substitute before graceful failure.

Shims.

A network effect: seek service-provider partners for mutual satisfaction. GSOH required.
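The "retry or substitute before graceful failure" policy mentioned above can be sketched roughly as follows. This is a minimal illustration, not Taverna's actual fault-handling code: the function name, the retry count, and the stand-in service callables are all assumptions.

```python
def invoke_with_fallback(primary, alternates, retries=3):
    """Try the primary service a few times, then each alternate in turn;
    raise only when every option is exhausted (graceful failure)."""
    last_error = None
    for service in [primary] + list(alternates):
        for _ in range(retries):
            try:
                return service()
            except Exception as err:
                last_error = err  # remember why, keep trying
    raise RuntimeError("all services failed") from last_error

# A stand-in for an unreliable service that succeeds on its third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("server busy")
    return "alignment result"

print(invoke_with_fallback(flaky, []))  # alignment result
```

The substitution step only makes sense when an alternate service is known to be equivalent, which is one of the things the semantic service descriptions discussed later are for.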

Here is our description of the BLAST service presented earlier. This is a relatively simple description. We have put nothing at the service level at all, although in practice we probably would have done so.

The description is ontological in basis, so it is actually richer than it appears here. Hence pairwise-local-aligning also comes under local alignment. It should probably also include similarity search, or even homology search.
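The point that an ontological description is "richer than it appears" is that a service tagged with a specific concept also matches queries for any subsuming concept. A toy sketch, with an illustrative is-a chain built from the concepts named in the text (the dictionary-based hierarchy is an assumption, not the real ontology):

```python
# Illustrative is-a links between the concepts mentioned above.
IS_A = {
    "pairwise local aligning": "local alignment",
    "local alignment": "similarity search",
    "similarity search": "homology search",
}

def ancestors(concept):
    """All concepts subsuming the given one, following the is-a links."""
    result = []
    while concept in IS_A:
        concept = IS_A[concept]
        result.append(concept)
    return result

def matches(service_concept, query_concept):
    """A service matches a query for its own concept or any ancestor."""
    return (query_concept == service_concept
            or query_concept in ancestors(service_concept))

print(matches("pairwise local aligning", "similarity search"))  # True
```

A real reasoner would handle multiple parents and defined classes, but the same subsumption principle applies.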

    Hackery will get you in the end

    This is because machines cannot think and people can. Simplify?

Bioinformaticians AND machines: mismatches

You still need bioinformaticians!
Constantly balancing, putting the complexity at the right level

A Scufl workflow from the Graves disease case study. This workflow uses a number of different Scufl processor types, e.g. WSDL Web service operations (green boxes) and Soaplab services (yellow boxes). The overall workflow input is displayed at the top and the outputs along the bottom of the diagram.

Take-up outweighs smarts. It doesn't matter how simple the solution is; what counts is whether everyone can use it. If it helps a bit, then it helps a bit. It doesn't matter how smart and complete the solution is; it doesn't count if no one can use it.
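The elements named in that caption (processors of different types, workflow inputs at the top, outputs at the bottom, and data links between them) can be captured in a toy data model. This is a sketch only: the field names are illustrative and do not follow the real Scufl XML schema, and the processor and port names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Processor:
    name: str
    kind: str  # e.g. "wsdl" (green boxes) or "soaplab" (yellow boxes)

@dataclass
class Workflow:
    inputs: list                                 # shown at the top
    outputs: list                                # shown along the bottom
    processors: dict = field(default_factory=dict)
    links: list = field(default_factory=list)    # (source, sink) data links

    def add(self, proc):
        self.processors[proc.name] = proc

    def link(self, source, sink):
        self.links.append((source, sink))

wf = Workflow(inputs=["sequence"], outputs=["report"])
wf.add(Processor("emboss_seqret", "soaplab"))
wf.add(Processor("blast", "wsdl"))
wf.link("sequence", "emboss_seqret")
wf.link("emboss_seqret", "blast")
wf.link("blast", "report")
print(len(wf.processors), len(wf.links))  # 2 3
```

Drawing such a model top-to-bottom, with processors coloured by `kind`, gives exactly the kind of diagram the caption describes.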

Building the Semantic Grid means building it for the three different knowledge stakeholders. The Grid means what I say it means.