Tom Oinn, [email protected]


In general a grid system is, or should be: “A collection of resources able to act collaboratively in pursuit of an overall objective”

A life science grid is therefore: “A collection of resources able to act collaboratively to solve a problem in the life science domain”

•Massive diversity of
  •Information classes
  •Services
  •Data
  •Problems
•Relatively small data sizes
•Relatively small computational load
•Challenge is complexity and heterogeneity
•Much scientific work is exploratory
  •Environment must be flexible and easy to reconfigure
  •Environment must provide facilities for provenance capture

•Existing diverse services
  •Web based, SOAP services, custom protocols such as BioMoby etc.
•Existing data resources
  •Relational, unstructured flat file, XML
  •May or may not be exposed through some kind of service interface, e.g. SRS, BioMart
•Existing user communities
  •Large, well funded service and research projects with substantial IT support
  •Small groups with no IT support, little funding but interesting problems
  •Experts in their domain
  •Little or no experience with distributed computing

•Most bioinformaticians are not computer scientists
•Generally not supported by dedicated CS groups
•Need to allow these users to make use of their existing expertise but remove concerns such as:
  •Parallelism
  •Distributed programming
  •Fault recovery
  •Job dispatch and submission
  •Provenance capture
  •Logging and auditing

“A collection of existing legacy and novel tools and databases exposed through a variety of technologies able to act collaboratively to solve a problem posed by an ‘IT naïve’ user in the life science domain across the public internet and with little or no technological support and as inexpensively as possible.”

Users typically have no control over services (provided by 3rd parties) so create a client side integration platform.

Should be accessible to an unsupported PhD student with standard networking, a three year old PC and no dedicated IT support.

http://taverna.sf.net

A ‘super client’ to a variety of disparate services on both intra-net and inter-net

•Project homepage: http://taverna.sf.net
•myGrid project page: http://www.mygrid.org.uk
•OMII-UK home: http://www.omii.ac.uk
•Alberto’s Taverna + EBI mini tutorial: http://www.ebi.ac.uk/Tools/webservices/tutorials/taverna

Taverna is:
•A workflow language based on a dataflow model.
•A graphical editing environment for that language.
•An invocation system to run instances of that language on data supplied by a user of the system.

When you download it you get all this rolled into a single piece of desktop software

The enactor can be run independently of the GUI
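As a rough illustration of what "a workflow language based on a dataflow model" means, the sketch below runs a tiny workflow in which each processor fires once all of its inputs are available. It is not Taverna's language or enactor: `run_workflow`, the link triples and the toy processors are all hypothetical names made up for this example.

```python
from graphlib import TopologicalSorter


def run_workflow(processors, links, inputs):
    """Run a tiny dataflow (illustrative only, not Taverna's enactor).

    processors: name -> function taking one keyword argument per input port
    links:      (source, target_processor, target_port) triples, where source
                is either a workflow input name or a processor name
    inputs:     workflow input name -> value
    """
    # A processor depends on every upstream processor that feeds one of its ports.
    deps = {name: set() for name in processors}
    for source, target, _port in links:
        if source in processors:
            deps[target].add(source)

    values = dict(inputs)  # data produced so far, keyed by producer name
    for name in TopologicalSorter(deps).static_order():
        kwargs = {port: values[src] for src, tgt, port in links if tgt == name}
        values[name] = processors[name](**kwargs)
    return values


# Hypothetical three-step workflow over a protein sequence.
processors = {
    "reverse": lambda seq: seq[::-1],
    "length": lambda seq: len(seq),
    "report": lambda text, n: f"{text} ({n} residues)",
}
links = [
    ("sequence", "reverse", "seq"),
    ("sequence", "length", "seq"),
    ("reverse", "report", "text"),
    ("length", "report", "n"),
]
result = run_workflow(processors, links, {"sequence": "MKTAY"})
```

The user only declares which output feeds which input; the execution order falls out of the data dependencies, which is the property that lets a workflow system hide concerns such as parallelism and job dispatch from the workflow author.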

Taverna can interoperate with the following by default:
•SOAP based web services
•Biomart data warehouses
•Soaplab wrapped command line tools
•BioMoby services and object constructors
•Inline interpreted scripting (Java based)

Other service classes can be added through an extension point (but you probably don’t need to)

[Figure: asynchronous service workflow. Document builders → Service invocation (creates job) → Polling loop (check status, fail if not ready) → Get results]

•Add service to services list by pointing Taverna to Web Service Description Language (WSDL) document online

•Taverna inspects WSDL, extracts operations

•Add operations to workflow, right click to automatically add document builders and splitters for doc/literal style services

•Use nested workflow to define polling logic, sub-workflow fails, waits and retries if data is not ready
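The polling logic in the last bullet (fail, wait, retry until the data is ready) can be sketched in a few lines. This is a minimal sketch in plain Python, not a Taverna workflow: `NotReady`, `poll_until_ready` and `FakeService` are hypothetical names, and the fake service simply becomes ready after a fixed number of status checks.

```python
import time


class NotReady(Exception):
    """Raised by the status check while the remote job is still running."""


def poll_until_ready(check_status, get_results, job_id, retries=5, delay=0.0):
    """Retry the status check until it stops failing, then fetch the results."""
    for _ in range(retries):
        try:
            check_status(job_id)   # fails (raises) if the job is not ready
        except NotReady:
            time.sleep(delay)      # wait before the next attempt
            continue
        return get_results(job_id)
    raise TimeoutError(f"job {job_id} not ready after {retries} attempts")


class FakeService:
    """Stand-in for an asynchronous remote service (assumption for the demo)."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def submit(self, document):
        return "job-1"             # service invocation creates a job

    def check_status(self, job_id):
        self.calls += 1
        if self.calls < self.ready_after:
            raise NotReady(job_id)

    def get_results(self, job_id):
        return f"results for {job_id}"


service = FakeService(ready_after=3)
job = service.submit("<input document>")
results = poll_until_ready(service.check_status, service.get_results, job)
```

Treating "not ready" as a failure of the status step keeps the retry policy (how many attempts, how long to wait) separate from the service calls themselves.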

*SOAP is the Simple Object Access Protocol - http://www.w3.org/TR/soap/ & http://www.w3.org/TR/wsdl
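To make the "inspects WSDL, extracts operations" step concrete, here is a minimal sketch that lists the operations declared in a WSDL 1.1 document using only the Python standard library. It is not how Taverna itself parses WSDL; `extract_operations` and the `EXAMPLE_WSDL` document (including its service and operation names) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of WSDL 1.1 elements.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}


def extract_operations(wsdl_text):
    """Return {portType name: [operation names]} for a WSDL 1.1 document."""
    root = ET.fromstring(wsdl_text)
    return {
        port_type.get("name"): [
            op.get("name")
            for op in port_type.findall("wsdl:operation", WSDL_NS)
        ]
        for port_type in root.findall("wsdl:portType", WSDL_NS)
    }


# Hypothetical, heavily trimmed WSDL for an asynchronous search service.
EXAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="Demo">
  <portType name="SequenceSearch">
    <operation name="runSearch"/>
    <operation name="checkStatus"/>
    <operation name="getResults"/>
  </portType>
</definitions>"""

operations = extract_operations(EXAMPLE_WSDL)
```

Each extracted operation corresponds to something a user could drag into a workflow; a real WSDL also carries the message and type definitions from which input documents are built.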

[Figure: Soaplab services in the Taverna services list. Callouts: Soaplab server in services list; individual tool within category; Soaplab services support rich descriptive metadata]

•Soaplab services are added to the services palette by pointing Taverna at the root of the Soaplab installation.

•Individual services within that server are categorized and displayed within categories

•Services support polling and provide links to metadata directly within Taverna

http://www.ebi.ac.uk/Tools/webservices/soaplab/guide

BioMoby provides semantic description of services

Taverna can use this to assist in the service composition at design time

All this provided by the Moby team – Taverna’s extension architecture allows third party developers to contribute in a loosely coupled way

•Service discovery
  •Free text search over ‘known’ services.
  •Semantic search over service repository, relies on manual service annotation and submission of those annotations to the repository.
•Provenance tracking
  •Lineage tracking of result data.
  •Automatic semantic annotation of data from service annotations.
  •Possible as the workflow engine creates a ‘managed environment’ with an overview of all data movement.
•Result visualization
  •Common renderers included in base distribution include 3d structure, images, graph rendering
•Extensibility
  •New service classes
  •New renderer types
  •New UI elements
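The provenance point above rests on the engine seeing every piece of data that moves between services. A minimal sketch of that idea follows, assuming a hypothetical `ManagedEnvironment` class that assigns an identifier to each value and records which service produced it from which inputs. It does not reflect Taverna's actual provenance model.

```python
import itertools


class ManagedEnvironment:
    """Routes all data through one place so lineage can be recorded."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.values = {}    # data id -> value
        self.derived = {}   # data id -> (service name, [input data ids])

    def add_input(self, value):
        data_id = f"d{next(self._ids)}"
        self.values[data_id] = value
        return data_id

    def invoke(self, service_name, func, *input_ids):
        """Call a service on stored data and record where the result came from."""
        result = func(*(self.values[i] for i in input_ids))
        out_id = self.add_input(result)
        self.derived[out_id] = (service_name, list(input_ids))
        return out_id

    def lineage(self, data_id):
        """Return the ids this item was derived from, nearest ancestor first."""
        seen, queue = [], [data_id]
        while queue:
            current = queue.pop(0)
            _service, parents = self.derived.get(current, (None, []))
            for parent in parents:
                if parent not in seen:
                    seen.append(parent)
                    queue.append(parent)
        return seen


env = ManagedEnvironment()
raw = env.add_input("acgt")
upper = env.invoke("to_upper", str.upper, raw)
final = env.invoke("reverse", lambda s: s[::-1], upper)
ancestors = env.lineage(final)
```

Because services never exchange data directly, the user gets lineage for every result without adding any logging code to the workflow.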

Funded through the Open Middleware Infrastructure Institute (OMII-UK) as part of the myGrid project run by Carole Goble

Four years old, funding secured through 2008 and beyond.

Development team at Manchester & Hinxton, UK

Wide group of ‘friends and allies’ across the world particularly within UK eScience

Implemented in Java, released under LGPL licence.

Science varies widely in scale both in space (CPU cycles required, storage, numbers of services etc) and time (duration of collaborations, stability of VO membership)

Current grid infrastructure is focused on projects with large spatial and temporal scale

Does this existing work map well to scientific problems with different characteristics, especially different temporal characteristics?

What about security…?

A workflow can access multiple resources. These resources can have arbitrary security constraints.

It is likely that a given workflow requires more than one principal to be available to complete.

How can we make multiple security agents available to the workflow engine in a principled fashion?

Define the basic unit of a virtual experiment or fast virtual organization to map directly to a peer group within a peer to peer framework

Peer group contains a workflow instance along with any resources required to enact that instance including arbitrarily many security agents, data stores, metadata stores etc.

Services accessed by the workflow may (and usually will) exist outside of the peer group.

[Figure: a peer group (Virtual Experiment) containing a workflow instance, security agents for User A and User B, and a data manager, connected to external tools, data and services]

A Virtual Experiment (VE) is created by the construction of a new peer group within the P2P framework

Resources such as workflow engines, data managers and security agents exist as factory services.

Each factory can construct a limited version of itself:
•Workflow engines with specific workflow definitions loaded
•Data managers with specific levels of storage space
•Security agents with policies to restrict full use of credentials

These limited proxy objects connect to the peer group.

This is a secured operation but, as there is no delegation, existing security mechanisms are adequate to get this far.

Factories may be on the intranet or internet (most likely for workflow services) or on the user’s workstation, PDA or cellphone (for security agents).

A VE becomes collaborative when more than one user can access the objects within the peer group.

A VE uses collaborative security when more than one user inserts a security agent into the peer group.
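The factory and peer group ideas above can be sketched as follows. This is a toy model under stated assumptions, not the design of any real P2P framework: `SecurityAgentFactory`, `LimitedSecurityAgent`, `PeerGroup` and the service names are all hypothetical, and a "policy" is reduced to a set of service names the agent may be used for.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LimitedSecurityAgent:
    """Proxy placed in the peer group: carries a policy, not the credentials."""
    owner: str
    allowed_services: frozenset

    def may_access(self, service):
        return service in self.allowed_services


class SecurityAgentFactory:
    """Runs on the user's own device and keeps the full credentials there."""

    def __init__(self, owner, credentials):
        self.owner = owner
        self._credentials = credentials  # never handed to the proxy

    def create_limited(self, allowed_services):
        return LimitedSecurityAgent(self.owner, frozenset(allowed_services))


@dataclass
class PeerGroup:
    """One Virtual Experiment: a named collection of limited proxy objects."""
    group_id: str
    agents: list = field(default_factory=list)

    def join(self, agent):
        self.agents.append(agent)

    def owners_able_to_access(self, service):
        return {a.owner for a in self.agents if a.may_access(service)}

    def uses_collaborative_security(self):
        # True once agents from more than one user are present.
        return len({a.owner for a in self.agents}) > 1


ve = PeerGroup("ve-demo")
ve.join(SecurityAgentFactory("User A", "secret-a").create_limited({"blast"}))
solo = ve.uses_collaborative_security()
ve.join(SecurityAgentFactory("User B", "secret-b").create_limited({"biomart"}))
```

The point of the split is that the full credentials stay with the factory on the user's side; only a restricted proxy enters the shared peer group.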

Note that the peer group structure also allows multiple views on the same VE as objects can exist in more than one peer group. For example, you could split the workflow instance into a monitoring and steering component and give some users access to a peer group containing both and others to one containing only the monitoring part.

The peer group has a unique identity which can be used to discover or register it with any registry service available.
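The monitoring and steering split described above amounts to exposing two components over the same workflow state and placing them in different groups. A minimal sketch, assuming hypothetical `WorkflowMonitor` and `WorkflowSteering` classes and modelling each peer group as a plain dictionary of the objects it contains:

```python
class WorkflowMonitor:
    """Read-only view of a running workflow instance."""

    def __init__(self, state):
        self._state = state

    def status(self):
        return self._state["status"]


class WorkflowSteering:
    """Control view of the same workflow instance."""

    def __init__(self, state):
        self._state = state

    def pause(self):
        self._state["status"] = "paused"


# Both components share the state of a single workflow instance.
state = {"status": "running"}
monitor = WorkflowMonitor(state)
steering = WorkflowSteering(state)

# Two peer groups over the same Virtual Experiment: the monitor object
# exists in both, the steering object in only one.
operators_group = {"monitor": monitor, "steering": steering}
observers_group = {"monitor": monitor}

operators_group["steering"].pause()
observed = observers_group["monitor"].status()
```

Members of the observers group see the effect of steering actions but hold no object through which they could perform one.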

Taverna2 under development, delivery by the end of 2007

Rewrite of Taverna to support, amongst other things:
•Integration with grid technologies through a set of new extensibility points
•Transient VO management (short lived virtual organizations, 20 second upwards lifetime!)
•More sophisticated computational model
  •Massive scalability, pipelining of nested token streams, single threaded execution model, transparent reference passing architecture
•Monitoring and steering of running processes with arbitrary granularity through an extension point

•Implement extensions to interface to your GRID
•Get a free and well supported rich client portal for non expert users
•Access otherwise out of reach user communities

•If you have a grid with resources that our community could use
  •Talk to us, tell us about it
  •Write a plugin for its resource broker, data system or security model
•If you have a scientific community who wants to access such resources
  •Again, please let us know
  •We can provide on site training
  •We are always interested in new application areas for our work
•I can be contacted at [email protected], or for more general discussion please join the mailing lists linked from http://taverna.sf.net

Please see http://www.mygrid.org.uk/wiki/Mygrid/Acknowledgements for the most up to date list of acknowledgements