SIMDAT SIMDAT
IXodus: a knowledge discovery process based on the SIMDAT-Pharma GRID technologies
Richard Kamuzinzi, Université Libre de Bruxelles – Bioinformatics
June 5–7, 2007, World Wide Workflow, GRID ASIA 2007, Singapore
SIMDAT Facts
• EU Information Society Technologies (IST)
• GRID Project
• Duration: 4 years
• Start date: September 1st 2004
• 26 partners
Scope
• Product and process development (automobiles, aircraft, drugs, meteorological services):
  – Is complex
  – Involves several independent organizations at different locations
• Managing this complexity at a single site is too expensive => cost/risk sharing with partners => GRID
Strategic objectives
• to test and enhance Data Grid technology for product development and production process design,
• to develop federated versions of problem-solving environments by leveraging enhanced Grid services,
• to exploit Data Grids as a basis for distributed knowledge discovery,
• to promote de facto standards for these enhanced Grid technologies across a range of disciplines and sectors as well as
• to raise awareness of the advantages of Data Grids in important industrial sectors
Project organization (SIMDAT-Pharma)
NEC, GSK, Inpharmatica, ULB, Fraunhofer SCAI-Bio and UKA
IXodus – The scientific problem
• Lyme disease: a significant source of human and animal pathology in temperate areas of the world (identified in the 90s)
• Caused by the bite of a tick of the genus IXodes infected by the pathogenic bacterium Borrelia burgdorferi
• The study of host-parasite interactions is an active research area, as ~20% of ticks have been found to be infected by the bacterium
• IXodus scientific protocol: designed to characterise the genes expressed in the salivary gland of the tick IXodes ricinus at various stages of the host-parasite interaction process
IXodus – Workflow design (1)
• From the IXodus scientific protocol to the IXodus workflow (WF) design, we identify 2 use cases:
1. “New cDNA sequences”: the workflow is fed daily with a batch of nucleic sequences from the systematic sequencing of thousands of salivary gland cDNAs
2. “Databank update”: whenever a new version of a relevant biological databank appears, the core workflow analysis is re-enacted to discover potentially new information
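As a rough illustration of the two use cases above, the sketch below routes both triggers to the same core analysis. Everything in it is an assumption made for illustration: the names `KnowledgeBase`, `core_analysis`, `on_new_cdna_batch` and `on_databank_update`, the sample sequences and the version label `r91` are hypothetical, and the analysis is a stub. The workflow presented here was built in InforSense KDE, not in Python.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical stand-in for the IXodus knowledge DB."""
    sequences: dict = field(default_factory=dict)          # sequence id -> nucleic sequence
    databank_versions: dict = field(default_factory=dict)  # databank name -> last version seen
    runs: list = field(default_factory=list)               # one record per analysis run


def core_analysis(kb, seq_id):
    """Stub for the core workflow analysis; it only records that it ran."""
    kb.runs.append((seq_id, dict(kb.databank_versions)))


def on_new_cdna_batch(kb, batch):
    """Use case 1: store a daily batch and analyse each new sequence."""
    for seq_id, sequence in batch.items():
        kb.sequences[seq_id] = sequence
        core_analysis(kb, seq_id)


def on_databank_update(kb, databank, version):
    """Use case 2: re-enact the core analysis when a databank version changes."""
    if kb.databank_versions.get(databank) == version:
        return False  # same version as before, nothing to re-enact
    kb.databank_versions[databank] = version
    for seq_id in kb.sequences:
        core_analysis(kb, seq_id)
    return True


kb = KnowledgeBase()
on_new_cdna_batch(kb, {"seq1": "ATGC", "seq2": "GGCA"})  # hypothetical sequences
on_databank_update(kb, "EMBL", "r91")                    # hypothetical version label
print(len(kb.runs))
```

The point of the design is that both triggers share one analysis path, so a databank update reprocesses the stored sequences without any new sequencing input.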
IXodus design (2) Use Case 1
[Figure: IX-odus UML 2.0 activity diagram, use case "New cDNA sequences". Swimlanes: Scientist, SYSTEM. Parts: sequences gathering, pre-processing, main analysis. Actions: Provide new cDNA; Compare with IX-odus DB; Build new virtual sequence; DB Annotate group membership; makeBlastN; makeBlastX; makeTBlastX; ORF finder; Motif search; Analysis by domain expert; DB Annotate "success" / "potential new" / "motif found". Datastores: IX-odus, EMBL, UNIPROT/GENPEPT, INTERPRO.]
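The main analysis part of the "New cDNA sequences" activity diagram reads as a cascade of similarity searches with fallbacks. A minimal sketch of such a cascade follows, assuming the branch order BlastN, BlastX, TBlastX, then ORF finder and motif search. The exact branch targets cannot be fully recovered from the diagram, so the mapping of outcomes to labels is an assumption, and the `search`, `orf_finder` and `motif_search` callables are hypothetical stand-ins for the real tools.

```python
def annotate(seq_id, search, orf_finder, motif_search):
    """Return an annotation label for one sequence.

    search(tool, databank, seq_id) -> True when a similar entry is found.
    orf_finder(seq_id) and motif_search(seq_id) -> True when something is found.
    All three callables are hypothetical stand-ins for the real tools.
    """
    if search("blastn", "EMBL", seq_id):
        return "success"
    if search("blastx", "UNIPROT/GENPEPT", seq_id):
        return "success"
    if search("tblastx", "EMBL", seq_id):
        return "analysis by domain expert"
    # Assumed fallback: a motif search only makes sense once an ORF is found.
    if orf_finder(seq_id) and motif_search(seq_id):
        return "motif found"
    return "potential new"


# Hypothetical run: only the BlastX search finds a similar entry.
hits = {("blastx", "UNIPROT/GENPEPT", "seq1")}
label = annotate(
    "seq1",
    lambda tool, databank, seq: (tool, databank, seq) in hits,
    lambda seq: False,
    lambda seq: False,
)
print(label)
```

Passing the tools in as callables keeps the decision logic separate from how each search is executed, which is where a grid service call would plug in.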
IXodus design (3) Use Case 2
[Figure: IX-odus UML 2.0 activity diagram, use case "Databank update". Swimlanes: Scientist, SYSTEM, SYSTEM Administrator. Event processing part: after one week; Update Databank; Send TBlastX notification; Receive notification. Main analysis actions: makeBlastN; makeBlastX; makeTBlastX; ORF finder; Motif search; Analysis by domain expert; DB Annotate "success" / "potential new" / "motif found". Datastores: IX-odus, EMBL, UNIPROT/GENPEPT, INTERPRO.]
IXodus – Implementation
• Workflow technology platform: InforSense™ KDE
• Implementation is tightly coupled with the deployment environment, which is mainly driven by 2 kinds of constraints:
  – GRID approach
  – Semantic Web (SW) approach
IXodus implementation - The test-bed GRID approach
[Figure: test-bed architecture. Three GRIA service nodes, each with NoDynA, EMBOSS & BLAST tools, MRS, Web Service wrappers and an E2E Sec server. An InforSense KDE client with the IXodus knowledge DB, IPRSCAN, BioTools and the plugins BioSense, Semantic Broker and GRIA (E2E Sec client, GRIA client). Semantic-enabled service publication and discovery with OWL-DL reasoning. Site labels: ULB services, EMBL services, NEC Semantic Broker, connected over the Internet.]
Main properties
• Federated data and services with redundancy
• Privacy, AuthZ, AuthN, non-repudiation
• Intellectual property rights (IPR) preservation by traceability (digital signatures)
• User profile management to optimise resource availability
IXodus implementation - The test-bed SW approach
[Figure: the same test-bed architecture (sites ULB, EMBL, NEC), highlighting the Semantic Broker: service advertising, semantic-enabled service publication and discovery, OWL-DL reasoning.]
Main properties
• Semantic-enabled service annotation
• Semantic-enabled service discovery: “Which service instance can operate on the latest version of the EMBL databank?”
• Dynamic update of already annotated services
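The discovery question quoted on this slide can be approximated with a plain filter over published service annotations. The sketch below is only that: the test-bed relies on a Semantic Broker with OWL-DL reasoning, which this code does not reproduce. The `ServiceAnnotation` fields, the instance names and the release numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAnnotation:
    """Hypothetical annotation a service provider publishes to the broker."""
    instance: str          # site-specific service name
    tool: str              # e.g. "blastn"
    databank: str          # e.g. "EMBL"
    databank_release: int  # release the instance currently serves


def discover_latest(annotations, tool, databank):
    """Return the instances of `tool` serving the newest known release of `databank`."""
    candidates = [a for a in annotations if a.tool == tool and a.databank == databank]
    if not candidates:
        return []
    latest = max(a.databank_release for a in candidates)
    return [a.instance for a in candidates if a.databank_release == latest]


# Hypothetical registry content.
registry = [
    ServiceAnnotation("blastn@site-a", "blastn", "EMBL", 90),
    ServiceAnnotation("blastn@site-b", "blastn", "EMBL", 91),
    ServiceAnnotation("blastx@site-a", "blastx", "UNIPROT/GENPEPT", 12),
]
print(discover_latest(registry, "blastn", "EMBL"))
```

Because the answer is computed from the annotations at query time, a provider that updates its annotation after loading a new databank release becomes selectable without any change to the workflow.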
IXodus implementation – InforSense KDE: The complete workflow
IXodus implementation – InforSense KDE: User sequences gathering
IXodus implementation – InforSense KDE: Management of overlapping sequences
IXodus implementation – InforSense KDE: Main analysis flow (bioinformatics tools)
IXodus implementation – InforSense KDE: Service instance selection & launching
IXodus - General benefits
• Workflow tool maturity: designing complex WFs to support demanding problems within a reasonable delivery time is a reality (RWD vs. RAD)
• The WF-on-GRID approach is really valuable and provides the confidence we need to face the data/services “tsunami” in the life sciences… the good news is…
IXodus - General benefits (2)
...thanks to WF technologies, scientists no longer fear the vertiginous “beast” (data/services explosion)…
IXodus – Remaining challenges
• B2A Grids: we still need a precise understanding of the strategic benefits for both sides (“win-win”)
• WF technologies: need a better distinction between “abstract” WF and “operational” WF:
  – How to decouple?
  – Runtime service selection using the concept of rules?
• At design phase: the designer would appreciate a semantic approach to searching for services
• From WF to Service:
  – Partial (∑args) vs. Complete (∑args)
  – Different user profiles
• From WF to UI:
  – At design phase: need to define how WF actors interact with the whole system
• To leverage the WF log in order to generate textual information that would support the writing of scientific papers/notebooks (who, service_name, service_version, database_version, …)
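The last challenge, turning the WF log into text for papers or notebooks, can be sketched as a simple template over log records. The field names `who`, `service_name`, `service_version` and `database_version` come from the slide; the record layout as Python dictionaries, the function name `describe_run` and the sample values are assumptions.

```python
def describe_run(log_entries):
    """Turn workflow log records into sentences for a paper or notebook."""
    sentences = []
    for entry in log_entries:
        sentences.append(
            f"{entry['who']} ran {entry['service_name']} "
            f"(version {entry['service_version']}) "
            f"against database version {entry['database_version']}."
        )
    return " ".join(sentences)


# Hypothetical log record; the values are placeholders, not real versions.
log = [
    {
        "who": "user-1",
        "service_name": "blastn",
        "service_version": "x.y",
        "database_version": "r1",
    },
]
print(describe_run(log))
```

Recording the service and database versions with each run is what makes the generated text usable as a methods statement, since the analysis depends on which databank release was queried.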
SIMDAT- Major outcomes to expect
The SIMDAT approach will provide state-of-the-art components
• To enable an industry-strength environment for e-Science activities
• To support academia/industry collaborations in R&D activities (B2B & B2A Grids)
  – B2A Grids: how is the “win-win” model precisely configured?
• To help build up virtual organisations that federate data, services and scientific expertise
Thank you!
Web: http://www.simdat.org
Contact: [email protected]
Acknowledgments
Co-author: Robert Herzog, Université Libre de Bruxelles (ULB)
Scientific expert: Valérie Ledent, ULB
Edmond Godfroid & Bernard Couvreur: Laboratory of Applied Genetics, ULB
SIMDAT colleagues: Joseph Mavor (ULB), Falk Zimmermann (NEC), Changtao Qu (NEC), Nabeel Azam (InforSense), Moustapha Ghanem (InforSense), Kai Kumpf (SCAI-Bio)