Modeling and Storing Scientific Protocols
Natalia Kwasnikowska, Hasselt University, Belgium
Yi Chen and Zoé Lacroix, Arizona State University, AZ, USA
KSinBIT, October 29, 2006
Overview
• Our motivation
• Protocol Model
• Example
• ProtocolDB
• Future Work
Scientific portfolio
• Reproducibility
• Archived experiment
– input and output data
– intermediate data
– detailed description of the process
• Poorly recorded
– paper
– only implementation
Scientific protocol
• Complex process composed of interconnected tasks
– data-analysis pipeline
– workflow, dataflow
• Workflow Management Systems
– Taverna
– Kepler
– Pipeline Pilot
Scientific Protocol
• Germination proportions were analyzed by using program genmod.sas, with the file maxmingermshoots.xls as input and two output files as result: maxshoots diffs.xls and maxshootslsmeans.xls.
• Preprocessing of data necessary for determination of base and optimal temperatures for germination was achieved in two sub-steps. First, Observations.xls was used as input to sample numbers for DAPest.sas resulting in file DAPest sample numbers.xls, was subsequently used as input to DAPest.sas which produced DAPestData.xls. Also, graphing to print.sas was run with Observations.xls as input and produced five bitmaps.
• Base temperature (TB) for germination was determined by two separate methods, only one…
Problems
• Mix the conceptual design level with actual implementation
• Often lack detailed information about used resources
– which version?
– required parameters?
– what data formats?
Problems
• Difficult to track data provenance
– data lineage or data pedigree
– resources may be updated
• Difficult retrieval and comparison of protocols
– limited querying possibilities
Our contribution
• High-level abstract protocol model
• Independent of execution model
• Clear distinction between design and possible implementations
• Explicit mapping between them
• Suitable for storage in database systems
Design and Implementation
Find all available information about proteins involved in the latent stage of multiple sclerosis.
[Figure: data sources shown on the slide: OMIM, PubMed, Medline, EntrezGene, HGNC, RefSeq, IPI, SwissProt, InterPro]
Protocol Design Model
• Set of design tasks
• Design task D:
– name N
– input type i
– output type o
• Set of conceptual types – ontology
• Each task is a base protocol D, with input i and output o
[Figure: a design task D with name N, input i and output o; example task T: Codes_For with input Gene and output Protein]
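The design-task definition above can be sketched in Python. This is only an illustration under stated assumptions: the `ONTOLOGY` dictionary, the parent type `Sequence`, and the names `is_subtype` and `DesignTask` are hypothetical and do not come from the slides; only the task Codes_For (Gene to Protein) is taken from the figure.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy ontology: conceptual type -> parent type (None marks a root).
# "Sequence" is an assumption made for this example only.
ONTOLOGY = {
    "Sequence": None,
    "Gene": "Sequence",
    "Protein": "Sequence",
}


def is_subtype(a: str, b: str) -> bool:
    """Return True if a ≤ b, i.e. a is b or a descendant of b in ONTOLOGY."""
    current: Optional[str] = a
    while current is not None:
        if current == b:
            return True
        current = ONTOLOGY.get(current)
    return False


@dataclass(frozen=True)
class DesignTask:
    """A base protocol D: a name N, an input type i and an output type o."""
    name: str
    input_type: str
    output_type: str


# The example task from the slide figure.
codes_for = DesignTask(name="Codes_For", input_type="Gene", output_type="Protein")
```

Keeping the ontology separate from the tasks means the same subtype test can be reused when tasks are later composed into larger protocols.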
Protocol Design Model
• successor P = P’ · P”
– o’ ≤ i”
– i ≤ i’ and o” ≤ o
• split-merge P = P’ ⊕ P”
– i ≤ i’ ⊕ i” and o ≤ o’ ⊕ o”
[Figure: protocol P with input i and output o, built from P’ (input i’, output o’) and P” (input i”, output o”)]
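A minimal sketch of how the successor and split-merge conditions could be checked when composing protocols. It rests on assumptions that are not in the slides: a ⊕-type is modeled as a tuple of atomic type names compared component by component, and the `PARENT` ontology, the type names `Data` and `Table`, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: a type combined with ⊕ is a tuple of atomic type names.
Type = Tuple[str, ...]

# Hypothetical ontology: atomic type -> parent type (None marks a root).
PARENT = {"Data": None, "SeedData": "Data", "Table": "Data"}


def atom_le(a, b):
    """a ≤ b on atomic types: a is b or a descendant of b."""
    while a is not None:
        if a == b:
            return True
        a = PARENT.get(a)
    return False


def le(s: Type, t: Type) -> bool:
    """s ≤ t on tuple types, checked component by component (assumption)."""
    return len(s) == len(t) and all(atom_le(a, b) for a, b in zip(s, t))


@dataclass(frozen=True)
class Protocol:
    expr: str
    inp: Type
    out: Type


def successor(p1: Protocol, p2: Protocol, i: Type, o: Type) -> Protocol:
    """P = P' · P'' requires o' ≤ i'', i ≤ i' and o'' ≤ o."""
    if not le(p1.out, p2.inp):
        raise TypeError("successor: o' ≤ i'' does not hold")
    if not (le(i, p1.inp) and le(p2.out, o)):
        raise TypeError("successor: i ≤ i' and o'' ≤ o does not hold")
    return Protocol(f"({p1.expr} · {p2.expr})", i, o)


def split_merge(p1: Protocol, p2: Protocol, i: Type, o: Type) -> Protocol:
    """P = P' ⊕ P'' requires i ≤ i' ⊕ i'' and o ≤ o' ⊕ o''."""
    if not (le(i, p1.inp + p2.inp) and le(o, p1.out + p2.out)):
        raise TypeError("split-merge: type conditions do not hold")
    return Protocol(f"({p1.expr} ⊕ {p2.expr})", i, o)


a = Protocol("A", ("SeedData",), ("Table",))
b = Protocol("B", ("Data",), ("Table",))
seq = successor(a, b, ("SeedData",), ("Table",))
par = split_merge(a, b, ("SeedData", "SeedData"), ("Table", "Table"))
```

Raising `TypeError` on a violated condition is a design choice of this sketch; a storage system could equally record the violation and reject the protocol at insert time.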
Protocol Design Model
• k-recursion P^k
– i ≤ i’ and o’ ≤ o
• star-recursion P*
– i ≤ i’ and o’ ≤ o
[Figure: in both cases, a protocol with input i and output o built from P’ (input i’, output o’)]
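The two recursion operators share the same type condition, so one helper can cover both. The sketch below only checks the condition and builds the expression text; it deliberately assigns no execution meaning to the operators. The function `recursion`, the `le` callback, and the type names are hypothetical.

```python
from typing import Callable, NamedTuple, Optional


class Protocol(NamedTuple):
    expr: str
    inp: str
    out: str


def recursion(p: Protocol, i: str, o: str,
              le: Callable[[str, str], bool],
              k: Optional[int] = None) -> Protocol:
    """Build the k-recursion (k given) or star-recursion (k is None) of p.

    Both forms require i ≤ i' and o' ≤ o, where i' and o' belong to p.
    """
    if not (le(i, p.inp) and le(p.out, o)):
        raise TypeError("recursion: i ≤ i' and o' ≤ o does not hold")
    suffix = "*" if k is None else f"^{k}"
    return Protocol(f"({p.expr}){suffix}", i, o)


# Hypothetical subtype relation: equality, plus SeedData ≤ Data.
def le(a: str, b: str) -> bool:
    return a == b or (a, b) == ("SeedData", "Data")


inner = Protocol("P'", "Data", "Data")
three_times = recursion(inner, "SeedData", "Data", le, k=3)
star = recursion(inner, "SeedData", "Data", le)
```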
Protocol Implementation Model
• Similar to protocol design model
• Set of application names
– instead of design task names
• Set of format names
– instead of conceptual type names
• Imposes equality of format names
– instead of subtyping
Mapping Design to Implementation
• Conceptual type mapping
– Gene → Genbank format
– Gene → FASTA format
– SeedData → Excel Spreadsheet
• Protocol design task mapping
– each design task is mapped to an implementation protocol
– consistent with the conceptual type mapping
• Protocol design mapping
– homomorphic extension
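One way to read "consistent with the conceptual type mapping" is sketched below: an implementation step is accepted for a design task only if its input and output formats are among the formats the task's conceptual types map to. That reading is an assumption, as are the function name `is_consistent` and the choice of SeedData as input and output type of the Proportions task; the type mapping entries and the program genmod.sas come from the slides.

```python
from typing import Dict, NamedTuple, Set


class DesignTask(NamedTuple):
    name: str
    input_type: str
    output_type: str


class ImplementationStep(NamedTuple):
    application: str
    input_format: str
    output_format: str


# Conceptual type mapping from the slide: one type may map to several formats.
TYPE_MAPPING: Dict[str, Set[str]] = {
    "Gene": {"Genbank format", "FASTA format"},
    "SeedData": {"Excel Spreadsheet"},
}


def is_consistent(task: DesignTask, step: ImplementationStep,
                  type_mapping: Dict[str, Set[str]]) -> bool:
    """Check that the step's formats are allowed for the task's types."""
    return (
        step.input_format in type_mapping.get(task.input_type, set())
        and step.output_format in type_mapping.get(task.output_type, set())
    )


# Illustrative pairing; the SeedData types for this task are assumed.
proportions = DesignTask("Proportions", "SeedData", "SeedData")
genmod = ImplementationStep("genmod.sas", "Excel Spreadsheet", "Excel Spreadsheet")
wrong = ImplementationStep("genmod.sas", "FASTA format", "Excel Spreadsheet")
```

Here `is_consistent(proportions, genmod, TYPE_MAPPING)` holds, while the `wrong` step is rejected because FASTA format is not a format mapped from SeedData.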
Scientific Protocol
A pair of
• a protocol design
• a set of protocol implementations together with
– conceptual type mapping
– protocol design task mapping
Germination Protocol
• D1: MaxGermination
• D2: Proportions
• D3: Preprocessing
• D4: BaseTemp
• D5: BaseOptTemp
[Figure: the five design tasks connected in a data flow, with SeedData labels on the flow]
PD = (D1 · D2) ⊕ (D3 · (D4 ⊕ D5))
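The design expression PD can be represented as a small expression tree, which is one plausible shape for storing it. The class names `Task`, `Successor`, and `SplitMerge` and the helpers `render` and `leaves` are hypothetical; only the task labels, task names, and the structure of PD are taken from the slide.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Task:
    label: str
    name: str


@dataclass(frozen=True)
class Successor:  # P' · P''
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class SplitMerge:  # P' ⊕ P''
    left: "Node"
    right: "Node"


Node = Union[Task, Successor, SplitMerge]


def render(node: Node, top: bool = True) -> str:
    """Print the tree as an expression, parenthesizing nested composites."""
    if isinstance(node, Task):
        return node.label
    op = " · " if isinstance(node, Successor) else " ⊕ "
    text = render(node.left, False) + op + render(node.right, False)
    return text if top else f"({text})"


def leaves(node: Node) -> List[str]:
    """List the design task labels from left to right."""
    if isinstance(node, Task):
        return [node.label]
    return leaves(node.left) + leaves(node.right)


D1 = Task("D1", "MaxGermination")
D2 = Task("D2", "Proportions")
D3 = Task("D3", "Preprocessing")
D4 = Task("D4", "BaseTemp")
D5 = Task("D5", "BaseOptTemp")

PD = SplitMerge(Successor(D1, D2), Successor(D3, SplitMerge(D4, D5)))
print(render(PD))
```

A tree like this maps naturally onto parent and child rows in a relational table, which is one reason a compositional model suits database storage.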
Germination Protocol
[Figure: design tasks D1, D2, D3, D4, D5 with implementation steps I10: Pho341.sas, I11: Pho341.sas, I12: Broken.sas; the group I10 to I12 is annotated with k]
(I10 · I11 · I12)^k
Germination Protocol
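(I7^m · I8 · I9) ⊕ (I10 · I11 · I12)^k
[Figure: design tasks D1, D2, D3, D4, D5 with implementation steps I7: Reg.sas (annotated with m), I8: Merge.sas, I9: Mixed.sas, I10: Pho341.sas, I11: Pho341.sas, I12: Broken.sas; the group I10 to I12 is annotated with k]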
Benefits of our approach
• Scientific protocols are modeled at two levels
– design
– implementation
• One design may have different implementations
– easier to compare results
– facilitates integration
ProtocolDB
http://bioinformatics.eas.asu.edu/protocoleDatabase.htm
Future work
• Operator semantics
• Extending model with data provenance
• Querying data provenance
• Querying of protocols
– retrieval of similar protocols
• Further development of ProtocolDB
http://bioinformatics.eas.asu.edu/protocoleDatabase.htm