Modeling and Storing Scientific Protocols Natalia Kwasnikowska Hasselt University, Belgium Yi Chen and Zoé Lacroix Arizona State University, AZ, USA KSinBIT October 29, 2006




Overview

• Our motivation

• Protocol Model

• Example

• ProtocolDB

• Future Work


Scientific portfolio

• Reproducibility
• Archived experiment
  – input and output data
  – intermediate data
  – detailed description of the process
• Poorly recorded
  – paper
  – only implementation


Scientific protocol

• Complex process composed of interconnected tasks
  – data-analysis pipeline
  – workflow, dataflow
• Workflow Management Systems
  – Taverna
  – Kepler
  – Pipeline Pilot


Scientific Protocol

• Germination proportions were analyzed by using program genmod.sas, with the file maxmingermshoots.xls as input and two output files as result: maxshoots diffs.xls and maxshoots lsmeans.xls.

• Preprocessing of data necessary for determination of base and optimal temperatures for germination was achieved in two sub-steps. First, Observations.xls was used as input to sample numbers for DAPest.sas, resulting in file DAPest sample numbers.xls, which was subsequently used as input to DAPest.sas, which produced DAPestData.xls. Also, graphing to print.sas was run with Observations.xls as input and produced five bitmaps.

• Base temperature (TB) for germination was determined by two separate methods, only one…


Problems

• Mix the conceptual design level with actual implementation
• Often lack detailed information about used resources
  – which version?
  – required parameters?
  – what data formats?


Problems

• Difficult to track data provenance
  – data lineage or data pedigree
  – resources may be updated
• Difficult retrieval and comparison of protocols
  – limited querying possibilities


Our contribution

• High-level abstract protocol model
• Independent of execution model
• Clear distinction between design and possible implementations
• Explicit mapping between them
• Suitable for storage in database systems


Design and Implementation

Find all available information about proteins involved in the latent stage of multiple sclerosis.


[Figure: data sources shown in the diagram: OMIM, PubMed, Medline, EntrezGene, HGNC, RefSeq, IPI, SwissProt, InterPro]


Protocol Design Model

• Set of design tasks
• Design task D:
  – name N
  – input type i
  – output type o
• Set of conceptual types (ontology)
• Each task is a base protocol D, with input i and output o

[Figure: a base protocol D: N with input i and output o; example task T: Codes_For with input Gene and output Protein]
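A minimal Python sketch of the design model on this slide. The names `DesignTask`, `ONTOLOGY`, and `is_subtype` are illustrative, and the parent type `Sequence` is a made-up ontology entry. The example task Codes_For (Gene to Protein) is read off the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignTask:
    """A base protocol: a named task with one input and one output conceptual type."""
    name: str
    input_type: str
    output_type: str


# Hypothetical toy ontology (child type -> parent type); not taken from the slides.
ONTOLOGY = {"Gene": "Sequence", "Protein": "Sequence"}


def is_subtype(sub, sup, ontology=ONTOLOGY):
    """Return True if `sub` equals `sup` or reaches it by following parent links."""
    current = sub
    while current is not None:
        if current == sup:
            return True
        current = ontology.get(current)
    return False


# Example task as read from the figure: Codes_For takes a Gene and yields a Protein.
codes_for = DesignTask(name="Codes_For", input_type="Gene", output_type="Protein")
```

Keeping the ontology as a separate mapping means the same task definitions can be checked against a richer type hierarchy later.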


Protocol Design Model

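successor: P = P’ · P”
  constraints: i ≤ i’, o’ ≤ i”, and o” ≤ o

split-merge: P = P’ ⊕ P”
  constraints: i ≤ i’ ⊕ i” and o ≤ o’ ⊕ o”

[Figure: in both diagrams the protocols P’ (input i’, output o’) and P” (input i”, output o”) are enclosed in the composite protocol P with input i and output o]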
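The typing constraints for successor and split-merge can be sketched as checks in Python. This is an illustration under assumptions, not the authors' implementation: `Protocol`, `leq`, `successor`, and `split_merge` are hypothetical names, the `PARENT` ontology is invented, and a ⊕-type is represented as a tuple compared componentwise, which the slides do not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    label: str
    i: object  # input type: a type name, or a tuple standing for a ⊕-type
    o: object  # output type, same representation


# Hypothetical toy ontology (child type -> parent type).
PARENT = {"Gene": "Sequence", "Protein": "Sequence"}


def leq(a, b):
    """a ≤ b. Assumption: tuples (⊕-types) are compared componentwise."""
    if isinstance(a, tuple) or isinstance(b, tuple):
        return (isinstance(a, tuple) and isinstance(b, tuple)
                and len(a) == len(b)
                and all(leq(x, y) for x, y in zip(a, b)))
    while a is not None:
        if a == b:
            return True
        a = PARENT.get(a)
    return False


def successor(p1, p2, i, o):
    """Build P = P' · P'' if i ≤ i', o' ≤ i'' and o'' ≤ o hold."""
    if not (leq(i, p1.i) and leq(p1.o, p2.i) and leq(p2.o, o)):
        raise TypeError(f"ill-typed successor of {p1.label} and {p2.label}")
    return Protocol(f"({p1.label} · {p2.label})", i, o)


def split_merge(p1, p2, i, o):
    """Build P = P' ⊕ P'' if i ≤ i' ⊕ i'' and o ≤ o' ⊕ o'' hold, as stated on the slide."""
    if not (leq(i, (p1.i, p2.i)) and leq(o, (p1.o, p2.o))):
        raise TypeError(f"ill-typed split-merge of {p1.label} and {p2.label}")
    return Protocol(f"({p1.label} ⊕ {p2.label})", i, o)


a = Protocol("A", "Gene", "Protein")
b = Protocol("B", "Sequence", "Sequence")
ab = successor(a, b, "Gene", "Sequence")  # accepted: Protein ≤ Sequence
```

Swapping the operands, `successor(b, a, "Gene", "Protein")`, raises `TypeError` because Sequence is not a subtype of Gene in the toy ontology.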


Protocol Design Model

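k-recursion: P^k
  constraints: i ≤ i’ and o’ ≤ o

star-recursion: P*
  constraints: i ≤ i’ and o’ ≤ o

[Figure: in both diagrams an inner protocol P’ (input i’, output o’) is enclosed in an outer protocol with input i and output o]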


Protocol Implementation Model

• Similar to protocol design model
• Set of application names
  – instead of design task names
• Set of format names
  – instead of conceptual type names
• Imposes equality of format names
  – instead of subtyping
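A small Python sketch of the difference from the design model: implementation tasks connect only when format names are equal. The class `Implementation`, the function `can_follow`, and the tool `plot_tool` are hypothetical. The formats given for genmod.sas are an assumption based on the .xls files mentioned in the germination example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    """An implementation task: an application name with input and output format names."""
    application: str
    input_format: str
    output_format: str


def can_follow(first, second):
    """Successor is allowed only if the format names are equal (no subtyping)."""
    return first.output_format == second.input_format


# Assumed formats: genmod.sas reads and writes .xls files in the germination example.
genmod = Implementation("genmod.sas", "Excel spreadsheet", "Excel spreadsheet")
# Hypothetical application, used only to show a format mismatch.
plot_tool = Implementation("plot_tool", "FASTA format", "bitmap")

ok = can_follow(genmod, genmod)          # same format on both sides
mismatch = can_follow(genmod, plot_tool)  # Excel spreadsheet vs FASTA format
```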


Mapping Design to Implementation

• Conceptual type mapping
  – Gene → GenBank format
  – Gene → FASTA format
  – SeedData → Excel spreadsheet
• Protocol design task mapping
  – each design task is mapped to an implementation protocol
  – consistent with the conceptual type mapping
• Protocol design mapping
  – homomorphic extension of the design task mapping
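The three mappings can be sketched in Python as follows. This is one possible reading, not the authors' definition: expressions are nested tuples, `map_design` and `consistent` are hypothetical names, the task mapping `TASK_MAP` with its implementation names is invented, and "consistent" is interpreted as "each implementation format is one of the formats the conceptual type maps to".

```python
# Conceptual type mapping from the slide; one type may map to several formats.
TYPE_MAP = {
    "Gene": {"GenBank format", "FASTA format"},
    "SeedData": {"Excel spreadsheet"},
}

# Hypothetical design task mapping: each design task maps to an implementation
# protocol, which may itself be a composite expression.
TASK_MAP = {
    "D1": "I1",
    "D2": ("·", "I2", "I3"),
}


def consistent(task_io, impl_io, type_map=TYPE_MAP):
    """Check an (input, output) format pair against an (input, output) type pair."""
    task_in, task_out = task_io
    impl_in, impl_out = impl_io
    return impl_in in type_map[task_in] and impl_out in type_map[task_out]


def map_design(expr, task_map=TASK_MAP):
    """Homomorphic extension: map the tasks, keep the operators unchanged.

    `expr` is a task name (str) or a tuple (op, left, right) with op in {"·", "⊕"}.
    """
    if isinstance(expr, str):
        return task_map[expr]
    op, left, right = expr
    return (op, map_design(left, task_map), map_design(right, task_map))


implementation = map_design(("⊕", "D1", "D2"))
```

Because the extension only replaces leaves, the implementation keeps the shape of the design, which is what makes two implementations of one design comparable.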


Scientific Protocol

A pair of
• a protocol design
• a set of protocol implementations
together with
  – conceptual type mapping
  – protocol design task mapping


Germination Protocol

D1: MaxGermination
D2: Proportions
D3: Preprocessing
D4: BaseTemp
D5: BaseOptTemp

[Figure: design diagram over tasks D1 to D5; the type label SeedData appears twice]

PD = (D1 · D2) ⊕ (D3 · (D4 ⊕ D5))
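The design expression PD can be written down as a small tree, for example in Python. The helpers `seq`, `par`, `render`, and `tasks` are illustrative names, and the nested-tuple encoding is an assumption about how such an expression might be stored.

```python
def seq(p, q):
    """Successor P' · P''."""
    return ("·", p, q)


def par(p, q):
    """Split-merge P' ⊕ P''."""
    return ("⊕", p, q)


# PD = (D1 · D2) ⊕ (D3 · (D4 ⊕ D5)), with tasks referred to by their labels.
PD = par(seq("D1", "D2"), seq("D3", par("D4", "D5")))


def render(expr, top=True):
    """Turn the tree back into the infix notation used on the slide."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    text = f"{render(left, False)} {op} {render(right, False)}"
    return text if top else f"({text})"


def tasks(expr):
    """List the design tasks of an expression from left to right."""
    if isinstance(expr, str):
        return [expr]
    _, left, right = expr
    return tasks(left) + tasks(right)


print(render(PD))  # (D1 · D2) ⊕ (D3 · (D4 ⊕ D5))
print(tasks(PD))   # ['D1', 'D2', 'D3', 'D4', 'D5']
```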


Germination Protocol

[Figure: design tasks D1 to D5 with implementation tasks I10 to I12 attached and a repetition marked k]

I10: Pho341.sas
I11: Pho341.sas
I12: Broken.sas

(I10 · I11 · I12)^k


Germination Protocol

[Figure: the previous diagram extended with implementation tasks I7 to I9 and a repetition marked m]

I7: Reg.sas
I8: Merge.sas
I9: Mixed.sas

(I7^m · I8 · I9) ⊕ (I10 · I11 · I12)^k


Benefits of our approach

• Scientific protocols are modeled at two levels
  – design
  – implementation
• One design may have different implementations
  – easier to compare results
  – facilitates integration


ProtocolDB

http://bioinformatics.eas.asu.edu/protocoleDatabase.htm


Future work

• Operator semantics
• Extending model with data provenance
• Querying data provenance
• Querying of protocols
  – retrieval of similar protocols
• Further development of ProtocolDB
