22
Automatic vs Manual Provenance Abstractions: Mind the Gap Pınar Alper TAPP 9 June 2016 Carole A. GOBLE University of Manchester Khalid BELHAJJAME Université Paris Dauphine Pınar ALPER University of Manchester

Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Automatic vs Manual Provenance

Abstractions: Mind the Gap

Pınar Alper

TAPP

9 June 2016

Carole A. GOBLE

University of

Manchester

Khalid BELHAJJAME

Université Paris

Dauphine

Pınar ALPER

University of

Manchester

Page 2: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Transparency brought by Provenance

can be a double edged sword

Too revealing

validation formData2formData1

wgbusd

formData3formData1wib

sanitisationformData

3

usdwgb

Secured Provenance

Page 3: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Too complex

resultinput

wgbinput

adapt

2

Transparency brought by Provenancecan be a double edged sword

simulation

Simulation

process result

adapt

1adapt

3

usd wgb wgb wgb wgb

Simplified Provenance

usd usd usd

usd

Page 4: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

– Reports are experimental metadata on:

• Method

• Data

“Toward interoperable bioscience data.” Nature Genetics, 44(2):121–126, February 2012.

“Scientific Data”, Open-Access Journal. Nature Publishing Group, 2015, http://scientificdata.isa-explorer.org

“Best Practices for Workflow Design: How to Prevent Workflow Decay.” In Proceedings of SWAT4LS, November 2012.

Data

AnnotationsData

B

undle Workflow

&

PROV-O

Compare Manual and Semi-automated abstractions

that simplify workflows in the context of reporting

data-oriented experiments.

Our Goal & Context

Page 5: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Workflow complexity necessitates abstraction

• Up to 50+ data

processing tasks.

• Majority (70 %)

dedicated to data

adaptation

• Leads to complex

provenance.

“Common Motifs in Scientific Workflows: An empirical analysis.” Future Generation Comp. Sys. 36: 338-351 (2014).

Page 6: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

• Manual, Design Abstractions observable in

existing workflows

• Embedded into design, static

Current Approach

Page 7: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Bookmarked Intermediaries

Design Abstractions

Page 8: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Alternative Approach

Semi-automated abstraction systems:

– dynamic

– Workflow Summaries & ZOOM UserViews

“Small is beautiful: Summarizing scientific workflows using semantic annotations.” BigData 2013, pages 318-325.

“Querying and Managing Provenance through User Views in Scientific Workflows.” ICDE 2008, pages 1072–1081.

• ProvAbs, IPAW

2014

• Propub, SSDBM

2011

• TACLP, FGCS 2015

• SecProv, WAIM 2008

• Provenance

Redaction, SACMAT

2011

• Surrogates, VLDB

2011

Page 9: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Workflow Abstraction Primitives

Task grouping Task elimination

Task A

Task B

df

df

df

inA

outA

inB

outB

Task A & B

inA

outB

df

df

Task A

Task B

df

df

inA

outA

inB

outB

Task C

inC

outC

Task A

indirect

df

inA

outA

Task C

inC

outC

Page 10: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Integrity Policy: Soundness

a

d1

b

d2d

3 d4

d5

d1d2

d4d5

a&b

In the context of reporting soundness can be compromised.

usd

wg

b

Sub-workflow based design abstractions do not necessarily

preserve soundness.

usd

wg

b

usd

wg

b

Page 11: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Integrity Policy: Acyclicity

a

d1

c

d2

d4

d5

b

d3

d

1

d5

a & c

d2

d4

b

From a modelling perspective cycles are allowed in provenance.

usd

wg

b

usd

wg

b

Page 12: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Cyclic dataflows unfold into Acyclic Lineage

a

b

a.1

b.1

a.2

b.1

df

df

df

df

d1

d2

d3

d4

d4

In the context of reporting (e.g.

data tables, design abstractions)

cycles are not observed.

Raw workflow provenance is

acyclic.

usd

wgb

usd

usd

usd

wgb

wgb

wgb

Page 13: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Integrity Policy: Bipartiteness

d1

d2

a

Workflow provenance is bi-partite.

Design abstractions preserve bipartiteness.

In the context of reporting bi-partiteness is advised but not always

achieved.

wg

b

usd

Page 14: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Integrity Policy: Validity

a

d1

b

d2 d3

d5

a

d1

d2

d5

Design abstractions preserve validity.

In the context of reporting validity is a necessary property.

usd

?

?

usd

wg

b

usd

wg

b

Page 15: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Integrity Policy: Completeness

a

d1

b

d2 d3d4

d5

d1d2

d4d5

a&b

Determined by preservation of lineage relations

Design abstractions preserve completeness

In the context of reporting completeness is a necessary property.

usd

wg

b

usd

wg

b

usd

wg

b

Page 16: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Comparison of Systems

Workflow Summaries ZOOM Design Abstractions

Abstraction Policy Annotation-Primitive pairs Important Task Ids --

Primitive Grouping Elimination Grouping Grouping

Integrity Policy

Validity ✓ ✓ ✓ ✓

Soundness ✓ ✓

Bipartiteness ✓ ✓ ✓

Completeness ✓ ✓ ✓ ✓

Acyclicity ✓ ✓ ✓

Page 17: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Retrieve Data

• Design Abstractions as groundtruth– Ports on main data derivation path

– Sub-workflow tasks.

• Workflow Summaries

– Eliminate All Adapters

– Collapse All Adapters

– Collapse

• ZOOM

– Non-Adapter tasks designated as significant

Pack

parameters

Unpack results

Change format

Perform

Analysis

Comparison Against

Design Abstraction

Page 18: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Task Elimination

total # of processes in the

abstracted account

A significant task and its report worthy output

are not necessarily co-located

Hopping over traces does not simplify the

account data –wise as as it does process-

wise

Process

Precision

Data

Precision

# of processes in the abstracted

account overlapping w user’s

abstraction

Page 19: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Task Grouping

Process Data

ZOOM’s soundness policy

creates two extra groups

Where you put the boundary to groups

matters for data abstraction!

Page 20: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Task Grouping

Process Data

Abstracting selectively (less aggressively) by taking

activity function (hence I/O characteristics) into

account.

Page 21: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Conclusions

• Abstraction systems focus on the process and do not directly cater for data significance.

• Are these observations generalisable?

– Elimination unsuited for WFs with pack/unpack steps

– Sweeping style grouping isn’t helpful for data abstraction

– Selective grouping relies on domain-specific policies.

• End-use informs suitable integrity policies.

• Scientists are accountable for reports. They are likely to favor having final say on the abstraction.

• Rethink abstraction as a prehoc process supporting workflow design.

“TOWARDS HARNESSING COMPUTATIONAL WORKFLOW PROVENANCE FOR EXPERIMENT REPORTING”, PhD

Dissertation, University of Manchester E-Scholar Repository. https://www.escholar.manchester.ac.uk/uk-ac-man-scw:300560

Page 22: Automatic vs Manual Provenance Abstractions: Mind the Gap€¦ · –Reports are experimental metadata on: •Method •Data “Toward interoperable bioscience data.” Nature Genetics,

Thank You!