On Provenance of Queries on Linked Web Data 1,2 Yannis Theoharis, 2 Irini Fundulaki, 3,2 Grigoris Karvounarakis and 1,2 Vassilis Christophides 1 Institute of Computer Science, FORTH and 2 Computer Science Department, University of Crete 3 LogicBox, USA


Page 1: On Provenance of Queries on  Linked Web Data

On Provenance of Queries on Linked Web Data

1,2Yannis Theoharis, 2Irini Fundulaki, 3,2Grigoris Karvounarakis and 1,2Vassilis Christophides

1Institute of Computer Science, FORTH and

2Computer Science Department, University of Crete

3LogicBox, USA

Page 2: On Provenance of Queries on  Linked Web Data

What is “Linked Data”

W3C Linking Open Data

publish various open datasets as RDF on the Web

set RDF typed links between data items from different data sources.

Page 3: On Provenance of Queries on  Linked Web Data

Motivation: Linked Data Processing

Data is:

fetched from heterogeneous sources

integrated

materialized in RDF

made available via SPARQL

Range of computations:

SPARQL queries

Complex programs (logic or procedural)

Page 4: On Provenance of Queries on  Linked Web Data

Provenance Aware Applications

Trust assessment: trustworthiness

Access control: confidentiality level

Data cleaning: validity

Curated databases: source data origin

All these applications need to represent and store the relation of the input with the output of data processes

gain efficiency

impossible without provenance

Page 5: On Provenance of Queries on  Linked Web Data

Data Provenance Models

Annotation Models: annotation computation coupled with a particular application and a particular assignment of source data annotations

t: trusted, f: untrusted

R1
X  Y  Annot.
a  b  t
c  d  t

R2
Y  Z  Annot.
b  e  t

R1 ⋈ R2
X  Y  Z  Annot.
a  b  e  t Λ t = t

If a source annotation changes from t to f, the result annotation becomes t Λ f = f: query recomputation!

Abstract Provenance Models: abstract provenance tokens and operators are substituted by appropriate concrete tokens for a particular application and assignment

R1
X  Y  Annot.
a  b  c1
c  d  c2

R2
Y  Z  Annot.
b  e  c3

R1 ⋈ R2
X  Y  Z  Annot.
a  b  e  c1 * c3

Substituting t or f for c1 and c3 yields the concrete annotation, e.g. t Λ t = t or t Λ f = f.

Page 6: On Provenance of Queries on  Linked Web Data

This Talk

“Can previous work on abstract provenance models be leveraged for SPARQL?”

NO: due to the OPTIONAL operator (similar to the SQL left outer join)

YES: for the positive fragment of SPARQL (without OPTIONAL)

We present our ongoing work on a SPARQL abstract provenance model.

Challenge: to capture the form of negation that OPTIONAL introduces

Page 7: On Provenance of Queries on  Linked Web Data

Outline

SPARQL algebra

Abstract Provenance Models for Positive SPARQL

Limitations of Previous Models

Towards a SPARQL Provenance Model


Page 9: On Provenance of Queries on  Linked Web Data

SPARQL (1/2)

SPARQL: W3C Recommendation language to query RDF data.

Evaluation pipeline: triple patterns, e.g. (?x, ?y, e) → mappings, e.g. {(?x,d),(?y,b)} and {(?x,f),(?y,g)} → Compose / Filter → mappings → Select or Construct / Describe

Triple Set
S  P  O
a  b  c
d  b  e
f  g  e

Triple pattern (?x, ?y, e): ?x and ?y are variables, e is a constant

Ω1 (mappings that match the pattern)
    ?x  ?y
μ1  d   b
μ2  f   g
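As a rough illustration of how a triple pattern produces mappings, the following Python sketch matches the pattern (?x, ?y, e) against the triple set of this slide. The function name `match_pattern` and the representation of a mapping as a dictionary are assumptions made for this example.

```python
def match_pattern(pattern, triples):
    """Return the bag (list) of mappings obtained by matching one triple pattern."""
    results = []
    for triple in triples:
        mapping = {}
        matched = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to the value; a repeated variable
                # must bind to the same value everywhere.
                if mapping.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                # A constant must be equal to the triple component.
                matched = False
                break
        if matched:
            results.append(mapping)
    return results


triples = [("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")]
omega1 = match_pattern(("?x", "?y", "e"), triples)
print(omega1)
```

The two resulting dictionaries correspond to μ1 and μ2 in Ω1.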

Page 10: On Provenance of Queries on  Linked Web Data

SPARQL (2/2)

SPARQL algebra defines 5 operators on mapping bags

Unary ops: π (projection), σ (selection, also called filtering)

Binary ops: U (union), ⋈ (join), ⟕ (optional)

Positive SPARQL (SPARQL+): π, σ, U, ⋈

Projection

Ω
?x  ?y
a   b
a   c
d   e

π?x (Ω)
    ?x
μ1  a
μ2  d

card(μ1) = 2, card(μ2) = 1

Selection

σ?x=a (Ω)
?x  ?y
a   b
a   c

Union

Ω1
    ?x  ?y
μ1  a   b

Ω2
    ?x  ?z
μ2  c   d

Ω1 U Ω2
    ?x  ?y  ?z
μ1  a   b   -
μ2  c   -   d

?z is unbound in μ1

Join

μ and μ’ are compatible (μ ~ μ’), if they agree in their common variables

Ω1
    ?x  ?y
μ1  a   b
μ2  c   d
μ3  e   -

Ω2
    ?y  ?z
μ4  b   f

μ1 ~ μ4, μ3 ~ μ4, μ2 ≁ μ4

Ω1 ⋈ Ω2
                ?x  ?y  ?z
μ5 = μ1 U μ4    a   b   f
μ6 = μ3 U μ4    e   b   f

Optional

Ω1
    ?x  ?y
μ1  a   b
μ2  c   d

Ω2
    ?y  ?z
μ3  b   f

Ω1 ⟕ Ω2
                ?x  ?y  ?z
μ4 = μ1 U μ3    a   b   f    (from Ω1 ⋈ Ω2)
μ2              c   d   -    (from Ω1 \ Ω2)
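The operators on this slide can be sketched over bags of mappings in Python. This is a minimal sketch under stated assumptions: a mapping is a dictionary, an unbound variable is simply absent from it, a bag is a list, and all function names are chosen for this example. Difference keeps the mappings of the left bag that have no compatible mapping in the right bag, as in the slide's examples.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def project(omega, variables):
    """Projection keeps duplicates, so cardinalities are preserved."""
    return [{v: m[v] for v in variables if v in m} for m in omega]


def select(omega, predicate):
    return [m for m in omega if predicate(m)]


def union(omega1, omega2):
    return omega1 + omega2


def join(omega1, omega2):
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]


def difference(omega1, omega2):
    return [m1 for m1 in omega1 if not any(compatible(m1, m2) for m2 in omega2)]


def optional(omega1, omega2):
    """Optional is the union of the join and the difference."""
    return union(join(omega1, omega2), difference(omega1, omega2))


# Join example of the slide: (e, -) has ?y unbound, so it is compatible with (b, f).
join_left = [{"?x": "a", "?y": "b"}, {"?x": "c", "?y": "d"}, {"?x": "e"}]
join_right = [{"?y": "b", "?z": "f"}]
joined = join(join_left, join_right)

# Optional example of the slide.
opt_left = [{"?x": "a", "?y": "b"}, {"?x": "c", "?y": "d"}]
opt_result = optional(opt_left, [{"?y": "b", "?z": "f"}])
print(joined, opt_result)
```

Keeping bags as lists means that projecting (a, b), (a, c), (d, e) on ?x returns the mapping for a twice, which matches card(μ1) = 2 above.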

Page 11: On Provenance of Queries on  Linked Web Data

Outline

SPARQL algebra

Abstract Provenance Models for Positive SPARQL

Limitations of Previous Models

Towards a SPARQL Provenance Model

Page 12: On Provenance of Queries on  Linked Web Data

Abstract Provenance Models

Abstract provenance models encode the query operators at different levels of detail

Expressiveness vs efficiency (annotation storage and computation time)

Figure: the SPARQL evaluation pipeline (triple patterns → mappings → Compose / Filter → mappings → Select)

Provenance models, from most informative to least informative: How, Trio, Why, Lineage

Page 13: On Provenance of Queries on  Linked Web Data

Abstract Provenance Models for SPARQL+

Previous models are defined for positive relational algebra

Positive relational operators are monotonic

The addition (removal) of a tuple can only result in additional (removed) tuples in the output

This also holds for SPARQL+ (projection, union, join)

Previous models suffice for SPARQL+

Page 14: On Provenance of Queries on  Linked Web Data

Outline

SPARQL algebra

Abstract Provenance Models for Positive SPARQL

Limitations of Previous Models

Towards a SPARQL Provenance Model

Page 15: On Provenance of Queries on  Linked Web Data

Boolean trust assessment (SPARQL)

boolean trust semantics: set semantics on trusted mappings

Ω1
    ?x  ?y
μ1  d   b
μ2  f   g

Ω2
    ?y  ?z
μ3  b   c
μ4  e   h

Trusted: μ1, μ2, μ3, μ4

Ω1 \ Ω2
    ?x  ?y
μ2  f   g

Ω1 ⟕ Ω2
    ?x  ?y  ?z
μ5  d   b   c
μ2  f   g   -

Trusted: μ1, μ2, μ4

Ω1 \ Ω2
    ?x  ?y  ?z
μ1  d   b   -
μ2  f   g   -

Ω1 ⟕ Ω2
    ?x  ?y  ?z
μ1  d   b   -
μ2  f   g   -

⟕ and \ are not monotonic: when μ3 becomes untrusted, μ5 becomes untrusted and μ1 becomes trusted in Ω1 ⟕ Ω2
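The non-monotonic behaviour can be reproduced with a short Python sketch. It assumes that mappings are dictionaries with unbound variables left out, and that boolean trust is modelled by evaluating the operator over the trusted mappings only; the helper names are hypothetical.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def optional(omega1, omega2):
    """Left outer join on mappings: join where possible, keep the rest unchanged."""
    result = []
    for m1 in omega1:
        partners = [m2 for m2 in omega2 if compatible(m1, m2)]
        if partners:
            result.extend({**m1, **m2} for m2 in partners)
        else:
            result.append(dict(m1))
    return result


mu1 = {"?x": "d", "?y": "b"}
mu2 = {"?x": "f", "?y": "g"}
mu3 = {"?y": "b", "?z": "c"}
mu4 = {"?y": "e", "?z": "h"}

# All source mappings trusted.
before = optional([mu1, mu2], [mu3, mu4])
# mu3 becomes untrusted, so it is dropped from the input.
after = optional([mu1, mu2], [mu4])
print(before, after)
```

Dropping μ3 from the input removes (d, b, c) from the output but also adds (d, b, -), so a smaller input does not give a subset of the previous output.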

Page 16: On Provenance of Queries on  Linked Web Data

Perm

Ω1
    ?x  ?y
μ1  d   b
μ2  f   g

Ω2
    ?y  ?z
μ3  b   c
μ4  e   h

Ω1 \ Ω2
?x  ?y  ?y2  ?z2
f   g   b    c
f   g   e    h

Intuitively, (f, g) is in Ω1 \ Ω2 because it is compatible with neither μ3 nor μ4

Ω1 ⟕ Ω2
?x  ?y  ?z  ?x1  ?y1  ?y2  ?z2
d   b   c   d    b    b    c
f   g   -   f    g    b    c
f   g   -   f    g    e    h

(d, b, c) is in Ω1 ⟕ Ω2 due to the join between μ1 and μ3

If μ3 becomes untrusted, Perm infers that (d, b, c) becomes untrusted, but cannot infer that (d, b, -) should become trusted

Page 17: On Provenance of Queries on  Linked Web Data

RDF Meta Knowledge & M-semirings

Ω1
    ?x  ?y  Annot.
μ1  d   b   c1
μ2  f   g   c2

Ω2
    ?y  ?z  Annot.
μ3  b   c   c3
μ4  e   h   c4

Ω1 \ Ω2
    ?x  ?y  RDF MK             M-semirings
μ2  f   g   c2 Λ ¬(c3 V c4)    c2 ⊖ 0 = c2

Ω1 ⟕ Ω2
    ?x  ?y  ?z  RDF MK             M-semirings
μ5  d   b   c   c1 Λ c3            c1 * c3
μ2  f   g   -   c2 Λ ¬(c3 V c4)    c2

Like Perm, RDF Meta Knowledge and M-semirings infer that μ5 is untrusted but cannot infer that μ1: (d, b, -) is trusted.

Page 18: On Provenance of Queries on  Linked Web Data

Outline

SPARQL algebra

Abstract Provenance Models for Positive SPARQL

Limitations of Previous Models

Towards a SPARQL Provenance Model

Page 19: On Provenance of Queries on  Linked Web Data

A Third Operation for Compatibility (1/2)

Take care of compatible mappings

Only one of μ1, μ5 can appear in the result

Keep provenance information for both of them!

Ω1
    ?x  ?y  Annot.
μ1  d   b   c1
μ2  f   g   c2

Ω2
    ?y  ?z  Annot.
μ3  b   c   c3
μ4  e   h   c4

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) U (Ω1 \ Ω2)
    ?x  ?y  ?z  How      SPARQL Prov.
μ5  d   b   c   c1*c3    c1*c3
μ1  d   b   -   No Info  c1*A(μ1, μ3)
μ2  f   g   -   c2       c2

Under boolean trust, c1*c3 is (t Λ t) = t when μ3 is trusted and (t Λ f) = f when it is not

A(μ1, μ3) =
    f, if μ1 ~ μ3 and c3 = t
    t, else

Page 20: On Provenance of Queries on  Linked Web Data

A Third Operation for Compatibility (2/2)

A is a binary operator on mappings

Determines whether the mapping exists in the result or not

If yes, its provenance equals the positive provenance part, e.g. c1 for c1*A(μ1, μ3)

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) U (Ω1 \ Ω2)
    ?x  ?y  ?z  How      SPARQL Prov.
μ5  d   b   c   c1*c3    c1*c3
μ1  d   b   -   No Info  c1*A(μ1, μ3)
μ2  f   g   -   c2       c2

In general,

A(μ1, μ3) =
    0, if μ1 ~ μ3 and c3 ≠ 0
    1, else

0: the neutral element for +

1: the neutral element for *
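A small Python sketch may help to see how A behaves under different concrete interpretations. It covers only the single pair (μ1, μ3) of this slide; passing the neutral elements `zero` and `one` as parameters, and all function names, are assumptions of this example rather than part of the proposed model.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def A(m1, m2, c2, zero, one):
    """Return zero if m1 has a compatible partner m2 whose annotation c2 is
    not zero (the joined mapping is in the result instead), else one."""
    return zero if compatible(m1, m2) and c2 != zero else one


mu1 = {"?x": "d", "?y": "b"}
mu3 = {"?y": "b", "?z": "c"}


def prov_d_b_unbound(c1, c3, zero, one, times):
    """Provenance c1 * A(mu1, mu3) of the result mapping (d, b, -)."""
    return times(c1, A(mu1, mu3, c3, zero, one))


def bool_times(a, b):
    return a and b


def int_times(a, b):
    return a * b


# Boolean trust: zero is False (untrusted), one is True (trusted).
trusted_if_mu3_trusted = prov_d_b_unbound(True, True, False, True, bool_times)
trusted_if_mu3_untrusted = prov_d_b_unbound(True, False, False, True, bool_times)

# Multiplicities: zero is 0, one is 1.
count_if_mu3_absent = prov_d_b_unbound(2, 0, 0, 1, int_times)
count_if_mu3_present = prov_d_b_unbound(2, 3, 0, 1, int_times)
print(trusted_if_mu3_trusted, trusted_if_mu3_untrusted)
print(count_if_mu3_absent, count_if_mu3_present)
```

When A evaluates to 1 the result is exactly the positive part c1, which matches the remark above about the positive provenance part.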

Page 21: On Provenance of Queries on  Linked Web Data

SPARQL Provenance Operators

Two types of operators

on provenance tokens, i.e. + and * (for SPARQL+)

on mappings, i.e. A (for ⟕ and \)

Good news:

Every triple of the dataset is uniquely annotated.

Why not use annotations as mapping identifiers in A?

Due to the projection operator…

Page 22: On Provenance of Queries on  Linked Web Data

Enrich Tokens with Schema Information

Use tokens (c1, c2…) as mapping ids in A expressions

A(c1, c2) =
    0, if μ1 ~ μ2 and c2 ≠ 0
    1, else

But μ1 ≁ μ2 might hold, while π?y,?z (μ1) ~ π?y,?z (μ2)

Ω
    ?x  ?y  ?z  Prov.
μ1  a   b   c   (c1, {?x, ?y, ?z})
μ2  d   b   -   (c2, {?x, ?y, ?z})

π?y,?z (Ω)
?y  ?z  Prov.
b   c   (c1, {?y, ?z})
b   -   (c2, {?y, ?z})

Tokens don’t suffice, keep token-schema pairs

A( (c1, S1), (c2, S2) ) =
    0, if πS1 (μ1) ~ πS2 (μ2) and c2 ≠ 0
    1, else
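The effect of the schema component can be checked with a short Python sketch. The lookup table from tokens to source mappings, the concrete token values, and the function names are assumptions made for this illustration.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def project(mapping, schema):
    """Restrict a mapping to the variables of a schema (unbound ones stay absent)."""
    return {v: mapping[v] for v in schema if v in mapping}


def A(pair1, pair2, mappings, values, zero=0, one=1):
    """Compatibility operator on (token, schema) pairs."""
    (t1, s1), (t2, s2) = pair1, pair2
    m1 = project(mappings[t1], s1)
    m2 = project(mappings[t2], s2)
    return zero if compatible(m1, m2) and values[t2] != zero else one


# Source mappings identified by their tokens; ?z is unbound in the second one.
mappings = {
    "c1": {"?x": "a", "?y": "b", "?z": "c"},
    "c2": {"?x": "d", "?y": "b"},
}
values = {"c1": 1, "c2": 1}  # hypothetical concrete values for the tokens

full = ("?x", "?y", "?z")
projected = ("?y", "?z")

a_full = A(("c1", full), ("c2", full), mappings, values)
a_projected = A(("c1", projected), ("c2", projected), mappings, values)
print(a_full, a_projected)
```

With the full schema the two mappings disagree on ?x, so A is 1; after projecting on ?y, ?z they become compatible and A is 0, which is why the token alone is not enough.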

Page 23: On Provenance of Queries on  Linked Web Data

Towards a SPARQL Provenance Model

Define an algebra on token-schema pairs

3 operations

2 for SPARQL operators

1 for compatibility

What if there is no projection (or projection is not allowed to be pushed down) ?

annotations suffice (no need for schema information),

still in need of the compatibility operator

What if there is no Optional ?

previous models suffice, e.g. How

Page 24: On Provenance of Queries on  Linked Web Data

Future Work

SPARQL Provenance Model

Extend model expressiveness to capture other computations on Linked Data

Logic explanations

Implementation

Page 25: On Provenance of Queries on  Linked Web Data

Questions ?