Circuits for Datalog Provenance
Daniel Deutch, Tel Aviv Univ.
Tova Milo, Tel Aviv Univ.
Sudeepa Roy, Univ. of Washington
Val Tannen, Univ. of Pennsylvania
“Boolean Provenance/Lineage” as a Boolean formula: Q is true on D ⇔ F_{Q,D} is true
Poly-size, poly-time computable (data complexity)
But Q is an RA+ query
This talk: What if Q is a Datalog program?
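A Simple Example of Data Provenance
Database D (annotation variable of each tuple in brackets):
AsthmaPatient: Ann [x1], Bob [x2]
Friend: (Ann, Joe) [y1], (Ann, Tom) [y2], (Bob, Tom) [y3]
Smoker: Joe [z1], Tom [z2]
Boolean query Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y)
F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2)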
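As a rough illustration of how the formula above arises, the following Python sketch joins the three example relations and collects one conjunct per satisfying pair (x, y). The dictionary encoding and the helper names `lineage` and `evaluate` are assumptions made for this sketch; they are not from the talk.

```python
# Hypothetical encoding of the example database D:
# each tuple is mapped to its annotation variable.
asthma = {("Ann",): "x1", ("Bob",): "x2"}
friend = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
smoker = {("Joe",): "z1", ("Tom",): "z2"}


def lineage():
    """Return F_{Q,D} in DNF: a list of conjuncts, each a tuple of variables."""
    clauses = []
    for (x, y), friend_var in friend.items():
        # Join: AsthmaPatient(x) AND Friend(x, y) AND Smoker(y)
        if (x,) in asthma and (y,) in smoker:
            clauses.append((asthma[(x,)], friend_var, smoker[(y,)]))
    return clauses


def evaluate(clauses, assignment):
    """Evaluate the DNF under a True/False assignment to the variables."""
    return any(all(assignment[v] for v in clause) for clause in clauses)


F = lineage()
```

Each conjunct corresponds to one way of satisfying Q, and the disjunction over conjuncts corresponds to the existential quantifiers.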
Motivation
Provenance
– Reliability and repeatability
– View management and deletion propagation
– Trust and security management
– Query answering in probabilistic databases, …
Datalog
– Datalog is popular again! (two keynotes at this ICDT/EDBT)
– Data extraction on the Web, declarative networking
– Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)
Finding a suitable “Provenance for Datalog” is important
– Both from theoretical and practical viewpoints
How do we compute, store, and interpret provenance for datalog programs efficiently and effectively?
Overview of Our Results
Can we get poly-size Boolean formulas for datalog provenance?
– No, even if we allow unbounded time
Do we have a solution?
– Yes! Use Boolean circuits!
What about general “provenance semirings” beyond Boolean provenance? [Green et al. ’07]
– It depends on the semiring
Outline
– Background
– Circuits for Boolean Provenance
– Circuits for General Provenance Semirings
Datalog
Datalog program for Transitive Closure and Single-source Reachability:
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB (Extensional Database, base) relation for edges: R
IDB (Intensional Database, derived) relations
─ Transitive closure (T)
─ Single-source reachability from vertex ‘a’ (S)
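To make the meaning of the program concrete, here is a minimal Python sketch of naive bottom-up evaluation for these three rules. It only computes the derived tuples (no provenance yet); the function names and the two-edge input relation are assumptions chosen for illustration.

```python
def transitive_closure(R):
    """Naive bottom-up evaluation of
       T(x, y) :- R(x, y)
       T(x, y) :- R(x, z), T(z, y)
    over a set R of edge pairs."""
    T = set(R)  # first rule
    while True:
        # second rule: join R(x, z) with the current T(z, y)
        new = {(x, y) for (x, z) in R for (z2, y) in T if z == z2} - T
        if not new:
            return T  # fixpoint reached
        T |= new


def reachable_from(T, source):
    """S(x) :- T(source, x)."""
    return {y for (x, y) in T if x == source}


# Hypothetical EDB relation with edges (a, a) and (a, b).
R = {("a", "a"), ("a", "b")}
T = transitive_closure(R)
S = reachable_from(T, "a")
```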
Boolean Provenance PosBool(X)-Database
Tuples are annotated with variables from a set X– Here X = {x1, x2, y1, y2, ….}
For n tuples annotated from X, there are 2^n possible worlds, given by assignments ν : X → {True, False}
Useful in query evaluation on incomplete or probabilistic databases
PosBool(X)-database D (annotation variable of each tuple in brackets):
AsthmaPatient: Ann [x1], Bob [x2]
Friend: (Ann, Joe) [y1], (Ann, Tom) [y2], (Bob, Tom) [y3]
Smoker: Joe [z1], Tom [z2]
RA+ over PosBool(X)-Database
Annotation propagates from input to output
– Join = ∧, Projection/Union = ∨
Output tuples are annotated by monotone Boolean formulas
– F_{Q,D} is the annotation of the unique output tuple
PosBool(X)-database D: AsthmaPatient, Friend, and Smoker, annotated with x1, x2, y1, y2, y3, z1, z2 as before
RA+ Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y)
F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2)
Two Important Properties: RA+ over PosBool(X)-Database
For every RA+ query Q, PosBool(X)-database D, and assignment ν:
1. (Faithful representation) Q(ν(D)) = ν[Q(D)]
2. (Poly-size overhead) The size of F_{Q,D} is polynomial in |D|, and F_{Q,D} can be computed in poly-time.
RA+ Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y)
F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2)
[Figure: PosBool(X)-database D with one True/False assignment to x1, x2, y1, y2, y3, z1, z2; under this assignment both the formula F_{Q,D} and the query Q on the resulting database evaluate to False.]
Datalog over PosBool(X) Database
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
R: (a, a) [p], (a, b) [q]
Semantics using derivation trees (Green et al. 2007)
Annotation of T(a, b) = ⋁_{derivation trees τ of T(a, b)} ⋀_{leaves t of τ} Annot(t)
= (q) ∨ (p ∧ q) ∨ (p ∧ p ∧ q) ∨ …
= q
• Infinitely many trees
• But the annotation always has a finite equivalent form
• But not necessarily poly-size
[Figure: derivation trees of T(a, b). The smallest has the single leaf R(a, b); each larger tree has root T(a, b) with children R(a, a) and a smaller derivation tree of T(a, b).]
Lower Bound: Boolean formulas for Datalog Provenance on PosBool(X)
Theorem: Given a PosBool(X)-database D and a datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented using Boolean formulas of size polynomial in |D|
Proof outline:
• st-connectivity on n nodes requires monotone Boolean formulas of size n^{Ω(log n)} (Karchmer-Wigderson, 1988)
• Faithful representation requires: for all True/False assignments ν to X, P(ν(D)) = ν[P(D)]
• Reduce to the hard instance with the right ν when P = transitive closure
Solution: Boolean Circuit!
Outline
– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings
Boolean Circuits
A circuit is a DAG
– reuses common subexpressions
– a Boolean formula is a tree
Leaf nodes:
– EDB vars in X
Internal nodes:
– ∧: IDB/EDB vars used in one derivation
– ∨: alternative derivations
Roots:
– IDB vars
Example:
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
R: (a, a) [p], (a, b) [q]
[Figure: circuit for X_{T(a,b)}, built from the leaves X_{R(a,b)} = q and X_{R(a,a)} = p.]
Upper Bound: Boolean Circuits for PosBool(X)
Theorem: Given any PosBool(X)-database D and datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and these circuits can be computed in poly-time)
Proof Sketch
Two key ideas from previous work:
1. Datalog provenance can be represented by a system of equations, obtained by instantiating the variables of the datalog program P to EDB/IDB tuples [Green et al. 2007]
• EDB tuples → constants, IDB tuples → variables
• Iteratively solve this system of equations
• Fixpoint = provenance for all IDB tuples
2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]
• N = #IDB tuples
• Build a circuit with N+1 layers from the system of equations
Illustration
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
R: (a, a) [p], (a, b) [q]
Step 1: Build the system of equations from all possible instantiations x, y, z ∈ {a, b} (the EDB annotations p, q are constants; the X_{…} are variables):
X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)})
X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)})
X_{S(a)} = X_{T(a,a)}
X_{S(b)} = X_{T(a,b)}
Step 2: Build a circuit with 4 + 1 layers (N = 4) …
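The two steps of this illustration can be sketched in Python as follows. This is only one possible reading of the construction: the encoding of equations as lists of alternatives, the gate identifiers, and the helper names `build_circuit` and `evaluate` are assumptions for the sketch, and the N+1 layers are placed on top of a level 0 in which every IDB variable is the constant false.

```python
# System of equations for the running example: each IDB variable maps to
# a list of alternative derivations (OR), each a list of operands (AND).
equations = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}


def build_circuit(equations):
    """Unroll the equations into N+1 layers above level 0 (N = #IDB variables).
    Returns (gates, roots); gates maps a gate id to (kind, argument)."""
    gates = {}
    for v in equations:
        gates[(v, 0)] = ("const", False)  # leaf IDB variables are false
    layers = len(equations) + 1
    for i in range(1, layers + 1):
        for v, alternatives in equations.items():
            and_ids = []
            for j, operands in enumerate(alternatives):
                inputs = []
                for o in operands:
                    if o in equations:  # IDB variable: use previous layer
                        inputs.append((o, i - 1))
                    else:  # EDB variable: shared input leaf
                        gates.setdefault(("edb", o), ("var", o))
                        inputs.append(("edb", o))
                gates[("and", v, i, j)] = ("and", inputs)
                and_ids.append(("and", v, i, j))
            gates[(v, i)] = ("or", and_ids)
    roots = {v: (v, layers) for v in equations}  # one root per IDB variable
    return gates, roots


def evaluate(gates, gate_id, assignment, memo=None):
    """Evaluate a gate under a True/False assignment to the EDB variables."""
    memo = {} if memo is None else memo
    if gate_id not in memo:
        kind, arg = gates[gate_id]
        if kind == "const":
            memo[gate_id] = arg
        elif kind == "var":
            memo[gate_id] = assignment[arg]
        else:
            values = [evaluate(gates, g, assignment, memo) for g in arg]
            memo[gate_id] = all(values) if kind == "and" else any(values)
    return memo[gate_id]


gates, roots = build_circuit(equations)
```

The number of gates grows as (number of layers) × (size of the system of equations), which is where the polynomial size bound comes from in this sketch.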
Illustration
X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)})
X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)})
X_{S(a)} = X_{T(a,a)}
X_{S(b)} = X_{T(a,b)}
[Figure: layered circuit over the EDB leaves p and q. Level 0 holds the leaf IDB variables X_{T(a,a),0}, X_{T(a,b),0}, X_{S(a),0}, X_{S(b),0}, each set to false; Level 1 holds X_{T(a,a),1}, X_{T(a,b),1}, X_{S(a),1}, X_{S(b),1}; Level 2 holds X_{T(a,a),2}, X_{T(a,b),2}, X_{S(a),2}, X_{S(b),2}; and so on.]
• Assign leaf IDB vars to false
• Multiple roots for multiple IDB vars
Optimizations
1. Store only two levels of the circuit instead of N+1 levels
– Evaluate iteratively
2. Embed circuit construction in semi-naïve evaluation
– Check for new derivations, not only new IDB variables
– Sound and complete
3. Remove self-dependency of IDB vars
– Works for PosBool(X) and also some other semirings…
Running example:
X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)})
X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)})
X_{S(a)} = X_{T(a,a)}
X_{S(b)} = X_{T(a,b)}
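A minimal sketch of the first optimization, assuming the same list-of-alternatives encoding of the equations as a Python dictionary (an assumption of this sketch, not the talk's data structure): only a bottom and a top level of values are kept, and the top level is recomputed at most N+1 times.

```python
# Running example: IDB variable -> alternatives (OR) of operand lists (AND).
equations = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}


def solve_two_levels(equations, assignment):
    """Iterate a two-level evaluation instead of storing N+1 levels."""
    bottom = {v: False for v in equations}  # leaf IDB values start as false
    for _ in range(len(equations) + 1):  # at most N+1 iterations
        top = {
            v: any(
                all(bottom[o] if o in equations else assignment[o] for o in ops)
                for ops in alternatives
            )
            for v, alternatives in equations.items()
        }
        if top == bottom:  # fixpoint reached early
            break
        bottom = top  # the top level becomes the next bottom level
    return bottom


result = solve_two_levels(equations, {"p": True, "q": False})
```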
Illustration (From here…)
[Figure: the layered circuit of the previous illustration: EDB leaves p and q, leaf IDB variables at level 0 set to false, then Level 1, Level 2, … up to the roots.]
Illustration (…To here)
With all these optimizations
[Figure: two-level circuit over the EDB leaves p and q. Bottom level: X_{T(a,a),bottom}, X_{T(a,b),bottom}, X_{S(a),bottom}. Top level: X_{T(a,a),top}, X_{T(a,b),top}, X_{S(a),top}.]
Applications of PosBool(X)-Circuits
Linear-time deletion propagation (in circuit size)
Approximation for probabilistic databases
– even when only the circuit (and not the database) is available
Circuits can be computed “offline”
– Only a linear-time evaluation is required when needed (e.g. for deletion propagation), compared to storing and iteratively solving a system of equations, or re-evaluating the datalog program
Can use existing techniques for efficient and parallel circuit evaluation
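Deletion propagation on a stored circuit can be sketched as a single pass over the gates. The circuit below is a hypothetical one for the earlier AsthmaPatient/Friend/Smoker example, written by hand with the input z2 shared between two derivations; the list-of-gates encoding and the function name `propagate_deletion` are assumptions of this sketch.

```python
# Gates in topological order (inputs before the gates that use them).
# Hypothetical circuit equivalent to
# (x1 AND y1 AND z1) OR (x1 AND y2 AND z2) OR (x2 AND y3 AND z2).
circuit = [
    ("x1", "input", []), ("x2", "input", []),
    ("y1", "input", []), ("y2", "input", []), ("y3", "input", []),
    ("z1", "input", []), ("z2", "input", []),
    ("g1", "and", ["x1", "y1", "z1"]),
    ("g2", "and", ["x1", "y2"]),
    ("g3", "and", ["x2", "y3"]),
    ("g4", "or", ["g2", "g3"]),
    ("g5", "and", ["g4", "z2"]),  # z2 is used once for both derivations
    ("out", "or", ["g1", "g5"]),
]


def propagate_deletion(circuit, output, deleted):
    """Set deleted input tuples to False and evaluate every gate once,
    so the cost is linear in the size of the circuit."""
    value = {}
    for name, kind, inputs in circuit:
        if kind == "input":
            value[name] = name not in deleted
        elif kind == "and":
            value[name] = all(value[i] for i in inputs)
        else:
            value[name] = any(value[i] for i in inputs)
    return value[output]


survives = propagate_deletion(circuit, "out", {"z2"})
```

The database itself is never consulted here: the circuit alone decides whether the output tuple survives the deletion.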
Outline
– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings
Commutative Semirings
(K, +_K, ·_K, 0_K, 1_K)
– domain K
– +_K, ·_K: associative, commutative, with neutral elements 0_K and 1_K respectively
– ·_K distributes over +_K, i.e. a ·_K (b +_K c) = a ·_K b +_K a ·_K c
– 0_K annihilates every element of K, i.e. a ·_K 0_K = 0_K ·_K a = 0_K
Examples:
– (B, ∨, ∧, False, True): set semantics
– (N, +, ·, 0, 1): bag semantics
– (N ∪ {∞}, min, +, ∞, 0): tropical semiring, used to compute cost (e.g. cost of a shortest path)
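The three example semirings can be written down directly. The sketch below uses a small hypothetical `Semiring` record and a helper `sum_of_products` (both names are assumptions for illustration) to evaluate the same sum-of-products shape under each semantics.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)


def sum_of_products(K, monomials):
    """Evaluate a sum (over alternatives) of products (over joint uses) in K."""
    total = K.zero
    for factors in monomials:
        product = K.one
        for f in factors:
            product = K.times(product, f)
        total = K.plus(total, product)
    return total


bag_count = sum_of_products(COUNTING, [[2, 3], [4]])      # 2*3 + 4
cheapest = sum_of_products(TROPICAL, [[2, 3], [4]])       # min(2+3, 4)
present = sum_of_products(BOOLEAN, [[True, False], [True]])
```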
Provenance Semirings
Generalization of PosBool(X)
(K, +_K, ·_K, 0_K, 1_K)
– Tuples are annotated with variables from X
– K is of the form Prov(X)
– +_K denotes alternative usage
– ·_K denotes joint usage
Examples:
– (PosBool(X), ∨, ∧, False, True)
– Lin(X): + and · both act as set union on sets of variables; tracks contributing tuples [Cui et al. ’00]
– (Why(X), ∪, ⋓, ∅, {∅}), where ⋓ is pairwise union of subsets; tracks contributing tuples in alternative derivations [Buneman et al. ’01]
Provenance Specialization
Key property needed for applications like deletion propagation, trust management, cost computation, …
Prov(X) specializes correctly to K if any valuation v : X → K extends uniquely to a homomorphism h_v : Prov(X) → K (which correctly maps +, · of Prov(X) to those of K)
Further, some provenance semirings are “more informative” than others
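Provenance Semiring Hierarchy
[Figure: hierarchy of semirings, ordered from more informative (top) to less informative (bottom), with arrows meaning “specializes correctly”. Nodes: N[X], Why(X), Lin(X), PosBool(X), Sorp(X) (defined later), Tropical, N (bag), Security, Boolean (set).]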
Datalog Provenance for General Semirings
PosBool(X): ⋁_{trees τ} ⋀_{leaves t of τ} Annot(t)
General Prov(X): Σ_{trees τ} Π_{leaves t of τ} Annot(t), with the sum taken in +_K and the product in ·_K
• Infinite sums should be well-defined
• Need to consider “ω-continuous semirings” and “ω-continuous homomorphisms”
Provenance Semiring Hierarchy
[Figure: the same hierarchy (N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set)), with the annotations below.]
– Finite, so ω-continuous
– Need to add ∞: N^∞[[X]] and N^∞
– N^∞[[X]]: most informative provenance semiring [Green et al. ’07]
How good is N^∞[[X]] w.r.t. Size of Datalog Provenance?
Poly-size overhead is not valid because of the infinite sum
But can outputs have finite annotations (with X, ·, +) that specialize correctly to semirings with finite domains?
– Finite annotations won’t specialize correctly to Why(X)
Theorem: It is not possible to annotate the output of datalog programs following N^∞[[X]]-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X)
Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X)
─ Need more levels in the circuit built from the system of equations
─ Need a different argument for correctness
Can we still have a good general semiring w.r.t. size?
1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits
We propose Sorp(X)
– Most general absorptive semiring: a + a·b = a
– N[X], but keep only the monomials that are not “absorbed” by the others
e.g. pq + p²q³ = pq, while p²q + pq² stays p²q + pq²
The same algorithm, proof, and optimizations for constructing poly-size circuits hold
– Circuits are more general than Boolean circuits
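The absorption rule a + a·b = a can be illustrated on polynomials represented as lists of monomials. In the sketch below a monomial is a dictionary from variable to exponent, and a monomial is dropped when another monomial divides it; this representation and the helper names `divides` and `absorb` are assumptions for illustration, not the talk's implementation.

```python
def divides(m, n):
    """True if monomial m divides monomial n (every exponent in m fits in n)."""
    return all(n.get(var, 0) >= exp for var, exp in m.items())


def absorb(monomials):
    """Keep only monomials not absorbed by another one (a + a*b = a).
    Of several equal monomials, only the first is kept."""
    kept = []
    for i, m in enumerate(monomials):
        absorbed = any(
            divides(n, m) and (n != m or j < i)
            for j, n in enumerate(monomials)
            if j != i
        )
        if not absorbed:
            kept.append(m)
    return kept


# pq + p^2 q^3 reduces to pq
first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])
# p^2 q + p q^2 is unchanged: neither monomial divides the other
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
```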
Provenance Semiring Hierarchy
[Figure: the hierarchy once more, with nodes N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set).]
Related Work
Data Provenance
– e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]
Circuits
– Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)
Provenance for Datalog
– System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07]
– Poly-size c-tables with Boolean formulas for datalog with contradictions [Abiteboul et al. 2014]
Conclusions
Circuits to represent and store Datalog Provenance
– for PosBool(X) and other semirings
– Semantics, Algorithms, Limitations, Applicability
– Preliminary experiments support our results: we compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating datalog from scratch
Future Work:
– A complete implementation, evaluation, new applications
Thank You
Questions?