24
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1

Queries with Difference on Probabilistic Databases

Embed Size (px)

DESCRIPTION

Queries with Difference on Probabilistic Databases. Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania. Probabilistic Databases. To model and query uncertain data (sensor networks, information extraction…) Possible worlds model - PowerPoint PPT Presentation

Citation preview

1

Queries with Difference on Probabilistic Databases

Sanjeev KhannaSudeepa RoyVal Tannen

University of Pennsylvania

2

Probabilistic Databases

• To model and query uncertain data (sensor networks, information extraction…)

• Possible worlds model– Each possible world W is a standard database

instance, has a probability P[W]– Compact representation D assuming independence

D

a1

a2

a3

a3

b1

b1

b2

b3

0.1

0.5

0.2

0.1

a1

a2

a3

0.3

0.4

0.6

b1

b2

b3

0.7

0.8

0.4

RS

T

3

Query Semantics

• Query Semantics on probabilistic databases:– Apply the query q on each possible world W– Add up the probabilities of the worlds that give

the same query answer A P[q(D) = A] = ∑W : q(W) = A P[W]

• Goal: Efficiently evaluate P[q(D) = A]– Data complexity; want time polynomial in n = |D|

• Can we always efficiently compute P[q(D)]?– NO, in general it is #P-hard

4

b1

b2

b3

u1

u2

u3

0.7

0.8

0.4

b1

b2

b3

0.7

0.8

0.4

a1

a2

a3

a3

b1

b1

b2

b3

v1

v2

v3

v4

0.1

0.5

0.2

0.1

a1

a2

a3

a3

b1

b1

b2

b3

0.1

0.5

0.2

0.1

a1

a2

a3

w1

w2

w3

0.3

0.4

0.6

a1

a2

a3

0.3

0.4

0.6

Introduce event variables for tuples (P[w1] = 0.3, …)

Step 1: Boolean provenance for q(D) [FR ’97, Z ’97] f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

Step 2: Compute P[q(D)] = P[f]given P[w1] = 0.3, P[v1] = 0.4, …

Probability

Event variables

Boolean query q():-R(x),S(x, y),T(y)

easy

hard

Query Answering in Two Steps

DR

ST

5

Probability Computation for Positive Queries

• Dichotomy Result [DS ’04, ’07; DSS ’10] Given q as input, we can efficiently decide if q is

– Safe: Safe plans run in poly-time on all instances, or,

– Unsafe: #P-hard, e.g. q() :- R(x) S(x, y) T(y)

• Instance-by-instance approach [SDG ’10, RPT ’11]– Both q and D are given as input – Poly-time algorithm to compute P[q(D)] for special

cases even if q is unsafe

What about queries with difference?

Boolean Provenances for Difference

c1

c1

c2

c3

a1

a2

a3

a2

v1

v2

v3

v4

a1

a2

a3

w1

w2

w3

R T

6

q1(x):- R(x, y), S(y, z)

b1

b2

b1

c1

c2

c3

u1

u2

u3

q2(x):- R(x, y), S(y, z), T(z)b1

b2

u1(v1 + v2) + u3v4

u2v3

b1

b2

u1v1w1 + u1v2w2 + u3v4w2

u2v3w3

b1

b2

(u1(v1 + v2) + u3v4) . (u1v1w1 + u1v2w2 + u3v4w2)

(u2v3) . (u2v3w3)

q = q1 – q2

S

7

Previous Work on Difference

FOR ’11– Framework for exact and approximate

probability computation – But, no guarantee of polynomial running time

In fact, we show in this paper that with difference,

in some cases no approximation exists (unless NP = RP)

How far can we go with difference in poly-time?

8

A Quick Comparison

With difference

• DNF of boolean provenance may be exponential in n

• P[q(D)] may not be approximable

Without difference

• DNF of boolean provenance is poly-size (n|q|)

• P[q(D)] is always approximable (FPRAS)

FPRAS: Fully Polynomial Randomized Approx. Scheme Compute with prob. ≥ ¾ in time polynomial in n, 1/ε

p [(1-ε) P[q(D)], (1+ε) P[q(D)]

9

Our Results

• We study queries of the form q1 – q2 and their generalization

– FPRAS: If q1 is any UCQ, q2 is any safe CQ-

– #P-hardness: Even if both q1 and q2 are safe CQ-

– Inapproximability: Even if q1 is the trivial TRUE query and q2 is a UCQ

• Our FPRAS result extends to a larger class of queries of which q1 – q2 is a special case

[CQ- : Conjunctive queries without self-joins]

10

Difference Rank• Define difference rank (q) of query q recursively

– (R) = 0

– (q1 - q2) = (q1) + (q2) + 1• R – S : rank 1

– (q1 ⋈ q2) = (q1) + (q2)

• (R – S1) ⋈ (R - S2) : rank 2

• (R - T1) ⋈ T2 : rank 1

– (q1 q2) = max ((q1), (q2))

• (R – S1) ⋈ (R - S2) (R - T1) ⋈ T2 : rank 2

– Select, project: rank remains the same

11

FPRAS for queries q with (q) = 1given some conditions hold

(inapproximable for (q) = 1 in general)

12

Steps in FPRAS

• Step 1: Compute boolean provenance of q[D] for any query q with (q) = 1

• Step 2: Write the boolean provenance in a “Probability Friendly Form” (if possible)

• Step 3: FPRAS inspired by Karp-Luby framework

13

Boolean Provenance for Queries q s.t. (q) = 1

Lemma:For any q with (q) = 1, on any D, the provenance

f of q(D) has form

f is poly-size in n = |D|, poly-time computable

...))(()).(( 4321 DNFDNFDNFDNFf

14

Probability Friendly Form (PFF)

If f is in PFF, we can approximate P[f] using Karp-Luby Framework

...))(()).(( 4321 DNFDNFDNFDNFf

...))(()).(( 42 dDNNFbcddDNNFabcf

f is in PFF, if the negated DNF-s can be written in poly-size d-DNNFs (next slide)

15

d-DNNFDarwiche ’01, ’02, DM ’02deterministic - Decomposable Negation Normal

Form

No internal node can have negation

At most one child of a +-node is satisfiable

Children of a .-node do not share variables

+

+1v

1u

1v

2v 2v

In general, can be a DAG

Probability can be computed in linear time

...))(()).(( 42 dDNNFbcddDNNFabcf

16

Karp-Luby Framework

[KL ’83] Given boolean expression DAGs F1, …, Fm

f = F1 + F2 + ... + Fm

P[f] can be computed in poly-time (in m, n)

if in poly-time, i(1) P[Fi] can be computed

(2) it can be checked if a given assignment satisfies Fi

(3) a random satisfying assignment of Fi can be sampled

Well-studied special case: DNF counting, where F1, …, Fm are DNF minterms: f = xyz + xyw + wuv

17

Conditions (1) and (2) hold for PFF

Product of minterm and d-DNNF is another d-DNNF

w2=1, z1=1

+

+1v

1v

2v 2v

121 zwu+

+1v

1u

1v

2v 2v

...))(()).(( 42 dDNNFbcddDNNFabcf

18

Condition (3) also holdsLemma: Generating a random satisfying assignment on a d-DNNF can be done in poly-time

+

+1v

1v

2v 2v

1. Process in reverse topological order

2. Generate a random satisfying assignment bottom up

v2 = 1 v2 = 0

v1 = 0

v1 = 1v2 = 0

v1 = 0, v2 = 0

v1 = 1, v2 = 0

At random

19

Expressibility in PFF

So, if f is in PFF, we can approximate P[q(D)]

But, can we decide in poly-time if some sub-expressions of a boolean expression have poly-size d-DNNFs?

• Not known • But, there are natural sufficient conditions that can be

verified in poly-time

– If certain sub-queries are safe and hence generate read-once expressions [OH ’08]

– If sub-queries generate poly-size OBDDs [JS ’11]

– Extends to instance-by-instance approach (both q, D given)

20

#P-hardness for q1 - q2

both q1, q2 are safe CQ-

21

#P-hardness: Steps in the proof

“Hard” query q = q1 – q2

– q1() := R1(x, y1) R2(x, y2) R3(x, y3) R4(x, y4)

– q2() := R1(x1, y) R2(x2, y) R3(x3, y) R4(x4, y)

Counting independent sets in 3-regular bipartite graphs (XZ ’06)

Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings

22

Other Related Work

– Semantics of probabilistic query answering• Fuhr-Rollecke ’97, Zimanyi ‘97

– Dichotomy of CQ- ,CQ and UCQ queries• Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10

– Knowledge compilation techniques• Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11

– Instance-by-instance approach• Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11

23

Conclusions and Future work

A step towards understanding complexity of exact and approximate computation for queries withdifference operations

Future work– Dichotomy results that classify syntactically difference

queries (similar to positive UCQ)?

– Extending FPRAS to queries with difference rank > 1?

– Experimental evaluation of our algorithms

24

Thank you

Questions?