Source: homes.cs.washington.edu/~suciu/ssd-next.pdf

Semistructured Data: The Next Step

Dan Suciu

2

Thesis

Back then: semistructured data addressed imprecisions in structure, e.g. data integration, “new data”

Novel paradigm based on semi-structure

Today: many more kinds of imprecision are found in data and schema, e.g. data integration, “new data”

Novel paradigm based on probabilities

3

Outline

Semistructured data: past and present

Probabilistic databases: the new paradigm

Supporting technologies and theoretical roots

Towards a theory of probabilistic queries

Conclusions

4

Origins 1/2

Motivation: imprecisions in the structure

data integration [Tsimmis’94-98]

handling “new data” [Lorel’97], [UnQL’95]

Solution: relax the structure

5

Origins 2/2

Enabling technologies:

OO databases

non first normal form data

querying SGML files

text databases

Theoretical roots: trees, regular expressions

6

Today

Semistructured data in industry:

XML support in RDBMS

XML engines

XML for data exchange

stream XML processing

Semistructured data in research:

often largest session in SIGMOD/VLDB/PODS

dedicated symposiums, conferences, workshops

2nd Dagstuhl workshop!

7

Today: Imprecisions Everywhere

misspellings

different terminologies

object matching

schema matching

measurement errors

constraint violations

None is addressed by semistructured data

8

Handling Imprecisions Today

several tools:

edit distance, q-gram distance, data cleaning, ontologies (WordNet), record linkage, schema matching, repairs, ...

always outside the database system

Should be done inside DBMS

9

Summary of Past and Present

imprecisions in data structure:

lead to semistructured data model

many more imprecisions remain today

require new data model

10

Outline




Towards a theory of probabilistic queries

Conclusions

11

Probabilistic Database

Definition: a prob DB = a database where each tuple has a probability

will make this more precise later

first, let’s see examples

12

Uncertain Predicates 1/2

SELECT DISTINCT A.name
FROM Actor A, Casts C, Film F
WHERE C.filmID = F.filmID and C.actorID = A.actorID
  and A.name ≈ ‘Kevin’ and F.year ≈ 1995 and F.rating ≈ ‘high’

misspellings, data translation, ontologies

Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v), x ≈ ‘Kevin’, u ≈ 1995, v ≈ ‘high’

13

Uncertain Predicates 2/2

Assign a probabilistic score to each tuple

↓

probabilistic database

Film (Film.rating ≈ ‘high’)

filmID  name  rating  prob. score
f1      ...   9.2     0.95
f2      ...   7.7     0.65
m3      ...   5.3     0.1
m4      ...   8.1     0.7

Actor (Actor.name ≈ ‘Kevin’)

actorID  name       prob. score
a1       Kevin B.   1.0
a2       R. Keven   0.9
a3       Calvin C.  0.8
a4       Collins    0.2

Now evaluate Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v) on the probabilistic database

14

Inexact Schema Matches 1/2

S1(customerName, officeAddress, homeAddress)

S2(name, address)

Which S1 attribute does S2.address match?

officeAddress: prob=0.8

homeAddress: prob=0.2

15

Inexact Schema Matches 2/2

Query translation:

select *
from S2
where S2.name ≈ ‘Kevin’ and S2.address = ‘Seattle’

“Run” this on S1:

customerName  officeAddress  homeAddress  prob. score
Kevin         Spokane        Seattle      0.2 * 1.00
Keven         Portland       Bellevue     0.0 * 0.95
Calvin        Seattle        Redmond      0.8 * 0.37
Collins       Ballard        Seattle      0.2 * 0.03

Probabilities allow us to consider both mappings
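A minimal Python sketch of how the scores in this table could be computed: for each S1 row, the probability of the mappings under which the address predicate holds is multiplied by the name-similarity score. The similarity values are copied from the table; the names `MAPPING_PROB`, `S1` and `score` are hypothetical, and treating the two mappings as mutually exclusive alternatives is an assumption.

```python
# Mapping probabilities for S2.address (values from the schema-matching slide).
MAPPING_PROB = {"officeAddress": 0.8, "homeAddress": 0.2}

# Rows of S1, plus the similarity score of customerName to 'Kevin'
# (similarity values copied from the table; how they are obtained is not shown here).
S1 = [
    {"customerName": "Kevin", "officeAddress": "Spokane", "homeAddress": "Seattle", "name_sim": 1.00},
    {"customerName": "Keven", "officeAddress": "Portland", "homeAddress": "Bellevue", "name_sim": 0.95},
    {"customerName": "Calvin", "officeAddress": "Seattle", "homeAddress": "Redmond", "name_sim": 0.37},
    {"customerName": "Collins", "officeAddress": "Ballard", "homeAddress": "Seattle", "name_sim": 0.03},
]

def score(row, city):
    """Probability that the translated query selects this row.

    Assumption: the mappings are mutually exclusive, so the probabilities
    of all mappings that satisfy the address predicate are added up.
    """
    address_prob = sum(p for attr, p in MAPPING_PROB.items() if row[attr] == city)
    return address_prob * row["name_sim"]

# Rank the rows by score, highest first.
ranked = sorted(((score(r, "Seattle"), r["customerName"]) for r in S1), reverse=True)
for s, name in ranked:
    print(f"{name:8s} {s:.3f}")
```

Under these assumptions Calvin ranks first, because the more likely mapping (officeAddress) outweighs the weaker name match.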

16

Handling Inconsistencies 1/2

Data Source V1

Name       City     Email
Larry Big  Seattle  lb@com
. . .

Data Source V2

Name       City      Phone
Larry Big  Bellevue  . . .
. . .

Global constraint: each person, only one city

Q: Larry Big’s email ?

A: no answers, due to ‘repairs’ [Chomicki]

17

Handling Inconsistencies 2/2

Data Source V1

Name       City     Email   prob
Larry Big  Seattle  lb@com  p1
. . .

Data Source V2

Name       City      Phone  prob
Larry Big  Bellevue  . . .  p2
. . .

Global constraint: p1 + p2 ≤ 1

No other info? Then p1 = p2 = 0.5

Q: Larry Big’s email ?

A: lb@com, with probability p1 = 0.5

Probabilities allow us to use all the data

18

Formal Definition

Definition: a probabilistic database = a probability distribution on all instances

$\Pr : \mathit{Inst} \to [0,1], \quad \sum_{I \in \mathit{Inst}} \Pr[I] = 1$

Very powerful: any correlations between tuples

Notation: probabilistic database $I^p$ (written Ip in the text)

“Possible Worlds”

19

Query Semantics

Definition: given Ip, query Q, tuple t:

$\Pr[t \in Q] = \sum_{I \in \mathit{Inst} \,:\, t \in Q(I)} \Pr[I]$

Note: EVERY query has well defined semantics !

Notation: Q(Ip) = prob distribution on tuples

Return to user: ranked tuples, top k
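The definition can be sketched directly in Python by writing the distribution over possible worlds explicitly and summing the probabilities of the worlds in which a tuple is an answer. The relation name `Person`, the two-world distribution (borrowed from the Larry Big example, with illustrative probabilities 0.5 and 0.5) and all function names are assumptions made for this sketch.

```python
# A probabilistic database as an explicit distribution over possible worlds.
# Each world is a frozenset of (name, city) tuples of a hypothetical relation
# Person. The two worlds are mutually exclusive, which models the constraint
# "each person, only one city" (a correlation between tuples).
WORLDS = {
    frozenset({("Larry Big", "Seattle")}): 0.5,
    frozenset({("Larry Big", "Bellevue")}): 0.5,
}

def cities_of_larry(world):
    """An ordinary deterministic query: Q(c) :- Person('Larry Big', c)."""
    return {city for name, city in world if name == "Larry Big"}

def answer_distribution(worlds, query):
    """Pr[t in Q] = sum of Pr[I] over all worlds I with t in Q(I)."""
    result = {}
    for world, pr in worlds.items():
        for t in query(world):
            result[t] = result.get(t, 0.0) + pr
    return result

# The world probabilities must sum to 1 to form a valid distribution.
assert abs(sum(WORLDS.values()) - 1.0) < 1e-9

dist = answer_distribution(WORLDS, cities_of_larry)
# Return ranked tuples, top k (here k = 2).
top_k = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_k)
```

Because the semantics only needs an ordinary query to be run on each world, any query language is covered; the cost is that the number of worlds can be exponential in the number of tuples.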

20

Example

Film

filmID  title  year  prob
f1      t1     1995  p1
f2      t2     1982  p2
m3      t3     1982  p3
m4      t4     1971  p4

Actor

filmID  actorName  prob
f1      Kevin B.   q1
f1      R. Keven   q2
f1      Calvin C.  q3
f2      Collins    q4
f2      . . .      . . .

What are the “possible worlds” ?

How do we “evaluate” a query ?

21

Example: Possible Worlds

Ip:

filmID  title  year  prob
f1      T1     1995  p1
f2      T2     1982  p2
m3      T3     1982  p3
m4      T4     1971  p4

Possible worlds, each with its probability:

filmID  title  year
m3      T3     1982

probability: (1-p1)*(1-p2)*p3*(1-p4)

filmID  title  year
f1      T1     1995
f2      T2     1982
m4      T4     1971

probability: p1*p2*(1-p3)*p4

filmID  title  year
f1      T1     1995
f2      T2     1982
m3      T3     1982
m4      T4     1971

probability: p1*p2*p3*p4

. . .
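The enumeration of possible worlds for an independent-tuple table can be sketched as follows. The numeric probabilities stand in for p1..p4 and are illustrative assumptions, as are the names `FILM_P` and `possible_worlds`.

```python
from itertools import product

# The table Ip with illustrative numeric values standing in for p1..p4.
FILM_P = {
    ("f1", "T1", 1995): 0.9,  # p1
    ("f2", "T2", 1982): 0.5,  # p2
    ("m3", "T3", 1982): 0.4,  # p3
    ("m4", "T4", 1971): 0.7,  # p4
}

def possible_worlds(table):
    """Yield (world, probability) for every subset of an independent-tuple table."""
    tuples = list(table)
    for keep in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, k in zip(tuples, keep) if k)
        pr = 1.0
        for t, k in zip(tuples, keep):
            # A present tuple contributes p, an absent tuple contributes 1 - p.
            pr *= table[t] if k else 1.0 - table[t]
        yield world, pr

worlds = dict(possible_worlds(FILM_P))
print(len(worlds))                      # number of possible worlds: 2**4

only_m3 = frozenset({("m3", "T3", 1982)})
print(worlds[only_m3])                  # (1-p1)*(1-p2)*p3*(1-p4)
print(sum(worlds.values()))             # the probabilities sum to 1 (up to rounding)
```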

22

Example: Query Evaluation

Filmp

year  filmID  prob
1995  f1      p1
1982  f2      p2
1982  m3      p3

Castp

filmID  actorName  prob
f1      Kevin B.   q1
f1      R. Keven   q2
f1      Calvin C.  q3
f2      Collins    q4
f2      . . .      q5
m3      . . .      q6

Q(x) :- Film(x,y), Cast(y,z)

Prob(1995) = p1 * (1 - (1 - q1)*(1 - q2)*(1 - q3))

Prob(1982) = 1 - (1 - p2 * (1 - (1 - q4)*(1 - q5))) * (1 - p3 * (1 - (1 - q6)))
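As a sanity check, the closed-form probabilities for this query can be compared against a brute-force sum over all possible worlds. This is a sketch under stated assumptions: the numeric probabilities are illustrative, the actor names `actor5` and `actor6` are placeholders for the rows elided on the slide, and the function names are hypothetical.

```python
from itertools import product

# Illustrative values standing in for p1..p3 and q1..q6.
FILM = {(1995, "f1"): 0.9, (1982, "f2"): 0.5, (1982, "m3"): 0.4}
CAST = {("f1", "Kevin B."): 0.8, ("f1", "R. Keven"): 0.6, ("f1", "Calvin C."): 0.3,
        ("f2", "Collins"): 0.7, ("f2", "actor5"): 0.2, ("m3", "actor6"): 0.5}

def worlds(table):
    """Yield (set of present tuples, probability) for an independent-tuple table."""
    tuples = list(table)
    for keep in product([False, True], repeat=len(tuples)):
        pr = 1.0
        for t, k in zip(tuples, keep):
            pr *= table[t] if k else 1.0 - table[t]
        yield {t for t, k in zip(tuples, keep) if k}, pr

def prob_brute_force(year):
    """Possible-worlds semantics: sum Pr[I] over worlds where year is in Q(I)."""
    total = 0.0
    for film, pf in worlds(FILM):
        films_of_year = {fid for (y, fid) in film if y == year}
        for cast, pc in worlds(CAST):
            if any(fid in films_of_year for (fid, _) in cast):
                total += pf * pc
    return total

def prob_closed_form(year):
    """1 - prod over films of the year (1 - p * (1 - prod over its cast (1 - q)))."""
    no_film_qualifies = 1.0
    for (y, fid), p in FILM.items():
        if y != year:
            continue
        no_actor = 1.0
        for (cast_fid, _), q in CAST.items():
            if cast_fid == fid:
                no_actor *= 1.0 - q
        no_film_qualifies *= 1.0 - p * (1.0 - no_actor)
    return 1.0 - no_film_qualifies

for year in (1995, 1982):
    print(year, prob_closed_form(year), prob_brute_force(year))
```

The two computations agree here because every answer is derived from disjoint sets of independent tuples, so the products in the closed form are justified.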

23

Summary of the new Paradigm

System converts all imprecisions into prob db Ip

User asks standard query Q

System computes Q(Ip), ranks tuples, returns top k

Excellent for exploratory queries, not for business processes

Variations:

Expose probabilities: Trio [Stanford]

Hide probabilities: MystiQ [UW]

24

Outline





Conclusions

25

Technology: Top-k Ranking [Chaudhuri&Gravano’99]

Data is exact, but user is willing to accept inexact matches

SELECT *
FROM Houses
WHERE no_bdr = 4 and no_garages = 3 and price = 300k

Rich recent literature on evaluation techniques

26

Technology: Optimal Aggregation

Theorem [Fagin,Lotem,Naor’01,03] There are optimal algorithms for computing top k
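One algorithm from this line of work is the threshold algorithm (TA). The Python sketch below follows its usual textbook description and is not taken from the slides: it assumes every object appears in every score list, that the aggregation function is monotone, and that in-memory dictionaries stand in for sorted and random access. The house data is invented for illustration.

```python
import heapq

def threshold_algorithm(lists, k, agg=sum):
    """Return (top-k [(object, score)], number of rows read per list).

    lists: one dict per attribute, mapping object id -> score;
           every object is assumed to appear in every list.
    agg:   a monotone aggregation function over a list of scores.
    """
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(l.items(), key=lambda kv: kv[1], reverse=True) for l in lists]
    seen = {}                         # object id -> aggregate score
    n = len(sorted_lists[0])
    rows_read = 0
    while rows_read < n:
        last_scores = []
        for sl in sorted_lists:
            obj, s = sl[rows_read]    # sorted access to the next row
            last_scores.append(s)
            if obj not in seen:
                # Random access: fetch this object's score from every list.
                seen[obj] = agg([l[obj] for l in lists])
        rows_read += 1
        # No unseen object can score higher than the threshold.
        threshold = agg(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, rows_read
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), rows_read

# Hypothetical per-attribute match scores for four houses.
price = {"h1": 0.9, "h2": 0.8, "h3": 0.3, "h4": 0.1}
bedrooms = {"h1": 0.8, "h2": 0.9, "h3": 0.2, "h4": 0.4}
top, rows_read = threshold_algorithm([price, bedrooms], k=2)
print(top, rows_read)
```

In this example the loop stops after two rows per list, without reading the remaining rows, because the two best objects seen so far already reach the threshold.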

27

Technology: Ranking Query Results

Keyword searches: DBExplorer [Chaudhuri’02], DISCOVER [Hristidis’02], BANKS [Bhalotia’02]

Using ontologies: XXL [Theobald’02], TOSS [Hung’04]

Approximate SQL: [Agrawal’03], [Dalvi’04]

Data/queries are inexact

28

Technology: Object Matching

AKA: object fusion, record linkage, merge/purge, de-duplication, ...

Long history, rich literature

Problem: given n objects, find duplicates assuming typos, misspellings, ...

Theorem: AI-complete

[Fellegi&Sunter’69, Winkler’99]

29

Technology: Information Extraction

KnowItAll [Etzioni et al.’04]

Extractors defined per class, e.g. scientist

Collect large body of information by searching the Web

Albert Einstein    0.9262188000
Niels Bohr         0.9207664000
Enrico Fermi       0.9199148000
Marie Curie        0.9156590000
Max Planck         0.9126473000
Ernest Rutherford  0.9108679000
Michael Faraday    0.9105023000
Louis Pasteur      0.8986677000
Stephen Hawking    0.8986677000
Henry Cavendish    0.8981776000

Lots of useful probabilistic databases!

30

Theory: Random Graphs (1/2)

Erdős and Rényi

Independent tuple distribution: Pr[t] = p(n)

Given property Q, let $\mu[Q] = \lim_{n \to \infty} \Pr[Q]$

Theorem: every monotone Q has a “threshold function” t(n):

if p(n) << t(n) then µ[Q] = 0

if p(n) >> t(n) then µ[Q] = 1

$p(n) = C/n^2$

31

Theory: Random Graphs (2/2)

Erdős and Rényi

Program: find “threshold” functions p(n) for various properties Q

Important example: subgraph property, Q ⊆ G

Threshold is: $t(n) = 1/n^h$ (note: $\gg 1/n^2$), where $h = \min_H |H|/e(H)$ (inverse density)

32

Theory: 0/1 Laws

Shelah and Spencer: $\Pr[t] = 1/n^{\beta}$

0/1 law holds if β is irrational

Grove, Halpern, Koller: Pr[t] = 1/2

µ[Q|V] does not always exist, and it’s undecidable

Fagin’76: Pr[t] = 1/2

for any Q in FO, µ[Q] = 0 or µ[Q] = 1

$p(n) = C/n^2$

33

Theory: Prob DB’s (1/3) [Fuhr&Roellke]

Formalism for specifying a prob db

Independent events: e1, e2, ... with probabilities: p1, p2, ...

Intensional database:

Ip =

A  B  E
a  x  e1
a  y  e3∧e4
b  z  e4
b  u  e2 ∨ (¬e4∧e1)

Extensional database: all events are independent

34

Theory: Prob DB’s (2/3) [Fuhr&Roellke]

given E⊆{e1, e2, ...} define I = inst(Ip,E)

Pr[I] = ∑ {Pr[E] | I = inst(Ip,E)}

Possible worlds semantics

Theorem [Dalvi&S] Every probability distribution Pr[I] comes from an intensional db

Theorem Computing Pr[t ∈ Ip] is #P-complete

35

Theory: Prob DB’s (3/3)

A  B  E
a  x  e1
a  y  e3∧e4
b  z  e4
b  u  e2 ∨ (¬e4∧e1)

Theorem [Fuhr&Roellke] Closed under FO evaluation

∃y.actor(x,y):

A  E
a  e1 ∨ (e3∧e4)
b  e4 ∨ (e2 ∨ (¬e4∧e1))

Intensional databases: powerful but expensive
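The intensional formalism can be sketched in Python by storing one event expression per tuple and enumerating all truth assignments of the events. The event probabilities are illustrative assumptions and all names are hypothetical; the enumeration over 2^n assignments also shows why this approach is expensive.

```python
from itertools import product

# Independent events with illustrative (assumed) probabilities.
EVENT_PROB = {"e1": 0.5, "e2": 0.4, "e3": 0.9, "e4": 0.2}

# The intensional table: each tuple (A, B) carries an event expression,
# written as a function of a truth assignment w of the events.
IP = {
    ("a", "x"): lambda w: w["e1"],
    ("a", "y"): lambda w: w["e3"] and w["e4"],
    ("b", "z"): lambda w: w["e4"],
    ("b", "u"): lambda w: w["e2"] or (not w["e4"] and w["e1"]),
}

def assignments():
    """Yield (truth assignment, probability) for all 2^n event outcomes."""
    names = list(EVENT_PROB)
    for values in product([False, True], repeat=len(names)):
        w = dict(zip(names, values))
        pr = 1.0
        for e, v in w.items():
            pr *= EVENT_PROB[e] if v else 1.0 - EVENT_PROB[e]
        yield w, pr

def world_distribution():
    """Pr[I] = sum of Pr[E] over all event sets E with I = inst(Ip, E)."""
    dist = {}
    for w, pr in assignments():
        world = frozenset(t for t, expr in IP.items() if expr(w))
        dist[world] = dist.get(world, 0.0) + pr
    return dist

def prob_exists_y(a_value):
    """Probability that a_value is in the projection on A.

    The projected tuple's expression is the OR of the expressions of all
    tuples with that A value, e.g. e1 or (e3 and e4) for 'a'.
    """
    return sum(pr for w, pr in assignments()
               if any(expr(w) for (a, _), expr in IP.items() if a == a_value))

print(prob_exists_y("a"), prob_exists_y("b"))
print(len(world_distribution()))
```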

36

Summary of Supporting Work

Supporting technology

a lot exists

need to understand it in order to make correct assumptions

Theoretical roots

Old and solid

Do not apply out of the box

37

Outline





Conclusions

38

An Agenda for a Theory of Probabilistic Queries

Specification languages

Query evaluation: data complexity

Query evaluation using views

Randomized PTIME query approximation

Probabilistic XML

39

Specification Languages

Extensional databases

Intensional databases

Probabilistic Relational Model

Implicit probabilities

[Fefferman’99]

[Fuhr&Roellke’97]

[Erdős&Rényi, Shelah&Spencer]

Agenda item: Expressibility/complexity/conciseness tradeoffs

40

Query Evaluation (1/3)

Given: Q, Ip

Compute: Q(Ip)

Q(x) :- Film(x,y), Cast(y,z)

Film

year  filmID  prob
1982  f1      p2
1982  f2      p3

Cast

filmID  actorID  prob
f1      a1       q4
f1      a2       q5
f2      a2       q6

Prob(1982) = 1 - (1 - p2 * (1 - (1 - q4)*(1 - q5))) * (1 - p3 * (1 - (1 - q6)))

Data complexity of Q is in PTIME

Variation: return top k

41

Query Evaluation (2/3)

Q(x,u) :- Film(x,y), Cast(y,z), Actor(z,u)

Film

year  filmID  prob
1982  f1      p2
1982  f2      p3

Cast

filmID  actorID  prob
f1      a1       q4
f1      a2       q5
f2      a2       q6

Actor

actorID  name   prob
a1       Kevin  r1
a2       Kevin  r2

Prob(1982, Kevin) = ???

Data complexity of Q is #P-complete
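To see why this query is harder, the sketch below computes Prob(1982, Kevin) exactly by enumerating all possible worlds and compares it with a naive formula that wrongly treats the three derivations as independent. The numeric probabilities stand in for p2, p3, q4..q6, r1, r2 and are illustrative assumptions; the function names are hypothetical.

```python
from itertools import product

# Illustrative values standing in for p2, p3 / q4, q5, q6 / r1, r2.
FILM = {(1982, "f1"): 0.5, (1982, "f2"): 0.4}
CAST = {("f1", "a1"): 0.7, ("f1", "a2"): 0.2, ("f2", "a2"): 0.5}
ACTOR = {("a1", "Kevin"): 0.9, ("a2", "Kevin"): 0.6}

def exact(year, name):
    """Sum the probabilities of all worlds in which (year, name) is an answer."""
    tagged = ([("film", t, p) for t, p in FILM.items()]
              + [("cast", t, p) for t, p in CAST.items()]
              + [("actor", t, p) for t, p in ACTOR.items()])
    total = 0.0
    for keep in product([False, True], repeat=len(tagged)):
        pr = 1.0
        present = {"film": set(), "cast": set(), "actor": set()}
        for (rel, t, p), k in zip(tagged, keep):
            pr *= p if k else 1.0 - p
            if k:
                present[rel].add(t)
        if any((year, f) in present["film"] and (a, name) in present["actor"]
               for f, a in present["cast"]):
            total += pr
    return total

def naive_independent(year, name):
    """Incorrect shortcut: treat every derivation as an independent event."""
    miss = 1.0
    for (y, f), p in FILM.items():
        for (cast_f, a), q in CAST.items():
            for (actor_a, n), r in ACTOR.items():
                if y == year and cast_f == f and actor_a == a and n == name:
                    miss *= 1.0 - p * q * r
    return 1.0 - miss

print(exact(1982, "Kevin"), naive_independent(1982, "Kevin"))
```

The two numbers differ because the derivations share tuples (film f1 and actor a2 each occur in two of them), so they are correlated. The exact computation shown here takes time exponential in the number of tuples.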

42

Query Evaluation (3/3)

Theorem [Dalvi&S.’04,Graedel’98] Data complexity is:

Q                                            Q(Ip)
..., R(x,...), S(x,y,...), T(y,...), ...     #P-complete
otherwise                                    PTIME

But: restrictions on Q, Ip

Agenda item: Data complexity = f(k, specification lang, query lang)

43

Query Answering Using Views

Given: Q, V, Jp

Possible world: Ip s.t. V(Ip) = Jp

Compute: Q(Ip) = Answer(Q, V, Jp), top k

Theorem [Dalvi&S.’05] Answer(Q, V, Jp) computable in exptime

Corollary µ[Q | V] computable in exptime ($p(n) = C/n^2$)

But: restrictions on Q, Jp

Agenda item: Data complexity = f(k, specification lang, query lang)

44

Query Approximation

Given: Q, Ip

Compute: approximate Q(Ip) in PTIME, top k

Theorem [Karp’89] Given a DNF expression E, Pr[E] can be approximated efficiently using Monte Carlo simulation

Agenda item: Algorithm complexity = f(# calls to query engine)
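A Monte Carlo estimator for the probability of a DNF expression can be sketched as follows. This is a coverage-style estimator in the spirit of the Karp-Luby technique and may differ in detail from the cited algorithm. It assumes independent Boolean variables and at least one clause with non-zero probability; the example DNF is the expression for Prob(1982, Kevin) under illustrative probabilities, and all names are hypothetical.

```python
import random

def dnf_probability_estimate(clauses, prob, samples=20000, seed=0):
    """Estimate Pr[E] for a DNF E over independent Boolean variables.

    clauses: list of dicts {variable: required truth value}, one per conjunction
    prob:    dict variable -> probability of being True
    """
    rng = random.Random(seed)

    def clause_weight(c):
        w = 1.0
        for v, val in c.items():
            w *= prob[v] if val else 1.0 - prob[v]
        return w

    weights = [clause_weight(c) for c in clauses]
    total = sum(weights)                 # upper bound on Pr[E] (union bound)
    hits = 0
    for _ in range(samples):
        # Pick a clause with probability proportional to its weight ...
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # ... then sample an assignment conditioned on that clause being true.
        world = {v: rng.random() < p for v, p in prob.items()}
        world.update(clauses[i])
        # Count the sample only if i is the first clause the assignment
        # satisfies, so each satisfying assignment is counted once.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[v] == val for v, val in c.items()))
        hits += (first == i)
    return total * hits / samples

# Illustrative probabilities for the tuples of Film, Cast and Actor.
PROB = {"f1": 0.5, "f2": 0.4, "c11": 0.7, "c12": 0.2, "c22": 0.5, "a1": 0.9, "a2": 0.6}
DNF = [
    {"f1": True, "c11": True, "a1": True},
    {"f1": True, "c12": True, "a2": True},
    {"f2": True, "c22": True, "a2": True},
]
estimate = dnf_probability_estimate(DNF, PROB)
print(estimate)
```

Sampling only assignments that satisfy some clause keeps the fraction of counted samples from becoming vanishingly small, which is the weakness of sampling assignments uniformly from the whole space.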

45

Probabilistic XML

PXML: introduced by [Hung,Getoor,Subr.’03]

Still open:

Possible world semantics

Query evaluation

Agenda item: Theory of probabilistic XML data

46

Outline



Supporting technologies

Theoretical roots


Conclusions

47

Conclusions

Imprecisions in data are ubiquitous

Data management is ripe for a paradigm shift

Pre-existing technology, theory

Lots of interest from the industry

Kinds of theory: exploratory vs. explanatory

Questions ?
