Semistructured Data: The Next Step Dan Suciu

Source: homes.cs.washington.edu/~suciu/ssd-next.pdf




Thesis

Back then: semistructured data addressed imprecisions in structure, e.g. data integration, “new data”

Novel paradigm based on semi-structure

Today: many more kinds of imprecision are found in data and schema, e.g. data integration, “new data”

Novel paradigm based on probabilities


Outline

Semistructured data: past and present

Probabilistic databases: the new paradigm

Supporting technologies and theoretical roots

Towards a theory of probabilistic queries

Conclusions


Origins 1/2

Motivation: imprecisions in the structure

  data integration [Tsimmis’94-98]

  handling “new data” [Lorel’97], [UnQL’95]

Solution: relax the structure


Origins 2/2

Enabling technologies:

  OO databases

  non first normal form data

  querying SGML files

  text databases

Theoretical roots: trees, regular expressions


Today

Semistructured data in industry:

  XML support in RDBMS

  XML engines

  XML for data exchange

  stream XML processing

Semistructured data in research:

  often largest session in SIGMOD/VLDB/PODS

  dedicated symposiums, conferences, workshops

  2nd Dagstuhl workshop!


Today: Imprecisions Everywhere

misspellings

different terminologies

object matching

schema matching

measurement errors

constraint violations

None is addressed by semistructured data


Handling Imprecisions Today

several tools:

edit distance, q-gram distance, data cleaning, ontologies (WordNet), record linkage, schema matchings, repairs, ...

always outside the database system

Should be done inside the DBMS
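
Two of the tools listed above, edit distance and q-gram similarity, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the q-gram measure below uses a simplified set-based (Jaccard) overlap rather than any particular published definition.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of q-grams of s, padded so that prefixes and suffixes count too."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets (an assumed, simplified measure)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    for name in ["Kevin", "Keven", "Calvin", "Collins"]:
        print(name, edit_distance("Kevin", name), qgram_similarity("Kevin", name))
```

Scores like these are one plausible input for the per-tuple probabilities used later in the talk; how a distance is turned into a probability is a separate modelling choice.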


Summary of Past and Present

imprecisions in data structure:

led to the semistructured data model

many more imprecisions remain today

require a new data model


Outline

Semistructured data: past and present

Probabilistic databases: the new paradigm

Supporting technologies and theoretical roots

Towards a theory of probabilistic queries

Conclusions


Probabilistic Database

Definition: a probabilistic database is a database where each tuple has a probability

will make this more precise later

first, let’s see examples


Uncertain Predicates 1/2

SELECT DISTINCT A.name
FROM Actor A, Casts C, Film F
WHERE C.filmID = F.filmID and C.actorID = A.actorID
  and A.name ≈ ‘Kevin’ and F.year ≈ 1995 and F.rating ≈ ‘high’

misspellings, data translation, ontologies

Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v), x ≈ ‘Kevin’, u ≈ 1995, v ≈ ‘high’


Uncertain Predicates 2/2

Assign a probabilistic score to each tuple

↓

probabilistic database

Film (Film.rating ≈ high)

filmID  name  rating  prob. score
f1      ...   9.2     0.95
f2      ...   7.7     0.65
m3      ...   5.3     0.1
m4      ...   8.1     0.7

Actor (Actor.name ≈ ‘Kevin’)

actorID  name       prob. score
a1       Kevin B.   1.0
a2       R. Keven   0.9
a3       Calvin C.  0.8
a4       Collins    0.2

Now evaluate Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v) on the probabilistic database


Inexact Schema Matches 1/2

S2(name, address)

S1(customerName, officeAddress, homeAddress)

S2.address → S1.officeAddress? prob = 0.8

S2.address → S1.homeAddress? prob = 0.2


Inexact Schema Matches 2/2

Query translation:

select *
from S2
where S2.name ≈ ‘Kevin’ and S2.address = ‘Seattle’

“Run” this on S1

S1

customerName  officeAddress  homeAddress  prob. score
Kevin         Spokane        Seattle      0.2 * 1.00
Keven         Portland       Bellevue     0.0 * 0.95
Calvin        Seattle        Redmond      0.8 * 0.37
Collins       Ballard        Seattle      0.2 * 0.03

Probabilities allow us to consider both mappings
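
The score column of the S1 table can be reproduced with a short sketch. It assumes, as the table suggests, that S2.address maps to officeAddress with probability 0.8 and to homeAddress with probability 0.2, and it takes the name-match scores (1.00, 0.95, 0.37, 0.03) directly from the table; the names `MAPPING_PROB`, `score` and `rank` are illustrative only.

```python
# Assumed reading of the slides: probability that S2.address corresponds
# to each S1 column (the two mappings are mutually exclusive alternatives).
MAPPING_PROB = {"officeAddress": 0.8, "homeAddress": 0.2}

# S1 rows; name_score is the S2.name ≈ 'Kevin' score copied from the table.
S1 = [
    {"customerName": "Kevin", "officeAddress": "Spokane",
     "homeAddress": "Seattle", "name_score": 1.00},
    {"customerName": "Keven", "officeAddress": "Portland",
     "homeAddress": "Bellevue", "name_score": 0.95},
    {"customerName": "Calvin", "officeAddress": "Seattle",
     "homeAddress": "Redmond", "name_score": 0.37},
    {"customerName": "Collins", "officeAddress": "Ballard",
     "homeAddress": "Seattle", "name_score": 0.03},
]


def score(row, city):
    """(total probability of mappings under which address = city) * name score."""
    mapping = sum(p for col, p in MAPPING_PROB.items() if row[col] == city)
    return mapping * row["name_score"]


def rank(rows, city):
    """Rows as (customerName, score), best first."""
    scored = [(r["customerName"], score(r, city)) for r in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, s in rank(S1, "Seattle"):
        print(name, round(s, 3))
```

Under these assumptions Calvin ranks above Kevin even though his name match is weaker, because the more likely mapping (officeAddress) places him in Seattle.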


Handling Inconsistencies 1/2

Data Source V1

Name       City     Email
Larry Big  Seattle  lb@com
. . .      . . .    . . .

Data Source V2

Name       City      Phone
Larry Big  Bellevue  . . .
. . .      . . .     . . .

Global constraint: each person, only one city

Q: Larry Big’s email?

A: no answers, due to ‘repairs’ [Chomicki]


Handling Inconsistencies 2/2

Data Source V1

Name       City     Email   prob
Larry Big  Seattle  lb@com  p1
. . .      . . .    . . .   . . .

Data Source V2

Name       City      Phone  prob
Larry Big  Bellevue  . . .  p2
. . .      . . .     . . .  . . .

Global constraint: p1 + p2 ≤ 1

Q: Larry Big’s email?

A: lb@com, with probability p1 = 0.5 (no other info ⇒ p1 = p2 = 0.5)

Probabilities allow us to use all the data


Formal Definition

Definition: a probabilistic database is a probability distribution on all instances (“possible worlds”)

Pr : Inst → [0,1],  ∑_{I ∈ Inst} Pr[I] = 1

Very powerful: any correlations between tuples

Notation: probabilistic database I^p


Query Semantics

Definition: given I^p, query Q, tuple t:

Pr[t ∈ Q] = ∑_{I ∈ Inst : t ∈ Q(I)} Pr[I]

Note: EVERY query has well defined semantics!

Notation: Q(I^p) = prob distribution on tuples

Return to user: ranked tuples, top k
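
The definition translates almost literally into code when the distribution over instances is given explicitly. The sketch below assumes a tiny, made-up distribution over four worlds with correlated tuples "a" and "b"; `answer_distribution` and `top_k` are hypothetical helper names, not part of any system mentioned in the talk.

```python
from collections import defaultdict


def answer_distribution(worlds, query):
    """Pr[t in Q] = sum of Pr[I] over the worlds I with t in Q(I).

    worlds: list of (instance, probability), instance = frozenset of tuples.
    query:  function mapping an instance to a set of answer tuples.
    """
    dist = defaultdict(float)
    for instance, pr in worlds:
        for t in query(instance):
            dist[t] += pr
    return dict(dist)


def top_k(dist, k):
    """The k most probable answers, best first."""
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Made-up example: an arbitrary (correlated) distribution on four instances.
worlds = [
    (frozenset(), 0.1),
    (frozenset({"a"}), 0.2),
    (frozenset({"b"}), 0.3),
    (frozenset({"a", "b"}), 0.4),
]

if __name__ == "__main__":
    dist = answer_distribution(worlds, lambda inst: set(inst))  # identity query
    print(top_k(dist, 1))
```

Enumerating all instances is exponential in general, which is why the later slides look at restricted representations and at data complexity.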


Example

Film

filmID  title  year  prob
f1      t1     1995  p1
f2      t2     1982  p2
m3      t3     1982  p3
m4      t4     1971  p4

Actor

filmID  actorName  prob
f1      Kevin B.   q1
f1      R. Keven   q2
f1      Calvin C.  q3
f2      Collins    q4
f2      . . .      . . .

What are the “possible worlds”?

How do we “evaluate” a query?


Example: Possible Worlds

I^p =

filmID  title  year  prob
f1      T1     1995  p1
f2      T2     1982  p2
m3      T3     1982  p3
m4      T4     1971  p4

Possible worlds, each with its probability:

filmID  title  year
m3      T3     1982

(1-p1)*(1-p2)*p3*(1-p4)

filmID  title  year
f1      T1     1995
f2      T2     1982
m4      T4     1971

p1*p2*(1-p3)*p4

filmID  title  year
f1      T1     1995
f2      T2     1982
m3      T3     1982
m4      T4     1971

p1*p2*p3*p4

. . .
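
The possible worlds of a table with independent tuples can be enumerated mechanically. The sketch below assumes concrete values for p1..p4 (the slides leave them symbolic) and uses a hypothetical helper `possible_worlds`; each of the 2^4 subsets gets the product of p for the tuples it keeps and (1-p) for those it drops.

```python
from itertools import product


def possible_worlds(prob_table):
    """Yield (world, probability) for every subset of independent tuples.

    prob_table: dict mapping a tuple to its marginal probability.
    """
    tuples = list(prob_table)
    for keep in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, k in zip(tuples, keep) if k)
        pr = 1.0
        for t, k in zip(tuples, keep):
            pr *= prob_table[t] if k else 1.0 - prob_table[t]
        yield world, pr


# Film table from the slide; the numeric probabilities are assumptions.
film = {
    ("f1", "T1", 1995): 0.9,   # p1
    ("f2", "T2", 1982): 0.5,   # p2
    ("m3", "T3", 1982): 0.4,   # p3
    ("m4", "T4", 1971): 0.7,   # p4
}

worlds = dict(possible_worlds(film))

if __name__ == "__main__":
    print(len(worlds), "worlds, total probability", sum(worlds.values()))
```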


Example: Query Evaluation

Q(x) :- Film(x,y), Cast(y,z)

Film^p

year  filmID  prob
1995  f1      p1
1982  f2      p2
1982  m3      p3

Cast^p

filmID  actorName  prob
f1      Kevin B.   q1
f1      R. Keven   q2
f1      Calvin C.  q3
f2      Collins    q4
f2      . . .      q5
m3      . . .      q6

Prob(1995) = p1 * (1 - (1 - q1)*(1 - q2)*(1 - q3))

Prob(1982) = 1 - (1 - p2 * (1 - (1 - q4)*(1 - q5))) * (1 - p3 * (1 - (1 - q6)))
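
As a sanity check, the closed-form expressions for Q(x) :- Film(x,y), Cast(y,z) can be compared against a brute-force sum over all possible worlds. The numeric probabilities are assumptions, and the actor names "?5" and "?6" are placeholders for the entries the slide elides.

```python
from itertools import product

# Assumed numeric values for p1..p3 and q1..q6.
film = {(1995, "f1"): 0.9, (1982, "f2"): 0.5, (1982, "m3"): 0.4}
cast = {("f1", "Kevin B."): 0.8, ("f1", "R. Keven"): 0.7,
        ("f1", "Calvin C."): 0.6, ("f2", "Collins"): 0.5,
        ("f2", "?5"): 0.3, ("m3", "?6"): 0.2}   # "?5", "?6": placeholders


def closed_form(year):
    """1 - prod over films of that year (1 - p * (1 - prod over its cast (1 - q)))."""
    no_film_qualifies = 1.0
    for (y, fid), p in film.items():
        if y != year:
            continue
        no_cast = 1.0
        for (f, _actor), q in cast.items():
            if f == fid:
                no_cast *= 1.0 - q
        no_film_qualifies *= 1.0 - p * (1.0 - no_cast)
    return 1.0 - no_film_qualifies


def brute_force(year):
    """Sum the probabilities of all worlds in which year is an answer."""
    facts = ([("film", t, p) for t, p in film.items()]
             + [("cast", t, p) for t, p in cast.items()])
    total = 0.0
    for keep in product([False, True], repeat=len(facts)):
        pr, films, casts = 1.0, set(), set()
        for (rel, t, p), k in zip(facts, keep):
            pr *= p if k else 1.0 - p
            if k:
                (films if rel == "film" else casts).add(t)
        if any(y == year and any(f == fid for f, _ in casts)
               for y, fid in films):
            total += pr
    return total


if __name__ == "__main__":
    for year in (1995, 1982):
        print(year, closed_form(year), brute_force(year))
```

The two agree here because every Cast tuple joins with exactly one Film tuple, so the events being combined are independent; that property does not hold for every query.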


Summary of the new Paradigm

System converts all imprecisions into a prob db I^p

User asks standard query Q

System computes Q(I^p), ranks tuples, returns top k

Excellent for exploratory queries, not for business processes

Variations:

  Expose probabilities: Trio [Stanford]

  Hide probabilities: MystiQ [UW]


Outline

Semistructured data: past and present

Probabilistic databases: the new paradigm

Supporting technologies and theoretical roots

Towards a theory of probabilistic queries

Conclusions


Technology: Top-k Ranking [Chaudhuri&Gravano’99]

Data is exact, but user is willing to accept inexact matches

SELECT *
FROM Houses
WHERE no_bdr = 4 and no_garages = 3 and price = 300k

Rich recent literature on evaluation techniques


Technology: Optimal Aggregation

Theorem [Fagin,Lotem,Naor’01,03] There are optimal algorithms for computing top k
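
A minimal sketch in the spirit of the threshold algorithm from that line of work is shown below. It assumes every object has a score in every list, a monotone aggregate (sum by default), and cheap random access; the function name and the example data are hypothetical. A real implementation would read pre-sorted lists rather than sorting them itself.

```python
import heapq


def threshold_algorithm(lists, k, agg=sum):
    """Top-k objects under a monotone aggregate, stopping early when safe.

    lists: one dict per attribute, mapping object id -> score.
    Returns up to k (object id, aggregate score) pairs, best first.
    """
    # Sorted here only for the sketch; TA assumes the lists arrive sorted.
    sorted_lists = [sorted(l.items(), key=lambda kv: kv[1], reverse=True)
                    for l in lists]
    seen = {}
    for depth in range(len(sorted_lists[0])):
        last_scores = []
        for sl in sorted_lists:
            obj, s = sl[depth]                      # sorted access
            last_scores.append(s)
            if obj not in seen:
                seen[obj] = agg(l[obj] for l in lists)   # random access
        # No unseen object can score higher than this threshold.
        threshold = agg(last_scores)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Made-up per-attribute match scores for four houses.
price_score = {"h1": 0.9, "h2": 0.8, "h3": 0.3, "h4": 0.1}
bedroom_score = {"h1": 0.2, "h2": 0.7, "h3": 0.9, "h4": 0.1}

if __name__ == "__main__":
    print(threshold_algorithm([price_score, bedroom_score], k=2))
```

On this data the loop stops after three rounds of sorted access, without ever reading the last entry of either list.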


Technology: Ranking Query Results

Keyword searches

  DBExplorer [Chaudhuri’02]

  DISCOVER [Hristidis’02]

  BANKS [Bhalotia’02]

Using ontologies

  XXL [Theobald’02]

  TOSS [Hung’04]

Approximate SQL

  [Agrawal’03]

  [Dalvi’04]

Data/queries are inexact


Technology: Object Matching

AKA: object fusion, record linkage, merge/purge, de-duplication, ...

Long history, rich literature

Problem: given n objects, find duplicates assuming typos, misspellings, ...

Theorem: AI-complete

[Fellegi&Sunter’69, Winkler’99]


Technology: Information Extraction

KnowItAll [Etzioni et al.’04]

Extractors defined per class, e.g. scientist

Collect large body of information by searching the Web

Albert Einstein    0.9262188000
Niels Bohr         0.9207664000
Enrico Fermi       0.9199148000
Marie Curie        0.9156590000
Max Planck         0.9126473000
Ernest Rutherford  0.9108679000
Michael Faraday    0.9105023000
Louis Pasteur      0.8986677000
Stephen Hawking    0.8986677000
Henry Cavendish    0.8981776000

Lots of useful probabilistic databases!


Theory: Random Graphs (1/2)

Erdős and Rényi

Independent tuple distribution: Pr[t] = p(n)

Given property Q, let µ[Q] = lim_{n→∞} Pr[Q]

Theorem: Every monotone Q has a “threshold function” t(n):

  if p(n) << t(n) then µ[Q] = 0

  if p(n) >> t(n) then µ[Q] = 1

p(n) = C/n^2


Theory: Random Graphs (2/2)

Erdős and Rényi

Program: find “threshold” functions p(n) for various properties Q

Important example: subgraph property Q ⊆ G

Threshold is: t(n) = 1/n^h (note: >> 1/n^2), where h = min_H H/e(H) (inverse density)


Theory: 0/1 Laws

Shelah and Spencer

Pr[t] = 1/n^β

0/1 law holds if β is irrational

Grove, Halpern, Koller

Pr[t] = 1/2

µ[Q|V] does not always exist, and it’s undecidable

Fagin’76

Pr[t] = 1/2

for any Q in FO, µ[Q] = 0 or µ[Q] = 1

p(n) = C/n^2


Theory: Prob DB’s (1/3) [Fuhr&Roellke]

Formalism for specifying a prob db

Independent events: e1, e2, ... with probabilities: p1, p2, ...

Intensional database:

I^p =

A  B  E
a  x  e1
a  y  e3 ∧ e4
b  z  e4
b  u  e2 ∨ (¬e4 ∧ e1)

Extensional database: all events are independent


Theory: Prob DB’s (2/3) [Fuhr&Roellke]

Possible worlds semantics

given E ⊆ {e1, e2, ...} define I = inst(I^p, E)

Pr[I] = ∑ {Pr[E] | I = inst(I^p, E)}

Theorem [Dalvi&S] Every probability distribution Pr[I] comes from an intensional db

Theorem Computing Pr[t ∈ I^p] is #P-complete
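
The semantics can be sketched by enumerating every set E of true events and collecting the instance it induces. The example uses the four-row intensional table from the previous slide; the event probabilities are assumed values, the event expressions are encoded as Python functions over a truth assignment, and the helper names are hypothetical.

```python
from itertools import product
from collections import defaultdict

# Assumed probabilities for the independent events e1..e4.
event_prob = {"e1": 0.5, "e2": 0.4, "e3": 0.3, "e4": 0.2}

# Intensional table: tuple (A, B) -> event expression over an assignment v.
table = {
    ("a", "x"): lambda v: v["e1"],
    ("a", "y"): lambda v: v["e3"] and v["e4"],
    ("b", "z"): lambda v: v["e4"],
    ("b", "u"): lambda v: v["e2"] or (not v["e4"] and v["e1"]),
}


def world_distribution(table, event_prob):
    """Pr[I] = sum of Pr[E] over all event sets E that induce instance I."""
    names = list(event_prob)
    dist = defaultdict(float)
    for bits in product([False, True], repeat=len(names)):
        v = dict(zip(names, bits))
        pr = 1.0
        for e in names:
            pr *= event_prob[e] if v[e] else 1.0 - event_prob[e]
        instance = frozenset(t for t, expr in table.items() if expr(v))
        dist[instance] += pr
    return dict(dist)


def tuple_probability(dist, t):
    """Marginal probability that tuple t appears in the instance."""
    return sum(pr for instance, pr in dist.items() if t in instance)


if __name__ == "__main__":
    dist = world_distribution(table, event_prob)
    for t in table:
        print(t, tuple_probability(dist, t))
```

The loop visits 2^n event assignments, so this is only feasible for toy inputs, which is consistent with the #P-completeness stated above.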


Theory: Prob DB’s (3/3)

Theorem [Fuhr&Roellke] Closed under FO evaluation

A  B  E
a  x  e1
a  y  e3 ∧ e4
b  z  e4
b  u  e2 ∨ (¬e4 ∧ e1)

∃y.actor(x,y)

A  E
a  e1 ∨ (e3 ∧ e4)
b  e4 ∨ (e2 ∨ (¬e4 ∧ e1))

Intensional databases: powerful but expensive
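
The projection step can be sketched as follows: tuples that agree on A are merged and their event expressions are OR-ed. The event probabilities are assumed values, the expressions are encoded as Python functions over a truth assignment, and `project_on_a` and `probability` are hypothetical helper names.

```python
from itertools import product

# Assumed probabilities for the independent events e1..e4.
event_prob = {"e1": 0.5, "e2": 0.4, "e3": 0.3, "e4": 0.2}

# Input table: tuple (A, B) -> event expression over an assignment v.
table = {
    ("a", "x"): lambda v: v["e1"],
    ("a", "y"): lambda v: v["e3"] and v["e4"],
    ("b", "z"): lambda v: v["e4"],
    ("b", "u"): lambda v: v["e2"] or (not v["e4"] and v["e1"]),
}


def project_on_a(table):
    """Evaluate 'exists y' by OR-ing the expressions of rows sharing A."""
    grouped = {}
    for (a, _b), expr in table.items():
        grouped.setdefault(a, []).append(expr)
    return {a: (lambda v, exprs=exprs: any(e(v) for e in exprs))
            for a, exprs in grouped.items()}


def probability(expr, event_prob):
    """Exact Pr[expr] by summing over all truth assignments (exponential)."""
    names = list(event_prob)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        v = dict(zip(names, bits))
        if expr(v):
            pr = 1.0
            for e in names:
                pr *= event_prob[e] if v[e] else 1.0 - event_prob[e]
            total += pr
    return total


if __name__ == "__main__":
    for a, expr in project_on_a(table).items():
        print(a, probability(expr, event_prob))
```

Building the result expressions is cheap; the cost sits in computing their probabilities, which is the "powerful but expensive" trade-off.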


Summary of Supporting Work

Supporting technology

a lot exists

need to understand it in order to make correct assumptions

Theoretical roots

Old and solid

Do not apply out of the box


Outline

Semistructured data: past and present

Probabilistic databases: the new paradigm

Supporting technologies and theoretical roots

Towards a theory of probabilistic queries

Conclusions


An Agenda for a Theory of Probabilistic Queries

Specification languages

Query evaluation: data complexity

Query evaluation using views

Randomized PTIME query approximation

Probabilistic XML


Specification Languages

Extensional databases

Intensional databases

Probabilistic Relational Model

Implicit probabilities

[Fefferman’99]

[Fuhr&Roellke’97]

[Erdős&Rényi, Shelah&Spencer]

Agenda item: Expressibility/complexity/conciseness tradeoffs


Query Evaluation (1/3)

Given: Q, I^p

Compute: Q(I^p)

Q(x) :- Film(x,y), Cast(y,z)

Film

year  filmID  prob
1982  f1      p2
1982  f2      p3

Cast

filmID  actorID  prob
f1      a1       q4
f1      a2       q5
f2      a2       q6

Prob(1982) = 1 - (1 - p2 * (1 - (1 - q4)*(1 - q5))) * (1 - p3 * (1 - (1 - q6)))

Data complexity of Q is in PTIME

Variation: return top k


Query Evaluation (2/3)

Q(x,u) :- Film(x,y), Cast(y,z), Actor(z,u)

Film

year  filmID  prob
1982  f1      p2
1982  f2      p3

Cast

filmID  actorID  prob
f1      a1       q4
f1      a2       q5
f2      a2       q6

Actor

actorID  name   prob
a1       Kevin  r1
a2       Kevin  r2

Prob(1982, Kevin) = ???

Data complexity of Q is #P-complete
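
To see why no simple product formula is apparent here, one can compare the exact answer, obtained by brute force over all 2^7 worlds, with a naive estimate that treats the three derivations of (1982, Kevin) as independent. All probabilities are set to 0.5 purely as an assumption; the function names are hypothetical.

```python
from itertools import product

# Assumed values: every tuple probability (p2, p3, q4, q5, q6, r1, r2) is 0.5.
film = {(1982, "f1"): 0.5, (1982, "f2"): 0.5}
cast = {("f1", "a1"): 0.5, ("f1", "a2"): 0.5, ("f2", "a2"): 0.5}
actor = {("a1", "Kevin"): 0.5, ("a2", "Kevin"): 0.5}


def exact(year, name):
    """Pr[(year, name) in Q] by enumerating every possible world."""
    facts = ([("film", t, p) for t, p in film.items()]
             + [("cast", t, p) for t, p in cast.items()]
             + [("actor", t, p) for t, p in actor.items()])
    total = 0.0
    for keep in product([False, True], repeat=len(facts)):
        pr = 1.0
        present = {"film": set(), "cast": set(), "actor": set()}
        for (rel, t, p), k in zip(facts, keep):
            pr *= p if k else 1.0 - p
            if k:
                present[rel].add(t)
        if any(y == year and (f, a) in present["cast"]
               and (a, name) in present["actor"]
               for (y, f) in present["film"]
               for (a, _n) in actor):
            total += pr
    return total


def naive(year, name):
    """Incorrect shortcut: treat each derivation as an independent event."""
    miss = 1.0
    for (y, f), p in film.items():
        for (f2, a), q in cast.items():
            r = actor.get((a, name))
            if y == year and f2 == f and r is not None:
                miss *= 1.0 - p * q * r
    return 1.0 - miss


if __name__ == "__main__":
    print("exact:", exact(1982, "Kevin"), "naive:", naive(1982, "Kevin"))
```

The two numbers differ because the derivations share tuples (film f1 and actor a2 each occur in two of them), so their events are correlated.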


Query Evaluation (3/3)

Theorem [Dalvi&S.’04, Graedel’98] Data complexity is:

Q                                         Q(I^p)
..., R(x,...), S(x,y,...), T(y,...), ...  #P-complete
otherwise                                 PTIME

But: restrictions on Q, I^p

Agenda item: Data complexity = f(k, specification lang, query lang)


Query Answering Using Views

Given: Q, V, J^p

Possible world: I^p s.t. V(I^p) = J^p

Compute: Q(I^p) = Answer(Q, V, J^p), top k

Theorem [Dalvi&S.’05] Answer(Q, V, J^p) computable in exptime

Corollary µ[Q | V] computable in exptime (p(n) = C/n^2)

But: restrictions on Q, J^p

Agenda item: Data complexity = f(k, specification lang, query lang)


Query Approximation

Given: Q, I^p

Compute: approximate Q(I^p) in PTIME, top k

Theorem [Karp’89] Given a DNF expression E, Pr[E] can be approximated efficiently using Monte Carlo simulation

Agenda item: Algorithm complexity = f(# calls to query engine)
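
One well-known way to do this is a coverage-style Monte Carlo estimator in the Karp-Luby tradition; the sketch below is an illustration under stated assumptions, not necessarily the exact procedure the theorem refers to. It assumes independent Boolean variables, clauses given as dicts from variable name to required truth value, and a made-up three-clause DNF with all probabilities set to 0.5.

```python
import random


def clause_prob(clause, p):
    """Probability that every literal of one conjunctive clause holds."""
    pr = 1.0
    for var, val in clause.items():
        pr *= p[var] if val else 1.0 - p[var]
    return pr


def satisfied(clause, v):
    return all(v[var] == val for var, val in clause.items())


def dnf_probability_estimate(dnf, p, samples=20000, rng=None):
    """Coverage estimator: unbiased for Pr[clause_1 or ... or clause_m]."""
    rng = rng or random.Random(0)
    weights = [clause_prob(c, p) for c in dnf]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(samples):
        # Pick a clause proportionally to its probability ...
        i = rng.choices(range(len(dnf)), weights=weights)[0]
        # ... then sample an assignment conditioned on that clause being true.
        v = {var: rng.random() < p[var] for var in p}
        v.update(dnf[i])
        # Count the sample only if i is the first clause it satisfies,
        # so each satisfying assignment is counted exactly once.
        first = next(j for j, c in enumerate(dnf) if satisfied(c, v))
        hits += (first == i)
    return total * hits / samples


# Made-up DNF over independent variables, all with probability 0.5.
p = {name: 0.5 for name in ["p2", "p3", "q4", "q5", "q6", "r1", "r2"]}
dnf = [
    {"p2": True, "q4": True, "r1": True},
    {"p2": True, "q5": True, "r2": True},
    {"p3": True, "q6": True, "r2": True},
]

if __name__ == "__main__":
    print(dnf_probability_estimate(dnf, p))
```

For this formula the exact value is 0.3046875, so the estimate should land close to it; the number of samples controls the accuracy.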


Probabilistic XML

PXML: introduced by [Hung,Getoor,Subr.’03]

Still open:

Possible world semantics

Query evaluation

Agenda item: Theory of probabilistic XML data


Outline

Semistructured data: past and present

Probabilistic databases: the new paradigm

Supporting technologies

Theoretical roots

Towards a theory of probabilistic queries

Conclusions


Conclusions

Imprecisions in data are ubiquitous

Data management is ripe for a paradigm shift

Pre-existing technology, theory

Lots of interest from the industry

Kinds of theory: exploratory vs. explanatory


Questions ?