Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Semistructured Data:The Next Step
Dan Suciu
2
ThesisBack then: semistructured data addressed imprecisions in structure: e.g. data integration, “new data”
Novel paradigm based on semi-structure
Today: many more variety of imprecisions found in data and schema: e.g. data integration, “new data”
Novel paradigm based on probabilities
3
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic queries
Conclusions
4
Origins 1/2
Motivation: imprecisions in the structure
data integration [Tsimmis’94-98] handling “new data” [Lorel’97], [UnQL’95]
Solution: relax the structure
5
Origins 2/2
Enabling technologies: OO databases non first normal form data querying SGML files text databases
Theoretical roots: trees, regular expressions
6
TodaySemistructured data in industry: XML support in RDBMS XML engines XML for data exchange stream XML processing
Semistructured data in research: often largest session in SIGMOD/VLDB/PODS dedicated symposiums, conferences, workshops 2nd Dagstuhl workshop !
7
Today:Imprecisions Everywhere
misspellings
different terminologies
object matching
schema matching
measurement errors
constraint violations
None is addressed by semistructured data
8
Handling Imprecisions Today
several tools:
edit distance, q-gram distance, data cleaning, ontologies (WordNet), record linkeage, schema mathings, repairs, ....
always outside the database system
Should be done inside DBMS
9
Summary of Past and Present
imprecisions in data structure:
lead to semistructured data model
many more imprecisions remain today
require new data model
10
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic
Conclusions
11
Probabilistic Database
DefinitionA prob DB = a database where each tuple has a probability
will make this more precise later
first, let’s see examples
12
Uncertain Predicates 1/2
SELECT DISTINCT A.name FROM Actor A, Casts C, Film FWHERE C.filmID = F.filmID and C.actorID = A.actorID and A.name ≈ ‘Kevin’ and F.year ≈ 1995 and F.rating ≈ ‘high’
misspellings, data translation, ontologies
Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v), x ≈ ‘Kevin’, u ≈ 1995, v ≈ ‘high’
13
Uncertain Predicates 2/2Assign a probabilistic score to each tuple
↓probabilistic database
filmID name rating prob. score
f1 ... 9.2 0.95
f2 ... 7.7 0.65
m3 ... 5.3 0.1
m4 ... 8.1 0.7
Film
Film.rating ≈ high
actorID name prob. score
a1 Kevin B. 1.0
a2 R. Keven 0.9a3 Calvin C. 0.8
a4 Collins 0.2
Actor
Actor.name ≈ ‘Kevin’
Now evaluate Q(x) :- Actor(x,y), Casts(y,z), Film(z,u,v)on probabilistic database
14
Inexact Schema Matches 1/2
S2(name, address)
S1(customerName, officeAddress, homeAddress)
?prob=0.8
prob=0.2
15
Inexact Schema Matches 2/2
Query translation:select *from S2where S2.name ≈ ‘Kevin’ and S2.address = ‘Seattle’
customerName officeAddress homeAddress prob. score
Kevin Spokane Seattle 0.2 * 1.00
Keven Portland Bellvue 0.0 * 0.95
Calvin Seattle Redmond 0.8 * 0.37
Collins Ballard Seattle 0.2 * 0.03
S1
Probabilities allow us to consider both mappings
“Run” this on S1
16
Handling Inconsistencies 1/2
Name City Email
Larry Big Seattle lb@com. . . .
.. . . . .
Name City Phone
Larry Big Bellvue . . .
. . . . .
Data Source V1 Data Source V2
Global constraint: each person, only one city
Q: Larry Big’s email ?A: no answers, due to ‘repairs’
[Chomicki]
17
Handling Inconsistencies 2/2
Name City Email prob
Larry Big Seattle lb@com p1. . . .
.. . . . .
Name City Phone prob
Larry Big Bellvue . . . p2
. . . . .
Data Source V1 Data Source V2
Global constraint: p1 + p2 ≤ 1
Q: Larry Big’s email ?A: lb@com, with probability p1 = 0.5
no other info ? p1=p2=0.5
Probabilities allow us to use all the data
18
Formal Definition
A probabilistic database = a probability distribution on all instances
Definition
Pr: Inst → [0,1], ∑I∈Inst Pr[I] = 1
Very powerful: any correlations between tuples
Notation: probabilistic database Ip
“Possible Worlds”
19
Query Semantics
Given Ip, query Q, tuple t:
Pr[t ∈Q] = ∑I∈Inst: t ∈Q(I) Pr[I]
Definition
Note: EVERY query has well defined semantics !
Notation: Q(Ip) = prob distribution on tuples
Return to user: ranked tuples, top k
20
Example
Film
What are the “possible worlds” ?How do we “evaluate” a query ?
Actor
filmID title year prob
f1 t1 1995 p1f2 t2 1982 p2m3 t3 1982 p3m4 t4 1971 p4
filmID actorName prob
f1 Kevin B. q1f1 R. Keven q2f1 Calvin C. q3f2 Collins q4f2 . . . . . .
21
Example: Possible WorldsfilmID title year prob
f1 T1 1995 p1f2 T2 1982 p2m3 T3 1982 p3m4 T4 1971 p4
filmID title year
m3 T3 1982
(1-p1)*(1-p2)*p3*(1-p4)
filmID title year
f1 T1 1995
f2 T2 1982
m4 T4 1971
p1*p2*(1-p3)*p4
filmID title year
f1 T1 1995
f2 T2 1982
m3 T3 1982
m4 T4 1971
p1*p2*p3*p4
. . .
. . .
Ip
22
Example: Query Evaluationyear filmID prob
1995 f1 p11982 f2 p21982 m3 p3
filmID actorName prob
f1 Kevin B. q1f1 R. Keven q2f1 Calvin C. q3f2 Collins q4f2 . . . q5m3 . . . q6
Prob(1995) = p1 * (1 - (1 - q1)*(1 - q2)*(1 - q3))
Prob(1982) = 1 - (1 - )* (1 - )
p2 * (1 - (1 - q4)*(1 - q5))p3 * (1 - (1 - q6))
Filmp Castp
Q(x) :- Film(x,y), Cast(y,z)
23
Summary of the new Paradigm
System converts all imprecisions into prob db Ip
User asks standard query Q
System computes Q(Ip), ranks tuples, returns top k
Excellent exploratory queries, not business process
Variations: Expose probabilities: Trio [Stanford] Hide probabilities: MystiQ [UW]
24
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic queries
Conclusions
25
Technology: Top-k Ranking[Chaudhuri&Gravano’99]
Data is exact, but user is willingto accept inexact matches
SELECT * FROM HousesWHERE no_bdr = 4 and no_garages = 3 and price= 300k
Rich recent literature on evaluation techniques
26
Technology:Optimal Aggregation
Theorem [Fagin,Lotem,Naor’01,03]There are optimal algorithms for computing top k
27
Technology:Ranking Query Results
Keyword searches DBExplorer [Chaudhuri’02]DISCOVER [Hristidis’02]BANKS [Bhalotia’02]
Using ontologies XXL [Theobald’02]TOSS [Hung’04]
Approximate SQL [Agrawal’03][Dalvi’04]
Data/queries are inexact
28
Technology:Object Matching
AKA: Object fusion, record linkage,merge/purge, de-duplication, ....
Long history, rich literature
Problem: given n objects, find duplicates assuming typos, misspellings, ...
Theorem: AI-complete
[Felligi&Sunter’69, Winkler’99]
29
Technology:Information Extraction
KnoItAll [Etzioni et al.’04]
Extractors defined per class. E.g. scientist
Collect large body of information by searching the Web
Albert Einstein 0.9262188000Niels Bohr 0.9207664000Enrico Fermi 0.9199148000Marie Curie 0.9156590000Max Planck 0.9126473000Ernest Rutherford 0.9108679000Michael Faraday 0.9105023000Louis Pasteur 0.8986677000Stephen Hawking 0.8986677000Henry Cavendish 0.8981776000
Lots of usefulprobabilisticdatabases !
30
Theory:Random Graphs (1/2)
Erdos and Reny
Independent tuple distribution: Pr[t] = p(n);
Given property Q, let µ[Q] = limn→∞ Pr[Q]
Theorem Every monotone Q has “threshod function” t(n): if p(n) << t(n) then µ[Q] = 0 if p(n) >> t(n) then µ[Q] = 1
p(n) = C/n2
31
Theory:Random Graphs (2/2)
Erdos and Reny
Programfind “threshold” functions p(n) for various properties Q
Important example: subgraph property: Q ⊆ GThreshold is: t(n) = 1/nh (note: >> 1/n2) where h = minH H/e(H) (inv. density)
32
Theory: 0/1 Laws
Shelah and Spencer
Pr[t] = 1/nß
0/1 law holds if ß is irrational
Grove, Halpern, Koller
Pr[t] = 1/2µ[Q|V] does not always exist, and it’s undecidable
Fagin’76
Pr[t] = 1/2for an Q in FO, µ[Q] = 0 or µ[Q] = 1
p(n) = C/n2
33
Theory: Prob DB’s (1/3)[Fuhr&Roellke]
Independent events: e1, e2, ... with probabilities: p1, p2, ...
A B Ea x e1a y e3∧e4b z e4b u e2 ∨ (¬e4∧e1)
Formalism for specifying a prob db
Intensional database:
Ip =
Extensional database: all events are independent
34
Theory: Prob DB’s (2/3)[Fuhr&Roellke]
given E⊆{e1, e2, ...} define I = inst(Ip,E)
Pr[I] = ∑ {Pr[E] | I = inst(Ip,E)}
Possible worlds semantics
Theorem [Dalvi&S] Every probability distribution Pr[I] comes from an intentional db
Theorem Computing Pr[t ∈ Ip] is #P complete
35
Theory: Prob DB’s (3/3)
A B Ea x e1a y e3∧e4b z e4b u e2 ∨ (¬e4∧e1)
Theorem [Fuhr&Roellke] Closed under FO evaluation
A Ea e1∨(e3∧e4)b e4 ∨ (e2 ∨ (¬e4∧e1) )
∃y.actor(x,y)
Intesional databases: powerful but expensive
36
Summary of SupportingWork
Supporting technology
a lot exists
need to understand it in order to make correct assumptions
Theoretical roots
Old and solid
Do not apply out of the box
37
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies and theoretical roots
Towards a theory of probabilistic queries
Conclusions
38
An Agenda for a Theory of Probabilistic Queries
Specification languages
Query evaluation: data complexity
Query evaluation using views
Randomized PTIME query approximation
Probabilistic XML
39
Specification Languages
Extensional databases
Intensional databases
Probabilistic Relational Model
Implicit probabilities
[Fefferman’99]
[Fuhr&Roellke’97]
[Erdos&Reny,Shelah&Spencer]
Agenda Item:Expressibility/complexity/conciseness tradeoffs
40
Query Evaluation (1/3)Given: Q, Ip
Compute: Q(Ip)
Data complexity of Q is in PTIME
Prob(1982) = 1 - (1 - )* (1 - )
p2 * (1 - (1 - q4)*(1 - q5))p3 * (1 - (1 - q6))
Q(x) :- Film(x,y), Cast(y,z)
year filmID prob
1982 f1 p21982 f2 p3
FilmfilmID actorID prob
f1 a1 q4f1 a2 q5f2 a2 q6
Cast
Variation: return top k
41
Query Evaluation (2/3)
Prob(1982, Kevin) = ???
Q(x,u) :- Film(x,y), Cast(y,z), Actor(z,u)
Data complexity of Q is #P-complete
actorID name prob
a1 Kevin r1a2 Kevin r2
Actor
year filmID prob
1982 f1 p21982 f2 p3
FilmfilmID actorID prob
f1 a1 q4f1 a2 q5f2 a2 q6
Cast
42
Query Evaluation (3/3)Theorem [Dalvi&S.’04,Graedel’98] Data complexity is:
But: restrictions on Q, Ip
Agenda item:Data complexity = f(k, specification lang, query lang)
Q Q(Ip)
..., R(x,...),S(x,y,...),T(y,...),... #P-complete
otherwise PTIME
43
Query Answering Using Views
Given: Q, V, Jp
Possible world: Ip s.t. V(Ip) = Jp
Compute: Q(Ip) = Answer(Q, V, Jp), top k
Theorem[Dalvi&S.’05] Answer(Q, V, Jp) computable in exptime
Agenda item:Data complexity = f(k, specification lang, query lang)
But: restrictions on Q, Jp
Corollary µ[Q | V] computable in exptime p(n) = C/n2
44
Query Approximation
Agenda item:Algorithm complexity = f(# calls to query engine)
Given: Q, Ip
Compute: approximate Q(Ip) in PTIME, top k
Theorem [Karp’89] Given a DNF expression E,Pr[E] can be approximated efficiently usingMonte Carlo simulation
45
Probabilistic XML
PXML: introduced by [Hung,Getoor,Subr.’03]
Still open:
Possible world semantics
Query evaluation
Agenda item:Theory of probabilistic XML data
46
Outline
Semistructured data: past and present
Probabilistic databases: the new paradigm
Supporting technologies
Theoretical roots
Towards a theory of probabilistic queries
Conclusions
47
Conclusions
Imprecisions in data are ubiquitous
Data management is ripe for a paradigm shift
Pre-existing technology, theory
Lots of interest from the industry
Kinds of theory: exploratory v.s. explanatory
Questions ?