Representing and Querying XML with Incomplete Information
Serge Abiteboul, INRIA
Luc Segoufin, INRIA
Victor Vianu, UCSD
pods 2001 Abiteboul-Segoufin-Vianu 2
Organization
• Motivations
• Simplifying assumptions
• Model of incompleteness
• Answering queries
• Results
• Discussion
• Conclusion
Motivations
The Web is a world of incompleteness
• Information you get from the Web is seldom complete:
  • Queries return some, not all, data
  • Limited storage capability
  • Documents change on the Web: expiration
  • Sites are unavailable…
• Context: A warehouse of XML documents from the Web, Xyleme
This work
• This work: simple, practically appealing approach to managing incomplete information
• Sequence of queries to the Web
  • (q1,A1)+(q2,A2)+…
  • Answers are cached
• Process a new query without access to the Web
  • Give an incomplete answer
  • Explain incompleteness to user
  • Seek additional information, i.e., find a minimal set of queries to fully answer
Related works
• Semantic caching
• Answering queries using views
  • keep (Qi,Ai)
  • try to rewrite query Q into Q'(A1,...,An)
  • reject if you cannot
• Incomplete database
  • (Qi,Ai) is some incomplete knowledge of DB
  • Related to querying incomplete information, e.g. Lipski-Imielinski
Challenge: balance expressiveness and tractability
• Choice of data model
• Choice of the query language
• Choice of a representation of incompleteness
• Results
  • Simple, practical solution
  • Extra features lead to serious problems
Simplifying Assumptions
Data is XML: trees
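[Figure: the document drawn as a tree. dealer has children UsedCars and NewCars; the UsedCars ad has model = Honda and year = 96; the NewCars ad has model = Acura.]
<dealer>
  <UsedCars>
    <ad> <model>Honda</model> <year>96</year> </ad>
  </UsedCars>
  <NewCars>
    <ad> <model>Acura</model> </ad>
  </NewCars>
</dealer>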
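The tree view of the dealer document can be reproduced with a few lines of Python. This is only a sketch: the `(label, value, children)` tuple layout and the helper name `to_tree` are assumptions made for illustration, not notation from the talk.

```python
import xml.etree.ElementTree as ET

DOC = """
<dealer>
  <UsedCars>
    <ad><model>Honda</model><year>96</year></ad>
  </UsedCars>
  <NewCars>
    <ad><model>Acura</model></ad>
  </NewCars>
</dealer>
"""

def to_tree(elem):
    """Turn an element into (label, value, children).

    The children list keeps document order, but the model in the talk
    treats trees as unordered, so that order carries no meaning here.
    """
    children = [to_tree(child) for child in elem]
    text = (elem.text or "").strip()
    # In this simplified sketch only leaves carry a data value.
    value = text if (text and not children) else None
    return (elem.tag, value, children)

tree = to_tree(ET.fromstring(DOC))
```

Here `tree[0]` is the root label `dealer`, and the used-car ad appears as an `ad` node with `model` and `year` leaves.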
Simplified XML
[Figure: an unordered data tree, annotated with its labelling function and its value function. Root catalog has two product children; products have children name, price, cat (written category on one of them) and possibly picture; subcategory sits below cat. Values shown: name = can, price = 444, cat = electronique; name = nik, price = 234, cat = electronic; subcategory = camera (both products); picture = c.jpg.]
Simple XML types
[Figure: a tree type. catalog → product; product → name, price, cat, picture; cat → subcategory. Two edges carry the multiplicity *.]
Multiplicities:
• 1 : 1 child (default)
• * : 0 or more
• + : 1 or more
• ? : 0 or 1
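A conformance check against such a simple tree type might look as follows. This is a minimal sketch under stated assumptions: nodes are `(label, children)` tuples, the names `BOUNDS`, `TREE_TYPE` and `conforms` are invented here, and the placement of the two `*` marks (on product and on picture) is a guess at the figure.

```python
from collections import Counter

# Multiplicity symbol -> (min, max) number of children; None means unbounded.
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

# Hypothetical tree type, loosely following the catalog figure.
TREE_TYPE = {
    "catalog": {"product": "*"},
    "product": {"name": "1", "price": "1", "cat": "1", "picture": "*"},
    "cat": {"subcategory": "1"},
}

def conforms(node, tree_type):
    """Check a (label, children) tree against the tree type, ignoring order."""
    label, children = node
    allowed = tree_type.get(label, {})  # labels without a rule must be leaves
    counts = Counter(child[0] for child in children)
    if any(child_label not in allowed for child_label in counts):
        return False
    for child_label, mult in allowed.items():
        low, high = BOUNDS[mult]
        n = counts[child_label]
        if n < low or (high is not None and n > high):
            return False
    return all(conforms(child, tree_type) for child in children)
```

Because children are only counted per label, the check is independent of sibling order, which matches the unordered-tree assumption.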
Prefix Selection Queries (ps-queries)
[Figure: two ps-queries drawn as tree patterns.]
• Query1: catalog → product → name, price (< 200), cat (= elec) → subcategory
• Query2: catalog → product → name, picture
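One way to read a ps-query is as a tree pattern whose answer is the prefix of the data tree reached by the pattern. The sketch below follows that reading; the dict layout, the helper names `node`, `pattern` and `answer`, the node identifiers and the akai price are all illustrative assumptions, and the semantics is simplified compared with the paper.

```python
def node(nid, label, value=None, children=()):
    return {"id": nid, "label": label, "value": value, "children": list(children)}

def pattern(label, cond=None, children=()):
    return {"label": label, "cond": cond, "children": list(children)}

def answer(n, q):
    """Return the prefix of n selected by pattern q, or None if n does not match."""
    if n["label"] != q["label"]:
        return None
    if q["cond"] is not None and not q["cond"](n["value"]):
        return None
    kept = []
    for qc in q["children"]:
        hits = [a for a in (answer(c, qc) for c in n["children"]) if a is not None]
        if not hits:
            return None  # every child of the pattern must be matched at least once
        kept.extend(hits)
    return node(n["id"], n["label"], n["value"], kept)

# Hypothetical catalog with two cameras.
CATALOG = node("&1", "catalog", children=[
    node("&245", "product", children=[
        node("&2", "name", "canon"), node("&3", "price", 120),
        node("&4", "cat", "elec", [node("&5", "subcategory", "camera")]),
        node("&6", "picture", "c.jpg")]),
    node("&7", "product", children=[
        node("&8", "name", "akai"), node("&9", "price", 250),
        node("&10", "cat", "elec", [node("&11", "subcategory", "camera")]),
        node("&12", "picture", "a.jpg")]),
])

QUERY1 = pattern("catalog", children=[
    pattern("product", children=[
        pattern("name"),
        pattern("price", lambda v: v < 200),
        pattern("cat", lambda v: v == "elec", [pattern("subcategory")]),
    ])])

A1 = answer(CATALOG, QUERY1)
```

On this data `A1` keeps only the canon product, without its picture, since picture is not part of Query1.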
Simplifications
Data
• No order
• No distinction attribute/element
• No recursion
• No links
Query
• No complex path expressions
• No join
• No repeated child
[Figure: a pattern that is NOT allowed: product with children name, cat = elec and cat = toy (cat is repeated).]
Crucial assumption: XID
[Figure: two partial answers about the same node are combined through its persistent identifier: prod &245 (canon, 120, elec, camera) + prod &245 (c.jpg) = prod &245 (canon, 120, elec, camera, c.jpg).]
Sources of persistent identifiers:
• URLs
• ID/IDrefs
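With persistent identifiers, combining two answers amounts to a union of partial trees keyed on node id. The following sketch illustrates the idea; the dict layout, the `leaf` and `merge` helpers, and every identifier other than &245 are assumptions made for the example.

```python
def leaf(nid, label, value, children=()):
    return {"id": nid, "label": label, "value": value, "children": list(children)}

def merge(a, b):
    """Merge two partial descriptions of the same node, matching children by id."""
    if a["id"] != b["id"]:
        raise ValueError("can only merge descriptions of the same node")
    by_id = {c["id"]: c for c in a["children"]}
    for c in b["children"]:
        by_id[c["id"]] = merge(by_id[c["id"]], c) if c["id"] in by_id else c
    return {"id": a["id"], "label": a["label"], "value": a["value"],
            "children": list(by_id.values())}

# Answer fragments for node &245; the child identifiers are hypothetical.
first = leaf("&245", "prod", None, [
    leaf("&246", "name", "canon"),
    leaf("&247", "price", 120),
    leaf("&248", "cat", "elec", [leaf("&249", "subcategory", "camera")]),
])
second = leaf("&245", "prod", None, [leaf("&250", "picture", "c.jpg")])

merged = merge(first, second)
```

Merging a fragment with itself adds nothing, which is what makes identifiers so convenient for accumulating answers.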
Representation of incomplete information:
Incomplete trees
Document Type Definitions (DTDs) are used to represent incompleteness
• Set of rules: e → r
  • e: element name
  • r: regular expression
• Set of trees satisfying a DTD d: tree(d)
• Shortcoming of DTDs
  • An element has a single definition independently of the context
  • Type of ad depends on the context
[Figure: dealer has children newcar and usedcar, each with ad children; one kind of ad has model and year, the other only model.]
Solution: specialization (decoupled tags)
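• ad_used and ad_new, with h(ad_used) = h(ad_new) = ad
[Figure: on one side, the specialized tree: dealer → newcar, usedcar; newcar → ad_new; usedcar → ad_used; the two ad types have different children (model and year versus model only). The mapping h sends it to the plain tree on the other side, where both nodes are labelled ad.]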
DTDs + Specialization
The sets of trees that can be specified are the regular unranked tree languages [Brüggemann-Klein, Murata, Wood]
• Same closure properties: intersection, union, complement
• Same complexity
Example
Q1: name, subcat, price of electronic products with price less than $200
Q2: name, pictures of cameras pictured at least once
--------------------------------------------------------
Q3: name, price, pictures of cameras costing less than $100 and pictured at least once
  can be completely answered using A1, A2
Q4: list all cameras
  can be partially answered using A1, A2
Q1: name, subcat, price of electronic products with price less than 200
[Figure: knowledge after Q1. Under catalog, three known products: (canon, 120, elec, camera), (nikon, 199, elec, camera), (sony, 175, elec, cdplayer). Missing: any number (*) of products of type product1 and of type product2.]
Missing data after Q1
[Figure: the two types of missing products. Both have children name, price, cat (with subcategory) and picture*.]
• product1: condition cat != elec
• product2: conditions cat = elec and price > 200
Q2: name, pictures of cameras pictured at least once
[Figure: knowledge after Q1 and Q2. Known products under catalog: (canon, 120, elec, camera, c.jpg), (nikon, 199, elec, camera), (sony, 175, elec, cdplayer), (akai, elec, camera, a.jpg). Missing: any number (*) of products of type product1 and of the refinements of product2, labelled product2a, product2b and product2c.]
Incomplete information
• Known information
  • Prefix of the real data tree
• Missing information
  • Extended tree type
  • Conditions on data values
  • Specializations, disjunctions
[Figure: an incomplete tree, split into known data and missing data. The missing data is described by the product types product1, product2a, product2b, product2c and product3, summarized as product +. Conditions shown on these types: cat != elec (product1); cat = elec and price > 200 (product2a, product2b, product2c); subcategory != camera (on two of the types); no picture (on two of the types).]
Known data
Missing data
Answering Queries
Complete answer to Q3
• Q3: name, price, pictures of cameras costing less than $150 and having at least one picture
• Can be fully answered using available information
• Need to check whether answer is complete
[Figure: the answer: catalog → prod (canon, 120, c.jpg).]
Incomplete answer to Q4
• Provide known cameras
• Explain incompleteness
[Figure: the names shown are canon, nikon, sony, akai, followed by "more products" with name, price > 200 and no picture.]
Completing answer to Q4
• It suffices to ask:
[Figure: the completing query: product with name, price > 200, cat = elec with sub = camera, and 0 pictures.]
Revisit the types
• DTD
• Conditions
• Specialization: same element name may have several types
• Not sufficient
• Need to extend the types again: disjunctions
[Figure: type product2b: name, price (> 200), cat (= elec) with subcategory != camera, picture*.]
Disjunction
[Figure: a type where vehicle has children data and description, and data has optional (?) children engine and sail. Two queries, Query1' and Query2', are tree patterns over vehicle, data, description, engine and sail. For the vehicle node &322 (data = "….", description = "….") an answer is marked "Empty!".]
Disjunction continued
• Type of &322: vehicle1 + vehicle2
[Figure: vehicle1 has children data (with engine) and description; vehicle2 has children data (with sail) and description.]
The type of &322 cannot be described independently of the type of the data node below it
Results
Representation System (Imielinski-Lipski)
[Figure: commuting diagram. T, the representation of information, is mapped by rep to rep(T), the set of possible worlds. q(T), the representation of the result, is mapped by rep to rep(q(T)), the set of possible answers. The requirement is q(rep(T)) = rep(q(T)).]
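Written out, the requirement says that querying the representation and then expanding it gives the same set as querying every possible world. The formulation below assumes the usual convention that q is applied pointwise to a set of trees; the slide itself only shows the diagram.

```latex
\[
  q(\mathrm{rep}(T)) \;=\; \{\, q(t) \mid t \in \mathrm{rep}(T) \,\},
  \qquad
  \mathrm{rep}(q(T)) \;=\; q(\mathrm{rep}(T)).
\]
```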
Representation System for PS-queries
• Incomplete tree T to represent $q_1^{-1}(A_1) \cap \dots \cap q_k^{-1}(A_k)$
• PS-query q
• q(T) can be computed in ptime (a representation of the answer can be computed in ptime)
Querying Incomplete Trees
• Given T and a query q, one can
  • Give in ptime the sure answers up to our current knowledge
  • Check in ptime whether query q can be fully answered
  • Generate in ptime queries to complete the answer
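The completeness check can be illustrated on the catalog example with a toy encoding. This is a sketch under strong assumptions, not the algorithm of the paper: missing products are described by flat per-field constraints, the type names in `MISSING` are invented, price bounds are half-open ranges (the missing products are taken to cost 200 or more), and a query can be fully answered exactly when no missing type is compatible with it.

```python
import math

# Constraint encodings (hypothetical, for illustration only):
#   ("range", lo, hi) : numeric value v with lo <= v < hi
#   ("eq", x)         : value equal to x
#   ("ne", x)         : value different from x

def compatible(c1, c2):
    """True if some value can satisfy both constraints."""
    k1, k2 = c1[0], c2[0]
    if k1 == "range" and k2 == "range":
        return max(c1[1], c2[1]) < min(c1[2], c2[2])
    if k1 == "eq" and k2 == "eq":
        return c1[1] == c2[1]
    if {k1, k2} == {"eq", "ne"}:
        return c1[1] != c2[1]
    if k1 == "ne" and k2 == "ne":
        return True  # assumes the value domain is large enough
    raise ValueError("constraints of different kinds on the same field")

def blocking_types(query, missing_types):
    """Names of missing-product types that may still hide matches for the query."""
    return [
        name
        for name, conds in missing_types.items()
        if all(compatible(c, conds[f]) for f, c in query.items() if f in conds)
    ]

EXPENSIVE = ("range", 200, math.inf)
MISSING = {
    "not_elec": {"cat": ("ne", "elec")},
    "elec_not_camera": {"cat": ("eq", "elec"), "price": EXPENSIVE,
                        "subcategory": ("ne", "camera")},
    "camera_no_picture": {"cat": ("eq", "elec"), "price": EXPENSIVE,
                          "subcategory": ("eq", "camera"),
                          "has_picture": ("eq", False)},
}

# Q3: pictured cameras below a price bound; any bound under 200 behaves the same.
Q3 = {"cat": ("eq", "elec"), "subcategory": ("eq", "camera"),
      "price": ("range", -math.inf, 150), "has_picture": ("eq", True)}
# Q4: all cameras.
Q4 = {"cat": ("eq", "elec"), "subcategory": ("eq", "camera")}
```

In this encoding `blocking_types(Q3, MISSING)` is empty, so Q3 is complete, while Q4 is blocked only by the cameras without a picture, which is the part of the data a completing query has to fetch.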
Comparison with IL
Relational model
• Relational calculus/algebra
• Conditional table
• Closed or open world
• Representation system
XML tree model
• Weaker language (no join)
• Weaker system (no variable)
+ Closed and open World
• Representation system
Drawback: exponential blowup
• Incomplete information may become exponential w.r.t. the sequence of query/answer pairs q1/A1; q2/A2; …
[Figure: Type: database with children a and b. Queries qi: database with a = i and b = i. Answers are empty.]
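One plausible reading of this example: each empty answer to qi says that the database cannot have both a child a = i and a child b = i. Without disjunction, a type has to commit, for every i, to which of the two is absent, which gives 2^k cases after k queries. The sketch below simply enumerates those cases; the function and key names are invented for the illustration.

```python
from itertools import product

def conjunctive_cases(k):
    """One case per way of explaining k empty answers without using disjunction."""
    cases = []
    for choice in product("ab", repeat=k):
        cases.append({
            # values that no a-child may take in this case
            "a_avoids": {i for i, side in enumerate(choice, start=1) if side == "a"},
            # values that no b-child may take in this case
            "b_avoids": {i for i, side in enumerate(choice, start=1) if side == "b"},
        })
    return cases
```

For example, `len(conjunctive_cases(10))` is 1024, although only ten queries were asked.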
Dealing with exponential blowup
• Make the representation more complex using disjunctions of types
  • Size of representation stays polynomial
  • Manipulations much more complex
• Restrict tree types and PS-queries
  • Already very/too? simple
• Accept to lose some information
• Ask extra queries to simplify representation
Discussion
Discussion: extend language
• Some results in paper
• Extensions often lead to intractability
• E.g., k-pebble transducers [Milo, Suciu, Vianu], which somehow subsume XML-QL and XSL
  • No (known) representation system
  • Testing whether rep(T) is empty is non-elementary
Discussion: node Ids
Without node Ids
• much less information to integrate results
• more complex
• tedious case analysis
Discussion: ordering
• Ordering in XML, DTD, queries
• Problem is totally different and very complex
• Example:
  • Q1/A1: list of males; Q2/A2: list of females; Q3: list all
• Depending on the type of input
  • (Male)*(Female)*: A3 = A1 || A2
  • (Male Female)*: A3 = shuffle(A1, A2)
  • (Male + Female)*: we cannot answer A3
• Regular expression processing
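The three cases of the male/female example can be sketched as a small function. It assumes that || denotes concatenation and that shuffle means strict alternation, as the type (Male Female)* forces; the function name `combine` and the string keys are choices made for this illustration.

```python
from itertools import chain

def combine(input_type, males, females):
    """Build the ordered answer A3 from A1 (males) and A2 (females), if determined."""
    if input_type == "(Male)*(Female)*":
        return males + females  # all males come first
    if input_type == "(Male Female)*":
        if len(males) != len(females):
            raise ValueError("answers do not fit the type (Male Female)*")
        return list(chain.from_iterable(zip(males, females)))  # strict alternation
    if input_type == "(Male + Female)*":
        return None  # the relative order of males and females is unknown
    raise ValueError("unknown input type")
```

With A1 = ["m1", "m2"] and A2 = ["f1", "f2"], the first type yields m1, m2, f1, f2 and the second m1, f1, m2, f2; for the third type the cached answers do not determine A3.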
Conclusion
• Framework for acquiring, maintaining, querying incomplete XML data
• Limitations:
  • simple queries
  • no order and Id assumption
  • small extensions lead to problems
• Possible to represent the incompleteness
• Possible to answer with incompleteness
• Possible to obtain queries to provide full answer