Representing and Querying XML with Incomplete Information
Serge Abiteboul (INRIA), Luc Segoufin (INRIA), Victor Vianu (UCSD)


PODS 2001, Abiteboul, Segoufin, Vianu

Organization

• Motivations
• Simplifying assumptions
• Model of incompleteness
• Answering queries
• Results
• Discussion
• Conclusion


Motivations

The Web is a world of incompleteness

• Information you get from the Web is seldom complete:
  • Queries return some, not all, of the data
  • Limited storage capability
  • Documents change on the Web: expiration
  • Sites are unavailable…

• Context: A warehouse of XML documents from the Web, Xyleme

This work

• This work: a simple, practically appealing approach to managing incomplete information
• Sequence of queries to the Web
  • (q1,A1)+(q2,A2)+…
  • Answers are cached
• Process a new query without access to the Web
  • Give an incomplete answer
  • Explain incompleteness to user
  • Seek additional information, i.e., find a minimal set of queries to fully answer

Related work

• Semantic caching
• Answering queries using views
  • keep (Qi,Ai)
  • try to rewrite query Q into Q'(A1,...,An)
  • reject if you cannot
• Incomplete database
  • (Qi,Ai) is some incomplete knowledge of DB
  • Related to querying incomplete information, e.g. Lipski-Imielinski

Challenge: balance expressiveness and tractability

• Choice of data model
• Choice of the query language
• Choice of a representation of incompleteness
• Results
  • Simple, practical solution
  • Extra features lead to serious problems


Simplifying Assumptions

Data is XML: trees

<dealer>
  <UsedCars>
    <ad>
      <model>Honda</model>
      <year>96</year>
    </ad>
  </UsedCars>
  <NewCars>
    <ad>
      <model>Acura</model>
    </ad>
  </NewCars>
</dealer>

[Figure: the same document drawn as a tree. dealer has children UsedCars and NewCars; the ad under UsedCars has children model (Honda) and year (96); the ad under NewCars has the child model (Acura).]

Simplified XML

• Unordered trees
• Labelling function (element names): catalog, product, name, price, cat, picture, subcategory
• Value function (data values): e.g. =can, =444, =c.jpg

[Figure: catalog with two product nodes. One product has name = can, price = 444, cat = electronic with subcategory = camera, and picture = c.jpg. The other has name = nik, price = 234, category = electronic with subcategory = camera.]
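The simplified data model above (unordered trees with a labelling function and a value function) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Node` class and the helpers `canonical` and `same_unordered_tree` are hypothetical names, not part of the talk, and the example values are the ones shown in the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                       # labelling function: element name
    value: Optional[object] = None   # value function: data value, if any
    children: List["Node"] = field(default_factory=list)


def canonical(node):
    """Return an order-insensitive form of the tree (children are sorted)."""
    kids = sorted((canonical(c) for c in node.children), key=repr)
    return (node.label, node.value, tuple(kids))


def same_unordered_tree(a, b):
    """Two trees are equal as unordered trees if their canonical forms agree."""
    return canonical(a) == canonical(b)


# The same product written with its children in two different orders.
a = Node("product", None, [
    Node("name", "can"),
    Node("price", 444),
    Node("cat", "electronic", [Node("subcategory", "camera")]),
    Node("picture", "c.jpg"),
])
b = Node("product", None, [
    Node("picture", "c.jpg"),
    Node("cat", "electronic", [Node("subcategory", "camera")]),
    Node("price", 444),
    Node("name", "can"),
])
```

Here `same_unordered_tree(a, b)` holds because sibling order is ignored, which is the point of the "unordered trees" simplification.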

Simple XML types

[Figure: tree type. catalog has children product*; product has children name, price, cat and picture*; subcategory appears below cat.]

Multiplicities:
• 1 : 1 child (default)
• * : 0 or more
• + : 1 or more
• ? : 0 or 1

Prefix Selection Queries (ps-queries)

[Figure: two query trees. Query1: catalog / product with children name, price (< 200) and cat (= elec) with subcategory below it. Query2: catalog / product with children name and picture.]
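One way to read a ps-query is as a tree pattern with optional conditions on values, whose answer is the prefix of the data tree covered by matches of the whole pattern. The sketch below evaluates Query1 under that reading. It is an assumption-laden illustration: the tuple encoding, the function `evaluate` and the helpers are hypothetical, and the akai price of 250 is invented for the example (the slides do not give it).

```python
def evaluate(node, query):
    """Return the matched prefix of `node`, or None if the pattern fails.

    node:  (label, value, children)
    query: (label, condition or None, subqueries)
    """
    label, value, children = node
    qlabel, cond, subqueries = query
    if label != qlabel or (cond is not None and not cond(value)):
        return None
    kept = []
    for sub in subqueries:
        matches = [m for m in (evaluate(c, sub) for c in children) if m is not None]
        if not matches:          # every query child must be matched at least once
            return None
        kept.extend(matches)
    return (label, value, kept)  # children not mentioned in the query are dropped


def leaf(label, value):
    return (label, value, [])


def product(name, price, cat, sub, pictures=()):
    return ("product", None, [
        leaf("name", name),
        leaf("price", price),
        ("cat", cat, [leaf("subcategory", sub)]),
        *[leaf("picture", p) for p in pictures],
    ])


catalog = ("catalog", None, [
    product("canon", 120, "elec", "camera", ["c.jpg"]),
    product("nikon", 199, "elec", "camera"),
    product("sony", 175, "elec", "cdplayer"),
    product("akai", 250, "elec", "camera", ["a.jpg"]),  # price 250 is assumed
])

# Query1: name, subcategory, price of electronic products with price < 200.
query1 = ("catalog", None, [
    ("product", None, [
        ("name", None, []),
        ("price", lambda v: v < 200, []),
        ("cat", lambda v: v == "elec", [("subcategory", None, [])]),
    ]),
])

answer1 = evaluate(catalog, query1)
names = [c[1] for prod in answer1[2] for c in prod[2] if c[0] == "name"]
```

The answer keeps only the queried children of each matching product, so pictures do not appear in `answer1` even for products that have one.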

Simplifications

Data

• No order
• No distinction attribute/element
• No recursion
• No links

Query

• No complex path expressions
• No join
• No repeated child

[Figure: a query on product with children name, cat=elec and cat=toy, marked NO because the child cat is repeated.]

Crucial assumption: XID

• URLs
• ID/IDrefs

[Figure: two partial views of the same node prod &245, one with canon, 120, elec, camera and one with c.jpg, are combined (+) into a single prod &245 with canon, 120, elec, camera and c.jpg.]
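The figure suggests why identifiers matter: two answers that mention the same node can be merged by matching ids. A minimal sketch, assuming each node is a dict with a persistent `id`; the function `merge` and all ids other than &245 are hypothetical and chosen only for illustration.

```python
def merge(a, b):
    """Merge two partial views of the same node, matching children by id."""
    if a["id"] != b["id"]:
        raise ValueError("can only merge views of the same node")
    children = {c["id"]: c for c in a["children"]}
    for c in b["children"]:
        if c["id"] in children:
            children[c["id"]] = merge(children[c["id"]], c)
        else:
            children[c["id"]] = c
    return {**a, "children": list(children.values())}


def node(id_, label, value=None, children=()):
    return {"id": id_, "label": label, "value": value, "children": list(children)}


# View obtained from a first answer: name, price, category of prod &245.
view1 = node("&245", "prod", None, [
    node("&246", "name", "canon"),          # ids &246..&250 are made up
    node("&247", "price", 120),
    node("&248", "cat", "elec", [node("&249", "subcategory", "camera")]),
])

# View obtained from a second answer: the picture of the same prod &245.
view2 = node("&245", "prod", None, [
    node("&250", "picture", "c.jpg"),
])

merged = merge(view1, view2)
labels = [c["label"] for c in merged["children"]]
```

Without ids there would be no safe way to tell that the two `prod` nodes are the same element rather than two different products.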


Representation of incomplete information:

Incomplete trees

Document Type Definitions (DTDs) are used to represent incompleteness

• Set of rules: e → r
  • e element name
  • r regular expression
• Set of trees satisfying a DTD d: tree(d)
• Shortcoming of DTDs
  • An element has a single definition independently of the context
  • Type of ad depends on the context

[Figure: dealer has children newcar and usedcar, each with an ad child; the leaves are model, year and model, so the two ad nodes have different children.]

Solution: specialization (decoupled tags)

• adused and adnew, with h(adused) = h(adnew) = ad

[Figure: the specialized tree (dealer with newcar and usedcar, whose ad children are typed adnew and adused, with leaves model, year, model) is mapped by h to the plain tree in which both are labelled ad.]

DTDs + Specialization

The sets of trees that can be specified: the regular unranked tree languages [Brüggemann-Klein, Murata, Wood]

• Same closure properties: intersection, union, complement

• Same complexity
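Specialization can be illustrated with a small conformance check for the dealer example. The sketch assumes unordered children with the multiplicities 1, ?, +, * used earlier, and it further assumes that under any one parent type no two specialized names share a tag, so the check stays deterministic. The names `RULES`, `H`, `tag` and `conforms` are hypothetical.

```python
from collections import Counter

# Specialized type: each specialized name lists its children with a multiplicity.
RULES = {
    "dealer": {"usedcar": "1", "newcar": "1"},
    "usedcar": {"ad_used": "*"},
    "newcar": {"ad_new": "*"},
    "ad_used": {"model": "1", "year": "1"},
    "ad_new": {"model": "1"},
    "model": {},
    "year": {},
}
# The mapping h from specialized names to tags; unlisted names map to themselves.
H = {"ad_used": "ad", "ad_new": "ad"}
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}


def tag(spec):
    return H.get(spec, spec)


def conforms(node, spec):
    """Check that node = (label, children) has the specialized type `spec`."""
    label, children = node
    if label != tag(spec):
        return False
    by_tag = {tag(child): child for child in RULES[spec]}  # assumes distinct tags
    counts = Counter(child[0] for child in children)
    if any(t not in by_tag for t in counts):
        return False
    for t, child_spec in by_tag.items():
        low, high = BOUNDS[RULES[spec][child_spec]]
        if counts[t] < low or (high is not None and counts[t] > high):
            return False
    return all(conforms(child, by_tag[child[0]]) for child in children)


good = ("dealer", [
    ("usedcar", [("ad", [("model", []), ("year", [])])]),
    ("newcar", [("ad", [("model", [])])]),
])
bad = ("dealer", [
    ("usedcar", [("ad", [("model", [])])]),            # used ad lacks year
    ("newcar", [("ad", [("model", []), ("year", [])])]),
])
```

Both ad nodes carry the tag ad, yet `conforms` types them differently depending on their parent, which a plain DTD with a single rule for ad cannot do.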

Example

Q1: name, subcat, price of electronic products with price less than $200

Q2: name, pictures of cameras pictured at least once

----------------------------

Q3: name, price, pictures of cameras costing less than $100 and pictured at least once

can be completely answered using A1, A2

Q4: list all cameras

can be partially answered using A1, A2

Q1: name, subcat, price of electronic products with price less than 200

[Figure: incomplete tree after Q1. Known products under catalog: canon, 120, elec, camera; nikon, 199, elec, camera; sony, 175, elec, cdplayer. Missing products are typed product1* and product2*.]

Missing data after Q1

[Figure: two types for missing products. product1: name, price, cat (!= elec) with subcategory, picture*. product2: name, price (> 200), cat (= elec) with subcategory, picture*.]

Q2: name, pictures of cameras pictured at least once

[Figure: incomplete tree after Q1 and Q2. Known products under catalog: canon, 120, elec, camera, c.jpg; nikon, 199, elec, camera; sony, 175, elec, cdplayer; akai, a.jpg, elec, camera. The type labels product1, product2, product2a, product2b and product2c appear on the missing part, some of them starred.]

Incomplete information

• Known information
  • Prefix of the real data tree
• Missing information
  • Extended tree type
  • Conditions on data values
  • Specializations, disjunctions

[Figure: known data and missing data after Q1 and Q2. Types shown for the missing products: product1 (name, price, cat != elec with subcategory, picture*); product2a, product2b and product2c (all with cat = elec and price > 200, refined by the conditions "subcategory != camera" and "no picture"); product3 (name, price, cat = elec). The figure is also annotated "product +".]


Answering Queries

Complete answer to Q3

• Q3: name, price, pictures of cameras costing less than $150 and having at least one picture
• Can be fully answered using available information
• Need to check whether answer is complete

[Figure: answer tree catalog / prod with canon, 120, c.jpg.]

Incomplete answer to Q4

• Provide known cameras
• Explain incompleteness

[Figure: the known answers canon, nikon, sony and akai, plus "more products" with a name, price > 200 and no picture.]

Completing answer to Q4

• It suffices to ask:

[Figure: query tree on product with name, price (> 200), cat (= elec) with sub = camera, and picture with multiplicity 0.]

Revisit the types

• DTD
• Conditions
• Specialization: same element name may have several types
• Not sufficient
• Need to extend the types again: disjunctions

[Figure: type product2b with name, price (> 200), cat (= elec), picture*, and subcategory != camera.]

Disjunction

[Figure: a type for vehicle over the elements data, description, engine and sail, two of them marked optional (?). Two queries, Query1' and Query2', are drawn over these elements. One answer is the node vehicle &322 with data = "…." and description = "…."; the other answer is empty.]

Disjunction continued

• Type of &322: vehicle1 + vehicle2

[Figure: vehicle1 is drawn with data, engine and description; vehicle2 with data, description and sail.]

The type of &322 cannot be described independently of that of data below


Results

Representation System: Lipski's + Imielinski's

[Figure: commuting diagram. T (representation of information) is mapped by rep to rep(T) (set of possible worlds). Applying q to T gives q(T) (representation of result); applying q to rep(T) gives the set of possible answers.]

q(rep(T)) = rep(q(T))

Representation System for PS-queries

• Incomplete tree T to represent q1^-1(A1) ∩ … ∩ qk^-1(Ak)
• PS-query q
• q(T) can be computed in PTIME (a representation of the answer can be computed in PTIME)

Querying Incomplete Trees

• Given T and a query q, one can
  • Give in PTIME the sure answers up to our current knowledge
  • Check in PTIME whether query q can be fully answered
  • Generate in PTIME queries to complete the answer


Comparison with IL

Relational model

• Relational calculus/algebra

• Conditional table

• Closed or open world

• Representation system

XML tree model

• Weaker language (no join)

• Weaker system (no variable)

+ Closed and open World

• Representation system

Drawback: exponential blowup

• Incomplete information may become exponential w.r.t. the sequence of query/answer pairs q1/A1; q2/A2; …

[Figure: Type: database with children a and b. Queries qi: database with a = i and b = i. Answers are empty.]

Dealing with exponential blowup

• Make the representation more complex using disjunctions of types
  • Size of representation stays polynomial
  • Manipulations much more complex
• Restrict tree types and PS-queries
  • Already very (too?) simple
• Accept to lose some information
• Ask extra queries to simplify representation


Discussion

Discussion: extend language

• Some results in paper
• Extensions often lead to intractability
• E.g.: k-pebble transducers [Milo, Suciu, Vianu], which somehow subsume XML-QL and XSL
  • No (known) representation system
  • Testing whether rep(T) is empty is non-elementary

Discussion: node Ids

Without node Ids
• much less information to integrate results
• more complex
• tedious case analysis

Discussion: ordering

• Ordering in XML, DTD, queries
  • Problem is totally different and very complex
• Example:
  • Q1/A1: list of males; Q2/A2: list of females; Q3: list all
• Depending on the type of input
  • (Male)*(Female)*: A3 = A1 || A2
  • (Male Female)*: A3 = shuffle(A1,A2)
  • (Male + Female)*: A3 cannot be determined
• Regular expression processing
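The three cases of the ordering example can be sketched as follows. This is a hedged illustration only: the function `combine` is hypothetical, the input type is passed as a plain string rather than parsed as a regular expression, and "shuffle" is read here as strict alternation of the two lists, which is an assumption about the intended meaning for (Male Female)*.

```python
from itertools import chain


def combine(input_type, males, females):
    """Rebuild the ordered answer A3 from A1 (males) and A2 (females).

    Returns None when the order cannot be recovered from A1 and A2 alone.
    """
    if input_type == "(Male)*(Female)*":
        # All males precede all females: plain concatenation A1 || A2.
        return males + females
    if input_type == "(Male Female)*":
        # Strict alternation, so the two lists must have the same length.
        if len(males) != len(females):
            raise ValueError("answers are inconsistent with (Male Female)*")
        return list(chain.from_iterable(zip(males, females)))
    if input_type == "(Male + Female)*":
        # Any interleaving is possible; A1 and A2 do not determine A3.
        return None
    raise ValueError("unknown input type: " + input_type)


a1 = ["m1", "m2"]
a2 = ["f1", "f2"]
concatenated = combine("(Male)*(Female)*", a1, a2)
alternated = combine("(Male Female)*", a1, a2)
unknown = combine("(Male + Female)*", a1, a2)
```

The same pair of cached answers yields three different outcomes, which is why ordered documents need reasoning over the regular expressions of the type.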

Conclusion

• Framework for acquiring, maintaining, querying incomplete XML data
• Limitations:
  • simple queries
  • no order and Id assumption
  • small extensions lead to problems
• Possible to represent the incompleteness
• Possible to answer with incompleteness
• Possible to obtain queries to provide full answer