From Semistructured Data to XML
Dan Suciu
AT&T Labs
http://www.research.att.com/~suciu/vldb99-tutorial.pdf
How the Web is Today
• HTML documents
• all intended for human consumption
• many generated automatically by applications
Easy to fetch any Web page, from any server, any platform
Limits of the Web Today
• applications cannot consume HTML
• HTML wrapper technology is brittle
  – screen scraping
• OO technology (Corba) requires controlled environment
• companies merge, form partnerships; need interoperability fast
people are inventive: send data by fax !
Paradigm Shift on the Web
• new Web standard XML:
  – XML generated by applications
  – XML consumed by applications
• data exchange
  – across platforms: enterprise interoperability
  – across enterprises
Web: from collection of documents to data and documents
Database Community Can Help
• query optimization, processing
• views, transformations
• data warehouses, data integration
• mediators, query rewriting
• secondary storage, indexes
But Needs a Paradigm Shift Too
• Web data differs from database data:
  – self-describing, schema-less
  – structure changes without notice
  – heterogeneous, deeply nested, irregular
  – documents and data mixed together
• designed by document, not db experts
• need Web data management
What This Tutorial is About
• what the database community has done:
  – semistructured data model
  – query languages, schemas
• what the Web community has done:
  – data formats/models: XML, RDF
  – transformation language (XSL), schemas
• where they meet and where they differ
Outline
• Semistructured data and XML
• Query languages
• Schemas
• Systems issues
• Conclusions
Part 1: Semistructured Data and XML
Semistructured Data
Origins:
• integration of heterogeneous sources
• data sources with non-rigid structure
• biological data
• Web data
The Semistructured Data Model
[Figure: Object Exchange Model (OEM) graph of the bibliography. The root &o1 (Bib) has paper, book, and paper edges to the objects &o12, &o24, and &o29. Other edge labels are author, title, year, http, publisher, references, page, firstname, lastname, first, and last; leaf values include “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122, and 133. Legend: complex object (inner node), atomic object (leaf).]
Syntax for Semistructured Data
Bib: &o1 { paper: &o12 { … },
book: &o24 { … },
paper: &o29
{ author: &o52 “Abiteboul”,
author: &o96 { firstname: &o243 “Victor”,
lastname: &o206 “Vianu”},
title: &o93 “Regular path queries with constraints”,
references: &o12,
references: &o24,
page: &o25 { first: &o64 122, last: &o92 133}
}
}
Syntax for Semistructured Data
May omit oid’s:
{ paper: { author: “Abiteboul”,
author: { firstname: “Victor”,
lastname: “Vianu”},
title: “Regular path queries …”,
page: { first: 122, last: 133 }
}
}
Characteristics of Semistructured Data
• missing or additional attributes
• multiple attributes
• different types in different objects
• heterogeneous collections
self-describing, irregular data, no a priori structure
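The syntax above can be mirrored in a few lines of Python. This is only a sketch: the list-of-pairs encoding and the helper name `children` are assumptions made for illustration, not part of OEM. A complex object is a list of (label, value) pairs, so repeated and missing labels are both allowed; an atomic object is a plain string or integer.

```python
# A complex object is a list of (label, value) pairs; an atomic object is a
# plain str or int. This encoding is an illustrative assumption.
paper = [
    ("author", "Abiteboul"),                                        # atomic author
    ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),   # complex author
    ("title", "Regular path queries …"),
    ("page", [("first", 122), ("last", 133)]),
]
db = [("paper", paper)]


def children(obj, label):
    """Return every value reached from obj by an edge carrying label."""
    if not isinstance(obj, list):      # atomic objects have no outgoing edges
        return []
    return [value for (edge_label, value) in obj if edge_label == label]


authors = children(paper, "author")    # two values, of different types
years = children(paper, "year")        # missing attribute: empty list, no error
```

The two author values have different types, and asking for the missing year label yields an empty list instead of an error.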
Comparison with Relational Data
{ row: { name: “John”, phone: 3634 },
row: { name: “Sue”, phone: 6343 },
row: { name: “Dick”, phone: 6363 }
}
name   phone
John   3634
Sue    6343
Dick   6363
[Figure: the same relation drawn as a tree: a root with three row edges, each row having a name edge (“John”, “Sue”, “Dick”) and a phone edge (3634, 6343, 6363)]
XML
• a W3C standard to complement HTML
• origins: structured text SGML
• motivation:
  – HTML describes presentation
  – XML describes content
• http://www.w3.org/TR/REC-xml (2/98)
[Figure: SGML, XML, HTML 4.0]
From HTML to XML
HTML describes the presentation
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
XML Terminology
• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red>, abbreviated <red/>
• an XML document: single root element
well formed XML document: if it has matching tags
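Well-formedness can be checked mechanically with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML: matching tags and a single root."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<book><title>Foundations…</title><author>Hull</author></book>"
bad = "<book><title>Foundations…</book>"   # <title> is never closed
two_roots = "<book/><book/>"               # more than one root element
```

Only `good` passes; the other two strings fail on a mismatched tag and on a second root element respectively.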
More XML: Attributes
<book price = “55” currency = “USD”>
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
attributes are alternative ways to represent data
More XML: Oids and References
<person id=“o555”> <name> Jane </name> </person>
<person id=“o456”> <name> Mary </name>
<children idref=“o123 o555”/>
</person>
<person id=“o123” mother=“o456”><name>John</name>
</person>
oids and references in XML are just syntax
XML Data Model
• does not exist
• Document Object Model (DOM):
  – http://www.w3.org/TR/REC-DOM-Level-1 (10/98)
  – class hierarchy (node, element, attribute, …)
  – objects have behavior
  – defines API to inspect/modify the document
XML Parsers
• traditional: return data structure (DOM?)
• event based: SAX (Simple API for XML)
  – http://www.megginson.com/SAX
  – write handler for start tag and for end tag
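A minimal sketch of the event-based style, using the `xml.sax` module from the Python standard library. The handler class `TagLogger` and the sample document are illustrative assumptions; the handler only records start-tag and end-tag events with their nesting depth and never builds a tree.

```python
import xml.sax


class TagLogger(xml.sax.ContentHandler):
    """Record each start/end tag event together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, self.depth))
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        self.events.append(("end", name, self.depth))


handler = TagLogger()
xml.sax.parseString(
    b"<book><title>Data on the Web</title><year>1999</year></book>", handler
)
```

Because the parser pushes events to the handler as it reads, memory use does not grow with document size, in contrast to a parser that returns a complete data structure.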
XML Namespaces
• http://www.w3.org/TR/REC-xml-names (1/99)
• name ::= [prefix:]localpart
<book xmlns:isbn=“www.isbn-org.org/def”>
<title> … </title>
<number> 15 </number>
<isbn:number> …. </isbn:number>
</book>
XML Namespaces
• syntactic: <number> , <isbn:number>
• semantic: provide URL for schema
<tag xmlns:mystyle = “http://…”>
…
<mystyle:title> … </mystyle:title>
<mystyle:number> …
</tag>
(the mystyle tags are defined at the URL given in xmlns:mystyle)
XML vs. Semistructured Data
• both described best by a graph
• both are schema-less, self-describing
Similarities and Differences
<person id=“o123”>
<name> Alan </name>
<age> 42 </age>
<email> ab@com </email>
</person>
{ person: &o123
{ name: “Alan”,
age: 42,
email: “ab@com” }
}
[Figure: both representations drawn as the same tree: person with children name (Alan), age (42), and email (ab@com). A second pair of drawings adds a father reference to each.]
<person father=“o123”> …</person>
{ person: { father: &o123 …}}
similar on trees, different on graphs
More Differences
• XML is ordered, ssd is not
• XML can mix text and elements:
<talk> Making Java easier to type and easier to type
  <speaker> Phil Wadler </speaker>
</talk>
• XML has lots of other stuff: entities, processing instructions, comments
RDF
• http://www.w3.org/TR/REC-rdf-syntax (2/99)
• purpose: metadata for Web
  – help search engines
• syntax in XML
• semantics: edge-labeled graphs
RDF Syntax
<rdf:Description about=“www.mypage.com”>
<about> birds, butterflies, snakes </about>
<author> <rdf:Description>
<firstname> John </firstname>
<lastname> Smith </lastname>
</rdf:Description>
</author>
</rdf:Description>
RDF Data Model
[Figure: edge-labeled graph. The node www.mypage.com has an about edge to “birds, butterflies, snakes” and an author edge to a node with firstname “John” and lastname “Smith”.]
the RDF Data Model is very close to semistructured data
More RDF Examples
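[Figure: the graph for www.mypage.com (about “birds, butterflies, snakes”; author with firstname “John”, lastname “Smith”) extended with a node www.anotherpage.com that has a related edge to www.mypage.com, an author edge to the same John Smith node, and an author edge to “Joe Doe”.]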
<rdf:Description about=“www.mypage.com”>
<about> birds, butterflies, snakes </about>
<author> <rdf:Description ID=“&o55”>
<firstname> John </firstname>
<lastname> Smith </lastname>
</rdf:Description> </author>
</rdf:Description>
<rdf:Description about=“www.anotherpage.com”>
  <related> <rdf:Description about=“www.mypage.com”/> </related>
  <author rdf:resource=“&o55”/>
  <author> Joe Doe </author>
</rdf:Description>
RDF Terminology
[Figure: a statement drawn as an edge labeled predicate from a subject node to an object node]
OEM                  RDF
node                 resource
label                property
source/label/dest    subject/predicate/object
edge                 statement
More RDF: Containers
• bag, sequence, alternative
<rdf:Description> <a> <rdf:Bag>
<rdf:li> s1 </rdf:li>
<rdf:li> s2 </rdf:li>
</rdf:Bag>
</a>
</rdf:Description>
RDF Containers (cont’d)
[Figure: an a edge leads to a container node, which has an rdf:type edge to Bag and edges rdf_1 and rdf_2 to s1 and s2]
More RDF: Higher Order Statements
“the author of www.thispage.com says: ‘the topic of www.thatpage.com is environment’”
[Figure: the statement (www.thatpage.com, topic, environment) is itself the target of a says edge from the author of www.thispage.com]
RDF uses reification
Summary of Data Models
• semistructured data, XML, RDF
• data is self-describing, irregular
• schema embedded in the data
Part 2: Query Languages
• Semistructured data and XML
• Query languages
• Schemas
• Systems issues
• Conclusions
Query Languages: Motivation
• granularity of the HTML Web: one file
• granularity of Web data varies:
  – single data item: “get John’s salary”
  – entire database: “get all salaries”
  – aggregates: “get average salary”
• need query language to define granularity
Query Languages: Outline
• for semistructured data:
  – Lorel
  – UnQL
  – StruQL
• for XML: XML-QL
• a different paradigm
  – structural recursion
  – XSL
Lorel
• part of the Lore system (Stanford)
• adapts OQL to semistructured data
example:
select X.title
from Bib.paper X
where X.year > 1995
abbreviated to:
select Bib.paper.title
from Bib.paper
where Bib.paper.year > 1995
Lorel vs. OQL
• implicit coercions: 1995 to “1995”
• missing attributes
  – empty answer vs. type error
• set-valued attributes
  – in X.year > 1995, X may have several years
• regular path expressions (next)
Regular Path Expressions
Useful for:
• syntactic substitute for inheritance: paper|book
• navigating partially known structures: lastname?
• transitive closure: reference+
select X.title
from Bib.paper X, Bib.(paper|book) Y
where Y.author.lastname? = “Ullman” and Y.reference+ X
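The transitive-closure operator can be evaluated with an ordinary graph search. The sketch below is an illustration, not Lorel's implementation: the graph encoding (a dict from node to a list of (label, target) edges), the function name `reach_plus`, and the node names are all assumptions.

```python
def reach_plus(graph, start, label):
    """Nodes reachable from start by one or more edges carrying label."""
    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for edge_label, target in graph.get(node, []):
            if edge_label == label and target not in seen:
                seen.add(target)          # each node is visited once,
                frontier.append(target)   # so cycles cannot cause a loop
    return seen


# Hypothetical citation graph: p1 cites p2, p2 cites p3, p3 cites p1.
graph = {
    "p1": [("reference", "p2")],
    "p2": [("reference", "p3"), ("title", "t2")],
    "p3": [("reference", "p1")],
}
cited = reach_plus(graph, "p1", "reference")
```

Because the search marks visited nodes, the cycle p1, p2, p3 is handled without looping, and p1 ends up in its own closure.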
UnQL
• Unstructured Query Language
• patterns, templates, structural recursion
• patterns:
select T
where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995
UnQL: Templates
select result: { fn: F, ln: L, pub: { title: T, year: Y }}
where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995
Result looks like:
{ result: { fn: “John”, ln: “Smith”, pub: { title: “P equals NP”, year: 2005}},
  result: { fn: “Joe”, ln: “Doe”, pub: { title: “Errata to P=NP”, year: 2006}}
  … }
Skolem Functions
• Maier, 1986
  – in OO systems
• Kifer et al., 1989
  – F-logic
• Hull and Yoshikawa, 1990
  – deductive db (ILOG)
• Papakonstantinou et al., 1996
  – semistructured db (MSL)
• illustrate with Strudel (next)
Skolem Functions in StruQL
• Strudel: a Web Site Management System
• StruQL: its query language
Example: Bibliography Data
{Bib: { paper: { author: “Jones”,
author: “Smith”,
title: “The Comma”,
year: 1994
}
},
{ paper: ….. }
}
Example: A Complex Web Site
[Figure: generated site graph. Root() has person edges to HomePage(“Smith”), HomePage(“Jones”), and HomePage(“Mark”). Home pages have yearentry edges to YearPage(“Smith”,1994), YearPage(“Smith”,1996), YearPage(“Jones”,1994), YearPage(“Jones”,1998), and YearPage(“Mark”,1996). Year pages have publication edges to PubPage(“The Comma”) and PubPage(“The Dot”), which carry title and author edges.]
Example: Skolem Functions in StruQL
where Root -> “Bib” -> X, X -> “paper” -> P,
      P -> “author” -> A, P -> “title” -> T, P -> “year” -> Y
create Root(), HomePage(A), YearPage(A,Y), PubPage(P)
link Root() -> “person” -> HomePage(A),
     HomePage(A) -> “yearentry” -> YearPage(A,Y),
     YearPage(A,Y) -> “publication” -> PubPage(P),
     PubPage(P) -> “author” -> HomePage(A),
     PubPage(P) -> “title” -> T
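The essence of a Skolem function is that the same arguments always name the same node, so pages are created once and shared. A Python sketch of that idea follows; it is not Strudel's implementation. The dict-based paper records, the function names `build_site` and `skolem`, and the second paper's author and year are assumptions made for illustration.

```python
def build_site(papers):
    """Mimic the create/link clauses: one node per distinct Skolem term."""
    nodes = {}        # Skolem term, e.g. ("HomePage", "Smith") -> node id
    edges = set()     # (source id, label, target)

    def skolem(name, *args):
        key = (name,) + args
        if key not in nodes:          # same arguments -> same node
            nodes[key] = len(nodes)
        return nodes[key]

    for paper_id, paper in enumerate(papers):
        for author in paper["authors"]:   # one binding of (P, A, T, Y) each
            root = skolem("Root")
            home = skolem("HomePage", author)
            year = skolem("YearPage", author, paper["year"])
            pub = skolem("PubPage", paper_id)
            edges.add((root, "person", home))
            edges.add((home, "yearentry", year))
            edges.add((year, "publication", pub))
            edges.add((pub, "author", home))
            edges.add((pub, "title", paper["title"]))
    return nodes, edges


papers = [
    {"authors": ["Jones", "Smith"], "title": "The Comma", "year": 1994},
    {"authors": ["Smith"], "title": "The Dot", "year": 1996},  # made-up details
]
nodes, edges = build_site(papers)
```

Smith appears in two bindings but gets a single home page, which is what lets the query fuse information from different papers into one page.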
XML-QL: A Query Language for XML
• http://www.w3.org/TR/NOTE-xml-ql (8/98)
• features:
  – regular path expressions
  – patterns, templates
  – Skolem Functions
• based on OEM data model
Pattern Matching in XML-QL
where <book language=“french”>
        <publisher> <name> Morgan Kaufmann </name> </publisher>
        <author> $a </author>
      </book> in “www.a.b.c/bib.xml”
construct $a
Simple Constructors in XML-QL
where <book language = $l> <author> $a </> </> in “www.a.b.c/bib.xml”
construct <result> <author> $a </> <lang> $l </> </>
Note: </> abbreviates </book> or </result> or ...
Result looks like:
<result> <author>Smith</author><lang>English</lang></result>
<result> <author>Smith</author><lang>Mandarin</lang></result>
<result> <author>Doe</author><lang>English</lang></result>
Skolem Functions in XML-QL
where <book language = $l> <author> $a </> </> in “www.a.b.c/bib.xml”
construct <result> <author id=F($a)> $a</> <lang> $l </> </>
Result looks like:
<result> <author>Smith</author> <lang>English</lang> <lang>Mandarin</lang> </result>
<result> <author>Doe</author> <lang>English</lang> </result>
A Different Paradigm: Structural Recursion
Data as sets with a union operator:
{a:3, a:{b:“one”, c:5}, b:4} =
{a:3} U {a:{b:“one”,c:5}} U {b:4}
Structural Recursion
Example: retrieve all integers in the data
f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = f(T)
f({}) = {}
f(V) = if isInt(V) then {result: V} else {}
[Figure: the input tree {a:3, a:{b:“one”, c:5}, b:4} is mapped to {result:3, result:5, result:4}]
standard textbook programming on trees
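The four equations translate almost line for line into a recursive function. In this sketch a tree is assumed to be a Python list of (label, subtree) pairs and an atomic value a str or int; the name `all_integers` is made up.

```python
def all_integers(tree):
    """Structural recursion: collect every integer leaf as a result edge."""
    if isinstance(tree, list):
        # f(T1 U T2) = f(T1) U f(T2) and f({L: T}) = f(T):
        # recurse into each edge, drop its label, and union the results.
        out = []
        for _label, subtree in tree:
            out.extend(all_integers(subtree))
        return out                       # f({}) = {} falls out as []
    if isinstance(tree, int) and not isinstance(tree, bool):
        return [("result", tree)]        # f(V) = {result: V} when isInt(V)
    return []                            # any other atomic value


data = [("a", 3), ("a", [("b", "one"), ("c", 5)]), ("b", 4)]
found = all_integers(data)
```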
Structural Recursion
Example: increase all engine prices by 10%
f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = if L = engine then {L: g(T)} else {L: f(T)}
f({}) = {}
f(V) = V
g(T1 U T2) = g(T1) U g(T2)
g({L: T}) = if L = price then {L: 1.1*T} else {L: g(T)}
g({}) = {}
g(V) = V
[Figure: input tree with engine and body subtrees, each containing part and price edges with prices 100 and 1000. In the output the prices under engine become 110 and 1100; the prices under body stay 100 and 1000.]
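The two mutually dependent functions f and g can be folded into one Python function with a flag that records whether an engine edge has been crossed. This is a sketch under assumptions: trees are lists of (label, subtree) pairs, every price is a numeric leaf, and the sample data only approximates the figure.

```python
def raise_engine_prices(tree, inside_engine=False):
    """f when inside_engine is False, g when it is True."""
    if not isinstance(tree, list):
        return tree                                   # f(V) = V, g(V) = V
    out = []
    for label, subtree in tree:
        if inside_engine and label == "price":
            out.append((label, 1.1 * subtree))        # g({price: T})
        else:
            below_engine = inside_engine or label == "engine"
            out.append((label, raise_engine_prices(subtree, below_engine)))
    return out


# Illustrative data: prices appear at two depths under engine and body.
car = [
    ("engine", [("part", [("price", 100)]), ("price", 1000)]),
    ("body", [("part", [("price", 100)]), ("price", 1000)]),
]
updated = raise_engine_prices(car)
```

The result keeps the exact shape of the input: only leaves under engine change, and the floating-point products equal 110 and 1100 up to rounding.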
XSL
• two W3C drafts: XSLT and XPath
  – http://www.w3.org/TR/xpath, 7/99
  – http://www.w3.org/TR/WD-xslt, 7/99
• in commercial products (e.g. IE5.0)
• purpose: stylesheet specification language:
  – stylesheet: XML -> HTML
  – in general: XML -> XML
XSL Templates and Rules
• query = collection of template rules
• template rule = match pattern + template
Retrieve all book titles:
<xsl:template> <xsl:apply-templates/> </xsl:template>
<xsl:template match = “/bib/*/title”>
  <result> <xsl:value-of/> </result>
</xsl:template>
XPath Expressions in Match Patterns
bib matches a bib element
* matches any element
/ matches the root element
/bib matches a bib element under root
bib/paper matches a paper in bib
bib//paper matches a paper in bib, at any depth
//paper matches a paper at any depth
paper|book matches a paper or a book
@price matches a price attribute
bib/book/@price matches price attribute in book, in bib
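Several of these patterns can be tried with the limited XPath subset in Python's `xml.etree.ElementTree`. This is a sketch with a made-up document. Two caveats: ElementTree paths are evaluated relative to an element (here the bib root), and it supports neither the `|` union nor a trailing `@price` step, so those two are emulated in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this illustration.
bib = ET.fromstring("""
<bib>
  <book price="55"><title>Foundations of Databases</title></book>
  <paper><title>Regular path queries with constraints</title></paper>
  <section><paper><title>Nested paper</title></paper></section>
</bib>
""")

direct_papers = bib.findall("paper")        # bib/paper
all_papers = bib.findall(".//paper")        # bib//paper
any_child = bib.findall("*")                # bib/*
titles = [t.text for t in bib.findall("*/title")]             # /bib/*/title
paper_or_book = bib.findall("paper") + bib.findall("book")    # paper|book
prices = [b.get("price") for b in bib.findall("book")]        # bib/book/@price
```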
Flow Control in XSL
<xsl:template> <xsl:apply-templates/> </xsl:template>
<xsl:template match=“a”> <A><xsl:apply-templates/></A></xsl:template>
<xsl:template match=“b”> <B><xsl:apply-templates/></B></xsl:template>
<xsl:template match=“c”> <C><xsl:value-of/></C></xsl:template>
<a> <e> <b> <c> 1 </c>
<c> 2 </c>
</b>
<a> <c> 3 </c>
</a>
</e>
<c> 4 </c>
</a>
<A> <B> <C> 1 </C>
<C> 2 </C>
</B>
<A> <C> 3 </C>
</A>
<C> 4 </C>
</A>
XSL is Structural Recursion
Equivalent to:
f(T1 U T2) = f(T1) U f(T2)
f({L: T}) = if L = c then {C: T}
            else if L = b then {B: f(T)}
            else if L = a then {A: f(T)}
            else f(T)
f({}) = {}
f(V) = V
XSL query = single function
XSL query with modes = multiple functions
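To see the correspondence concretely, the four template rules can be written as a single recursive Python function over an `xml.etree.ElementTree` tree. This is a sketch of the idea, not an XSLT processor: the names `transform` and `RENAME` are made up, and text directly under a, b, or e elements is ignored, which is harmless here because that text is only whitespace.

```python
import xml.etree.ElementTree as ET

RENAME = {"a": "A", "b": "B"}     # rules for match="a" and match="b"


def transform(elem):
    """Return the list of output elements produced for one input element."""
    if elem.tag == "c":                       # rule for match="c": copy the value
        out = ET.Element("C")
        out.text = (elem.text or "").strip()
        return [out]
    results = []
    for child in elem:                        # <xsl:apply-templates/>
        results.extend(transform(child))
    if elem.tag in RENAME:                    # wrap the children's output
        out = ET.Element(RENAME[elem.tag])
        out.extend(results)
        return [out]
    return results                            # default rule: pass children through


source = ET.fromstring(
    "<a><e><b><c>1</c><c>2</c></b><a><c>3</c></a></e><c>4</c></a>"
)
output = ET.tostring(transform(source)[0], encoding="unicode")
```

The e element disappears because the default rule returns its children's output unwrapped, matching the flow-control example.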
XSL and Structural Recursion
XSL:
• trees only
• may loop
Structural Recursion:
• arbitrary graphs
• always terminates
add the following rule:
<xsl:template match = “e”> <xsl:apply-templates select=“/”/> </xsl:template>
stack overflow on IE 5.0
Summary of Query Languages
• studied extensively in semistructured data
• some quite powerful features
• no standard for XML QL yet (WG soon)
• XSL available today (for stylesheets)
• XSL = structural recursion
Part 3: Schemas
• Semistructured data and XML
• Query languages
• Schemas
• Systems issues
• Conclusions
Schemas
• why ?
  – XML: to describe semantics
  – semistructured data: to improve processing
• what ?
  – semistructured data: foundational (here lies our interest)
  – XML: several concrete proposals
Schemas
• when ?
  – semistructured data, XML: a posteriori
  – RDBMS: a priori, to interpret binary data
• how ?
  – semistructured data: schema is independent
  – XML: schema is hardwired with the data
Outline
• schemas for semistructured data:
  – foundations
  – schema extraction
• schemas for XML:
  – DTD
  – XML-Schema
  – RDF-Schema
Schemas: An Example
Some database:
[Figure: edge-labeled data graph rooted at &r1. Company nodes &c1 and &c2 carry name, address, and url edges (“Widget”, “Trenton”, “Gadget”, “www.gp.fr”, “Paris”). Person nodes &p1, &p2, &p3 carry name, position, and phone edges (“Smith”, “Manager”, “Jones”, “5552121”, “Dupont”, “Sales”). Companies and persons are connected by employee, manages, c.e.o., and works-for edges. A description subgraph &a1 to &a7 uses the labels procurement, salesrep, contact, task, eval, 1997, and 1998, with values “on target” and “below target”.]
Lower-Bound Schemas
[Figure: schema graph with nodes Root, Company, Employee, and string. Edge labels: company, person, works-for, c.e.o., address, name, managed-by.]
Upper Bound Schemas
[Figure: schema graph with nodes Root, Company, Employee, string, and Any. Edge labels: company, person, works-for, c.e.o. | employee, name | address | url, managed-by, name | phone | position, description, and a wildcard edge on Any.]
The Two Questions to Ask
Conformance: does that data conform to this schema ?
Classification: if so, then which objects belong to what classes ?
Graph Simulation
Definition. Let G1, G2 be two edge-labeled graphs.
A simulation is a relation R between the nodes of G1 and the nodes of G2 such that:
• if (x1, x2) in R and (x1, a, y1) in G1, then there exists (x2, a, y2) in G2 (same label) s.t. (y1, y2) in R
[Figure: x1 in G1 and x2 in G2 are related by R; their a-successors y1 and y2 are again related by R]
Note: a simulation can be efficiently computed [Henzinger et al. 1995]
Using Simulation
Data graph D, schema S
• upper bound schema:
  – conformance: find simulation R from D to S
  – classification: check if (x,c) in R
• lower bound schema:
  – conformance: find simulation R from S to D
  – classification: check if (c,x) in R
[Buneman et al 1997]
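The definition suggests a direct way to compute the largest simulation: start from all pairs and delete pairs that violate the condition until nothing changes. The sketch below is this naive fixpoint, not the efficient algorithm cited above. Graphs are assumed to be dicts from node to a list of (label, target) edges, with every node present as a key; the data and schema are small made-up examples.

```python
def max_simulation(g1, g2):
    """Largest simulation from g1 to g2, by naive refinement."""
    rel = {(x1, x2) for x1 in g1 for x2 in g2}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (a, y1) in g1[x1]:
                # x2 must offer an a-edge to some y2 that still simulates y1
                if not any(b == a and (y1, y2) in rel for (b, y2) in g2[x2]):
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Hypothetical data graph D and upper-bound schema S.
D = {
    "r": [("company", "c1"), ("company", "c2")],
    "c1": [("name", "s1")],
    "c2": [("name", "s2"), ("url", "s3")],
    "s1": [], "s2": [], "s3": [],
}
S = {
    "Root": [("company", "Company")],
    "Company": [("name", "string"), ("url", "string"), ("address", "string")],
    "string": [],
}
R = max_simulation(D, S)
conforms = ("r", "Root") in R          # conformance to the upper bound
is_company = ("c1", "Company") in R    # classification of &c1
```

For a lower-bound schema the same function is called with the arguments swapped, `max_simulation(S, D)`.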
Example
[Figure: three panels side by side: Database (the data graph rooted at &r1), Lower Bound (schema with nodes Root, Company, Employee, string), and Upper Bound (schema with nodes Root, Company, Employee, string, Any)]
simulation: efficient technique for checking conformance to schema
Application 1: Improve Secondary Storage
Lower-bound schema:
[Figure: schema graph with nodes Root, Company, Employee, and string. Edge labels: company, person, works-for, c.e.o., address, name, managed-by.]
Company
oid   name   address   c.e.o.
…     …      …         …
…     …      …         …
Employee
oid   name   managed-by   works-for
…     …      …            …
…     …      …            …
Store rest in overflow graph
Application 2: Query Optimization
Upper-bound schema [Fernandez, Suciu 1998]:
[Figure: schema graph rooted at Bib with paper and book edges. Edge labels: year, journal, title, address, author, zip, city, street, lastname, firstname. Leaf types: int and string.]
select X.title
from Bib._ X
where X.*.zip = “12345”
is rewritten to:
select X.title
from Bib.book X
where X.address.zip = “12345”
Schema Extraction (From Data)
Problem statement
• given data instance D
• find the “most specific” schema S for D
In practice: S too large, need to relax
[Nestorov et al. 1998]
Schema Extraction: Sample Data
[Figure: data graph. The root &r has a company edge to &c and employee edges to &p1 through &p8. Employees have worksfor edges to &c and are linked to each other by manages and managedby edges.]
Lower Bound Schema Extraction
Root: &r
Bosses: &p1, &p4, &p6
Regulars: &p2, &p3, &p5, &p7, &p8
Company: &c
[Figure: schema graph over these four classes with edges company, employee, manages, managedby, worksfor]
Upper Bound Schema Extraction: Data Guides
Root: &r
Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
Bosses: &p1, &p4, &p6
Regulars: &p2, &p3, &p5, &p7, &p8
Company: &c
[Figure: data guide over these five classes with edges company, employee, manages, managedby, worksfor]
Schemas in XML
• Document Type Definition (DTD)
• XML Schema
• RDF Schema
Document Type Definition: DTD
• part of the original XML specification
• an XML document may have a DTD
• terminology for XML:
  – well-formed: if tags are correctly closed
  – valid: if it has a DTD and conforms to it
• validation is useful in data exchange
DTDs as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>
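Reading a DTD as a grammar means each element's content model is a regular expression over the sequence of its child tags. The sketch below hand-translates the four declarations into Python regular expressions and checks a parsed document against them. It is an illustration, not a DTD validator: the names `CONTENT_MODEL` and `conforms` are made up, and #PCDATA is approximated as "no child elements".

```python
import re
import xml.etree.ElementTree as ET

# Each child tag is written as "tag " so that tag names cannot run together.
CONTENT_MODEL = {
    "paper": r"(section )*",                    # (section*)
    "section": r"(title (section )*|text )",    # ((title,section*) | text)
    "title": r"",                               # (#PCDATA): no child elements
    "text": r"",                                # (#PCDATA): no child elements
}


def conforms(elem):
    """True if elem and all its descendants match their content models."""
    model = CONTENT_MODEL.get(elem.tag)
    if model is None:                           # undeclared element
        return False
    child_tags = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, child_tags) is None:
        return False
    return all(conforms(child) for child in elem)


valid = ET.fromstring(
    "<paper><section><text/></section>"
    "<section><title/><section><text/></section></section></paper>"
)
invalid = ET.fromstring("<paper><section><title/><text/></section></paper>")
```

The second document is rejected because a section may contain either a title followed by sections or a single text, never both.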
DTDs as Schemas
Not so well suited:
• impose unwanted constraints on order:
<!ELEMENT person (name,phone)>
• references cannot be constrained
• can be too vague: <!ELEMENT person ((name|phone|email)*)>
XML Schemas
• very recent proposal
• unifies previous schema proposals
• generalizes DTDs
• uses XML syntax
• two documents: structure and datatypes
  – http://www.w3.org/TR/xmlschema-1
  – http://www.w3.org/TR/xmlschema-2
XML Schemas
<elementType name=“paper”>
<sequence>
<elementTypeRef name=“title”/>
<elementTypeRef name=“author” minOccurs=“0”/>
<elementTypeRef name=“year”/>
<choice> <elementTypeRef name=“journal”/>
<elementTypeRef name=“conference”/>
</choice>
</sequence>
</elementType>
DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>
RDF Schemas
• http://www.w3.org/TR/PR-rdf-schema (3/99)
• object-oriented flavor
RDF Schemas
• recall RDF data:
  – resources
  – properties
• RDF schema:
  – classes
  – properties
[Figure: a statement drawn as an edge labeled predicate from a subject node to an object node]
RDF Schemas
Data:
<rdf:Description ID=“car001”>
<name> My Honda </name>
<miles> 50000 </miles>
<rdf:type resource=“#MotorVehicle”/>
</rdf:Description>
RDF Schemas
Schema:
<rdf:Description ID=“MotorVehicle”>
<rdf:type resource=“#Class”/>
<rdf:subClassOf resource=“#Resource”/>
</rdf:Description>
<rdf:Description ID=“Truck”>
<rdf:type resource=“#Class”/>
<rdf:subClassOf resource=“#MotorVehicle”/>
</rdf:Description>
RDF Schemas
[Figure: graph with nodes Truck, MotorVehicle, Class, and car001. Edge labels: type, subClassOf, name, miles. car001 has name “My Honda” and miles 50000.]
RDF Schemas
• different from object-oriented systems:
  – OO: define a class by set of properties
  – RDF: define a property in terms of its classes
• metadata in RDF:
  – an RDF schema is itself described as RDF data
Summary of Schemas
• in SS data:
  – graph theoretic
  – data and schema are decoupled
  – used in data processing
• in XML:
  – from grammar to object-oriented
  – schema wired with the data
  – emphasis on semantics for exchange
Part 4: Systems Issues
• Semistructured data and XML
• Query languages
• Schemas
• Systems issues
• Conclusions
Systems Issues
• servers
• mediators
Servers for Semistructured Data / XML
• storage
• index
• query evaluation [McHugh, Widom 1999]
XML Storage
• text file (XML)
• store in ternary relation
• use DTD to derive schema
• mine data to derive schema
• build special purpose repository (Lore)
XML Storage: Text File
• advantages
  – simple
  – less space than one thinks
  – reasonable clustering
• disadvantages
  – no updates
  – require special purpose query processor
Store XML in Ternary Relation
[Florescu, Kossman 1999]
[Figure: data graph. &o1 has a paper edge to &o2; &o2 has title, author, author, and year edges to &o3, &o4, &o5, and &o6, holding “The Calculus”, “…”, “…”, and “1986”.]
Ref
Source   Label    Dest
&o1      paper    &o2
&o2      title    &o3
&o2      author   &o4
&o2      author   &o5
&o2      year     &o6
Val
Node   Value
&o3    The Calculus
&o4    …
&o5    …
&o6    1986
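With this layout a path expression becomes a chain of self-joins on Ref, one per edge, finished by a join with Val. A small sketch using Python's built-in `sqlite3` follows; the SQL column names are assumptions, and the two author values, which are elided in the slide, are left out rather than invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ref (source TEXT, label TEXT, dest TEXT);
CREATE TABLE Val (node TEXT, value TEXT);
""")
conn.executemany("INSERT INTO Ref VALUES (?, ?, ?)", [
    ("&o1", "paper", "&o2"),
    ("&o2", "title", "&o3"),
    ("&o2", "author", "&o4"),
    ("&o2", "author", "&o5"),
    ("&o2", "year", "&o6"),
])
conn.executemany("INSERT INTO Val VALUES (?, ?)", [
    ("&o3", "The Calculus"),
    ("&o6", "1986"),
])

# The path paper.title: one Ref alias per edge, then Val for the leaf value.
titles = [row[0] for row in conn.execute("""
    SELECT v.value
    FROM Ref p
    JOIN Ref t ON t.source = p.dest
    JOIN Val v ON v.node = t.dest
    WHERE p.label = 'paper' AND t.label = 'title'
""")]
```

The number of joins grows with the length of the path, which is the main cost of this otherwise schema-independent layout.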
Use DTD to derive Schema
• DTD:
<!ELEMENT employee (name, address, project*)>
<!ELEMENT address (street, city, state, zip)>
• ODMG classes:
class Employee public type tuple (name:string, address:Address, project:List(Project))
class Address public type tuple (street:string, …)
• [Christophides et al. 1994, Shanmugasundaram et al. 1999]
Mine Data to Derive Schema
[Figure: data graph of paper nodes with author (fn, ln), title, and year edges]
author   title
X        X
fn1   ln1   fn2   ln2   title   year
X     X     X     X     X       -
X     X     -     -     X       X
X     X     -     -     X       -
Paper1
Paper2
[Deutsch et al. 1999]
Indexing Semistructured Data
• coercions: 1995 vs. “1995”
• regular path expressions
  – data guides [Goldman, Widom, 1997]
  – T-indexes [Milo, Suciu, 1999]
Indexing All Paths in the Data
[Figure: three panels. Semistructured Data: root 1 with t edges to nodes 2 through 6, which have a, b, c, and d edges to nodes 7 through 13. Data Guide: the same labels over sets of data nodes, one set per distinct label path. T-Index: a smaller graph over groups of data nodes.]
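A data guide can be obtained by determinizing the data graph: each guide node stands for the set of data nodes reached by one label path, exactly as in the subset construction for automata. The sketch below assumes a dict-based graph encoding and uses a small made-up graph, not the one in the figure; `data_guide` is a hypothetical name.

```python
def data_guide(graph, root):
    """Subset construction: map each reachable node set to its labeled successors."""
    start = frozenset([root])
    guide = {}
    todo = [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        by_label = {}
        for node in state:
            for label, target in graph.get(node, []):
                by_label.setdefault(label, set()).add(target)
        guide[state] = {lab: frozenset(ts) for lab, ts in by_label.items()}
        todo.extend(guide[state].values())
    return guide


# Hypothetical data: node 1 has two t-children with overlapping labels.
graph = {
    1: [("t", 2), ("t", 3)],
    2: [("a", 7)],
    3: [("a", 8), ("b", 9)],
}
guide = data_guide(graph, 1)
```

Every label path of the data appears exactly once in the guide, so a path query can be answered by following a single path in the guide and reading off the node set. In the worst case the construction is exponential in the size of the data, as with any determinization.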
Mediators for Semistructured Data / XML
• XML = virtual view of Relational/OO/OR sources
• mediator = translation, integration
• issues:
  – query composition and rewriting [Papakonstantinou et al. 1996]
  – limited source capabilities [Yerneni et al. 1999]
Example: An XML Mediator
• relational database:
Store
sid   name
…     …
…     …
SB
sid   bid
…     …
…     …
Book
bid   title
…     …
…     …
• virtual XML view:
<store> <name> n1 </name>
        <book> ... </book>
        <book> ... </book>
        ...
</store>
<store> <name> n2 </name>
        <book> ... </book>
        <book> ... </book>
        …
</store>
Example: An XML Mediator
• specify mediator declaratively (a view):
from Store, SB, Book
where Store.sid=SB.sid and SB.bid=Book.bid
construct <store ID=f(Store.sid)>
            <name> Store.name </name>
            <book> Book.title </book>
          </store>
Example: An XML Mediator
• users ask XML-QL queries:
  – find stores that sell “The Calculus”
where <store> <name> $n </name> <book> The Calculus </book> </store>
construct <result> $n </result>
Example: An XML Mediator
• system composes query with view:
from Store, SB, Book
where Store.sid=SB.sid and SB.bid=Book.bid and Book.title=“The Calculus”
construct <result> Store.name </result>
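The composed query touches only the relational source, so it can run as plain SQL with the XML construction applied to the answer rows. A sketch with Python's `sqlite3` follows; the table contents, the SQL column types, and the function name `stores_selling` are made up for illustration.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store (sid INTEGER, name TEXT);
CREATE TABLE SB (sid INTEGER, bid INTEGER);
CREATE TABLE Book (bid INTEGER, title TEXT);
INSERT INTO Store VALUES (1, 'n1'), (2, 'n2');
INSERT INTO Book VALUES (10, 'The Calculus'), (11, 'Data on the Web');
INSERT INTO SB VALUES (1, 10), (1, 11), (2, 11);
""")


def stores_selling(title):
    """Run the composed query and wrap each answer row in a <result> element."""
    rows = conn.execute(
        "SELECT Store.name FROM Store, SB, Book "
        "WHERE Store.sid = SB.sid AND SB.bid = Book.bid AND Book.title = ? "
        "ORDER BY Store.name",
        (title,),
    )
    return ["<result>%s</result>" % escape(name) for (name,) in rows]


answer = stores_selling("The Calculus")
```

The virtual store view is never materialized: the pattern on the view is pushed down into the selection on Book.title.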
Summary of Systems
• unclear today how XML will be used:
  – materialized ? Need servers
  – virtual ? Need mediators
• most work is still ahead
Part 5: Conclusions
• Semistructured data and XML
• Query languages
• Schemas
• Systems issues
• Conclusions
Summary
• XML = what is out there
• semistructured data = what we can process
• paradigm shift, for both Web and db
• covered in tutorial:
  – data models, queries, schemas
Current and Future Technologies
• Web applications possible today:
  – export relational data to XML (e.g. Oracle)
  – import XML directly into applications
• Web applications in the future:
  – mediator technology (XML view)
  – store/process native XML data
  – compress XML
  – mine/analyze XML
Why This Is Cool for Database Researchers
• put to work what you teach in CS101 !
  – tree traversals (structural recursion, XSL)
  – automata theory (DTD’s, path expressions)
  – graph theory (simulation)
• adapt old DB tricks to new kind of data
• save the trees: from fax to XML
The End
Further Readings
www.w3.org/XML
www-db.stanford.edu/~widom
www-rocq.inria.fr/~abiteboul
db.cis.upenn.edu
www.research.att.com/~suciu
Abiteboul, Buneman, Suciu
Data on the Web: From Relational to Semistructured to XML
Morgan Kaufmann, 1999 (appears in October)