Suciua/Ramakrishnan/ Gehrke/Borgida 1
XML and Semi-structured data
What is XML?
• Text annotation/markup language (eXtensible Markup Language)
• Think of markup as meta-data (data about data)
• Resulting document is structured like a tree

<BOOK genre="Science" format="Hardcover">
  <AUTHOR>
    <FIRSTNAME>Rich</FIRSTNAME>
    <LASTNAME>Feynman</LASTNAME>
  </AUTHOR>
  <TITLE>The Character of Physical Law</TITLE>
  <PUBLISHED>1980</PUBLISHED>
</BOOK>
HTML vs XML

HTML:
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

XML:
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>

• Syntax like HTML, but the set of tags is not fixed
• Separates web page content (specified in XML) from display format (specified in a different language, XSL); plus language(s) for transforming document structure (XSLT, XQuery)
Success of XML
• Ability to represent "varying format data" (semi-structured)
• Ability to introduce new tags led to the publication of "standards" for data exchange in many sub-areas
• Example: Chemical Markup Language

<molecule>
  <weight>234.5</weight>
  <Spectra>…</Spectra>
  <Figures>…</Figures>
</molecule>
Semi-structured Data Management
[Diagram: an XML Data Manager with TELL, ASK, and DECLARE/CONSTRAIN interfaces]
Ltell = XML document
Lquestion = XPath (, XQuery)
Lanswer = XML document
Ldeclare = DTD (, XML Schema)
The syntax of XML
(the Silberschatz or Kifer textbook covers this better)
Sample XML document
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML Terminology
• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested and bottom out at data values (hence form a tree)
• empty element: <red></red>, abbreviated <red/>
• an XML document must be a single element ("root")
  (a well-formed XML document has properly nested, matching tags)
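The well-formedness rules above can be checked mechanically. The following is a minimal sketch in Python using the standard-library parser; the helper name is_well_formed and the three sample strings are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as one root element with properly nested tags."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample inputs
good = "<book><title>Foundations</title><red/></book>"
bad_nesting = "<book><title>Foundations</book></title>"  # end tags cross
two_roots = "<book/><book/>"                             # no single root element

print(is_well_formed(good))
print(is_well_formed(bad_nesting))
print(is_well_formed(two_roots))
```

Only the first string should be accepted: the second closes tags in the wrong order, and the third has two top-level elements.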
More XML: Attributes
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  …
  <year> 1995 </year>
</book>

• Attributes are alternative ways to represent data
• At most one occurrence of attribute per element
• Attribute value is a single string
  (Multiple values for an attribute are separated by blank/tab)
• Good attributes are meta-data: status="outOfPrint", language="English", categ="fiction"
Example doc for XPath Queries

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name>
             <last-name> Hull </last-name>
    </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
The ordered tree view of an XML document
(draw on board)

bib                                  [element node]
  book
    publisher
      "Addison-Wesley"               [text node]
    author
      "Abiteboul"
    . . .
  book
    price = "55"                     [attribute node]
    publisher
      "Freeman"
    author
      "Ullman"
    title
      "DB Pples"
    year
      "1998"

Children nodes are ordered!
* XML Namespaces
• A way to share tags, attributes, ...
• http://www.w3.org/TR/REC-xml-names
• name ::= localtag OR prefix:tag

<book xmlns:isbn="www.isbn-org.org/def">
  <title> … </title>
  <number> 15 </number>
  <isbn:number> …. </isbn:number>
</book>

• <number>: a local "number" tag, meaning how many books are in stock
• <isbn:number>: the "number" tag of ISBN, which is the unique number assigned to each book edition by an international agency
One way to represent Relational Data in XML
Table persons:
name   phone
John   3634
Sue    6343
Dick   6363

XML:
<persons>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</persons>

Tree: persons has three row children; each row has a name child and a phone child holding the text values.
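The row-per-tuple encoding can be produced mechanically from a table. This is a sketch under the assumption that every column value is a string; the function name table_to_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

def table_to_xml(table_name, columns, rows):
    """Encode a relation as <table><row><col>value</col>...</row>...</table>."""
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = value
    return root

# The persons table from the slide
rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]
persons = table_to_xml("persons", ["name", "phone"], rows)
xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```

Note that the column names, which are schema information in the relational model, are repeated as tags in every row of the XML encoding.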
XML vs Relational Data
• XML is self-describing
• schema elements become part of the data (tags):
  person(name,phone) vs <person> <name> </name> <phone> </phone> ...
• so XML is more flexible: the data does not have to slavishly follow a single flat schema

SEMISTRUCTURED DATA
Semistructured Data
• fields may be missing
  <person> <name>bob</name> <phone>5-4544</phone> </person>
  <person> <name>anna</name> </person>
• fields may be repeated
  <person> <name>bob</name> <phone>5-4544</phone> <phone>3-5436</phone> </person>
• fields may be nested
  <person> <name> <first>bob</first> <last>jones</last> </name> </person>
• fields may be heterogeneous
  <name> <first>bob</first> <last>jones</last> </name>
  <name> <first>bob</first> <mid>t</mid> <last>jones</last> </name>
• collections may be heterogeneous
  <persons>
    <teacher> ... </teacher>
    <student> ... </student>
  </persons>
DTD – Document Type Definition
• A DTD is a schema/grammar for XML data
• A DTD says what elements and attributes are required or optional
  – Defines the formal structure of the doc
DTD – An Example

<!ELEMENT Basket (Cherry+, (Apple | Orange)*) >
<!ELEMENT Cherry EMPTY>
<!ATTLIST Cherry flavor CDATA #REQUIRED>
<!ELEMENT Apple EMPTY>
<!ATTLIST Apple color CDATA #REQUIRED>
<!ELEMENT Orange EMPTY>
<!ATTLIST Orange location CDATA "Florida">
-------------------------------------------------------------------------
2 documents:

<Basket> <Apple/> <Cherry flavor="good"/> <Orange/> </Basket>

<Basket> <Cherry flavor="good"/> <Apple color="red"/> <Apple color="green"/> </Basket>
DTD: !ELEMENT
<!ELEMENT Basket (Cherry+, (Apple | Orange)*) >
          (name)  (children)

• !ELEMENT declares an element name and what children elements it should have
• Content types:
  – Other elements
  – #PCDATA (parsed character data)
  – etc.
DTD: !ELEMENT in general
• A regular expression describing the content has the following structure:
  – exp1, exp2, exp3, …, expk: An ordered list of regular expressions
  – exp*: An optional expression with zero or more occurrences
  – exp+: An expression with one or more occurrences
  – exp?: An optional expression with zero or one occurrence
  – exp1 | exp2 | … | expk: A set of alternative expressions
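Because a content model is a regular expression over child element names, it can be approximated with an ordinary regex. The sketch below hand-translates the Basket model (Cherry+, (Apple | Orange)*) and checks only the order of children; it does not check attributes and is not a DTD validator. The names BASKET_MODEL and children_match are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (Cherry+, (Apple | Orange)*).
# Each child name is followed by a space so adjacent names cannot run together.
BASKET_MODEL = re.compile(r"(Cherry )+((Apple|Orange) )*")

def children_match(element, model):
    """True if the sequence of child tag names matches the content model."""
    child_names = "".join(child.tag + " " for child in element)
    return model.fullmatch(child_names) is not None

doc1 = ET.fromstring('<Basket><Apple/><Cherry flavor="good"/><Orange/></Basket>')
doc2 = ET.fromstring(
    '<Basket><Cherry flavor="good"/><Apple color="red"/><Apple color="green"/></Basket>'
)

print(children_match(doc1, BASKET_MODEL))  # Apple comes before the first Cherry
print(children_match(doc2, BASKET_MODEL))
```

Under this check the first Basket document fails, since the model requires at least one Cherry before any Apple or Orange, while the second passes.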
DTD - !ATTLIST
<!ATTLIST Cherry flavor CDATA #REQUIRED>
          (element) (attribute) (type) (flag)

• !ATTLIST defines a list of attributes for an element
• * Attributes can be of different types, can be required or not required, and they can have default values.
Attribute types in DTDs
Types include:
• CDATA = string
• (Monday | Wednesday | Friday) = enumeration
[* We will not discuss these in the course]
• ID = key
• IDREF = foreign key
• IDREFS = foreign keys separated by space
(* Relationship between DTD & Extended BNF)

<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>

Equivalent extended BNF:
paper ::= section*
section ::= ( title section* ) | text
title ::= string
text ::= string
* DTDs as Grammars
• A DTD = a grammar
• A valid XML document = a parse tree for that grammar

* Problem with DTD: not in XML notation!
There is a more advanced notation: XML Schema (not in this course!)
(* XML Schema )
DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>

becomes

<xsd:element name="paper" type="papertype"/>
<xsd:complexType name="papertype">
  <xsd:sequence>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="author" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="year"/>
    <xsd:choice>
      <xsd:element name="journal"/>
      <xsd:element name="conference"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:complexType>
Querying XML Documents
(based on notes by D. Suciu, UofW)
XPath
• http://www.w3.org/TR/xpath/
Corresponding tree for XPath (a bit different, to make things resemble UNIX nested file paths)
/                                    [the root]
  bib                                [the root element; element node]
    book
      publisher
        "Addison-Wesley"             [text node]
      author
        "Abiteboul"
      . . .
    book
      price = "55"                   [attribute node]
      publisher
        "Freeman"
      author
        "Ullman"
      title
        "DB Pples"
      year
        "1998"
XPath: Simple Expressions (matching element nodes)
/bib/book/year
Result: <year> 1995 </year> <year> 1998 </year>

/bib/paper/year
Result: empty (there were no papers)
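The two simple paths can be tried with Python's standard-library ElementTree, which supports a limited XPath subset. This is a sketch, not a full XPath engine: ElementTree paths are written relative to the root element, so /bib/book/year becomes book/year evaluated at bib. The BIB string is the example document with whitespace trimmed.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""
bib = ET.fromstring(BIB.strip())  # bib is the root element

# /bib/book/year
years = [y.text for y in bib.findall("book/year")]
# /bib/paper/year: no paper elements exist, so nothing is selected
paper_years = bib.findall("paper/year")

print(years)
print(paper_years)
```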
XPath
One way to think about it is that each initial part of a path “marks” certain nodes in the tree as being acceptable – part of the current collection. The next step in the path unmarks these and marks as acceptable only those children of previously marked nodes which pass some additional test.
(* In the full XPath, one can go from marked nodes to their descendants, parents, ancestors, left and right siblings.)
XPath: Restricted Kleene Closure
//author
Result:
<author> Serge Abiteboul </author>
<author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
<author> Victor Vianu </author>
<author> Jeffrey D. Ullman </author>

/bib//first-name
Result: <first-name> Rick </first-name>
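The descendant step // can be sketched with ElementTree as well, where it is written .// relative to a context element. The document below is the example bib document with whitespace trimmed; joining the text with " ".join(...) is a choice made here for printing, not part of XPath.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""
bib = ET.fromstring(BIB.strip())

# //author: author elements at any depth, in document order
authors = bib.findall(".//author")
author_names = [" ".join(a.itertext()) for a in authors]

# /bib//first-name: first-name elements anywhere below bib
first_names = [e.text for e in bib.findall(".//first-name")]

print(author_names)
print(first_names)
```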
XPath: Wildcard

* matches any element

/*/*/author    ("authors at 3rd level")

//author/*
Result: <first-name> Rick </first-name> <last-name> Hull </last-name>
XPath: matching Text Nodes

/bib/book/author/text()
Result: Serge Abiteboul Victor Vianu Jeffrey D. Ullman
Rick Hull doesn't appear because his author element contains first-name and last-name elements rather than text

(* Other functions in XPath:
– text() = matches a text value
– node() = matches any node (= * or @* or text())
– name() = returns the name of the current tag)

/bib/book/*/name()
Result: publisher author author author title year publisher author title year
XPath: matching Attribute Nodes

/bib/book/@price
Result: price="55"
@price means that there is a price attribute with a value present
XPath: Qualifiers

//author[first-name]
Result: <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
[first-name] ≡ "has first-name element"
XPath: More Qualifiers

/bib/book/author[first-name][address[.//zip][city]]/lastname
Result: <lastname> … </lastname>
"lastname of author (which has first-name and address (which (has zip below it) and has city))"
XPath: Qualifiers with conditions on values
/bib/book[@price < "60"]
/bib/book[author/@age < "25"]
/bib/book[author/text()]
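Qualifiers can be sketched in Python too. ElementTree supports existence tests such as [first-name] and [@price], but not comparisons such as @price < "60", so in this sketch the value condition is applied in ordinary Python after selection. The document is the example bib document with whitespace trimmed.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""
bib = ET.fromstring(BIB.strip())

# //author[first-name]: authors that have a first-name child element
structured = bib.findall(".//author[first-name]")
structured_last_names = [a.findtext("last-name") for a in structured]

# /bib/book[@price < "60"]: select books with a price attribute,
# then apply the numeric comparison in Python
cheap_titles = [
    b.findtext("title")
    for b in bib.findall("book[@price]")
    if float(b.get("price")) < 60
]

print(structured_last_names)
print(cheap_titles)
```

The first book has no price attribute, so it is filtered out by the existence test before any comparison takes place.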
XPath: Summary
bib                   matches a bib element
*                     matches any element
/                     matches the root
/bib                  matches a bib element under root
bib/paper             matches a paper in bib
bib//paper            matches a paper in bib, at any depth
//paper               matches a paper at any depth
//paper/..            matches the parent of paper at any depth
paper | book          matches a paper or a book
@price                matches a price attribute
bib/book/@price       matches price attribute in book, in bib
bib/book[@price<"55"]/author/lastname   matches…
XQuery
• http://www.w3.org/TR/xquery/
FLWR ("Flower") Expressions

for ... let ... for ... let ... where ... return ...
XQuery
“Find all book titles published after 1995”:
for $x in document("bib.xml")/bib/book
where $x/year > 1995
return $x/title
Result: <title>Principles of Database and Knowledge Base Systems</title>
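For readers more familiar with general-purpose languages, the for/where/return query above reads much like a list comprehension. The sketch below is an analogy in Python over the example bib document (whitespace trimmed and authors shortened), not an XQuery implementation.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""
bib = ET.fromstring(BIB.strip())

titles = [
    x.findtext("title")                # return $x/title
    for x in bib.findall("book")       # for $x in .../bib/book
    if int(x.findtext("year")) > 1995  # where $x/year > 1995
]
print(titles)
```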
XQuery: nested queries
"For each author of a book by AW, list all books she published":

for $a in document("bib.xml")/bib/book[publisher="AW"]/author
return <ans>
         { $a,
           for $t in document("bib.xml")/bib/book[author=$a]/title
           return $t }
       </ans>

Beware of forgetting the { and }; they mean "evaluate nested expression"
XQuery

Result:
<ans>
  <author>Serge Abiteboul</author>
  <title>Foundations of Databases</title>
</ans>
<ans>
  <author> <first-name>Rick</first-name> <last-name>Hull</last-name> </author>
  <title>Foundations of Databases</title>
</ans>
<ans>
  <author>Victor Vianu</author>
  <title>Foundations of Databases</title>
</ans>
XQuery: let expressions
• for $x in expr -- binds $x in turn to each value in the list expr
• let $x := expr -- binds $x once to the entire list expr
  – Useful for common subexpressions and for aggregations
XQuery
count = a (aggregate) function that returns the number of elements in its argument set
<big_publishers>
  { for $p in document("bib.xml")//publisher
    let $b := document("bib.xml")//book[publisher = $p]
    where count($b) >= 1
    return $p }
</big_publishers>
XQuery
"Find books whose price is larger than average":
let $a := avg(document("bib.xml")/bib/book/price)
for $b in document("bib.xml")/bib/book
where $b/price > $a
return $b
If-Then-Else

for $h in //catalogue
return <catalogue>
         { $h/title,
           if ($h/@type = "Journal")
           then $h/editor
           else $h/author }
       </catalogue>
Existential Quantifiers
for $b in //book
where some $p in $b//para satisfies
contains($p, "sailing")
and contains($p, "windsurfing")
return $b/title
"Books which have some paragraph containing both the words sailing and windsurfing"
Universal Quantifiers
for $b in //book
where every $p in $b//para satisfies
contains($p, "sailing")
return $b/title
"Books in which all paragraphs contain the word sailing"
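The quantifiers some ... satisfies and every ... satisfies correspond closely to Python's any() and all(). The sketch below evaluates both slides' conditions over a small hypothetical library document; the document, its titles, and the helper text_of are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not from the slides
LIBRARY = """
<library>
  <book><title>Coastal Sports</title>
    <para>Notes on sailing and windsurfing.</para>
    <para>A chapter on cooking.</para>
  </book>
  <book><title>Under Sail</title>
    <para>Basics of sailing.</para>
    <para>More sailing practice.</para>
  </book>
</library>
"""
library = ET.fromstring(LIBRARY.strip())

def text_of(node):
    """All text inside node, including text of nested elements."""
    return "".join(node.itertext())

# some $p in $b//para satisfies contains(...) and contains(...)
some_titles = [
    b.findtext("title")
    for b in library.iter("book")
    if any("sailing" in text_of(p) and "windsurfing" in text_of(p)
           for p in b.iter("para"))
]

# every $p in $b//para satisfies contains($p, "sailing")
every_titles = [
    b.findtext("title")
    for b in library.iter("book")
    if all("sailing" in text_of(p) for p in b.iter("para"))
]

print(some_titles)
print(every_titles)
```

As with all(), the universal form is vacuously true for a book that has no paragraphs at all.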
Try out queries at http://www.w3.org/TR/xquery-use-cases/
e.g., Flattening
• "Flatten" the authors, i.e. return a list of (author, title) pairs

for $b in document("bib.xml")/bib/book,
    $x in $b/title/text(),
    $y in $b/author/text()
return <answer>
         <title> { $x } </title>
         <author> { $y } </author>
       </answer>

Result:
<answer> <title> abc </title> <author> efg </author> </answer>
<answer> <title> abc </title> <author> hkj </author> </answer>
e.g., Re-grouping
• "For each author, return all titles of her/his books"

for $b in document("bib.xml")/bib,
    $x in $b/book/author
return
  <answer>
    <author> { $x } </author>
    { for $y in $b/book[author=$x]/title
      return $y }
  </answer>

Result:
<answer>
  <author> efg </author>
  <title> abc </title> <title> klm </title>
  . . . .
</answer>

What about duplicate authors?
• Same, but eliminate duplicate authors:

for $b in document("bib.xml")/bib
let $a := distinct-values($b/book/author/text())
for $x in $a
return
  <answer>
    <author> { $x } </author>
    { for $y in $b/book[author=$x]/title
      return $y }
  </answer>

distinct-values eliminates duplicates (but must be applied to a collection of text values, not of elements)
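The same regrouping can be sketched in Python: collect the distinct author strings first, then gather the titles for each. The tiny document reuses the placeholder values abc, klm, efg, and hkj from the results above, but its exact shape is an assumption; dict.fromkeys is used here as an order-preserving stand-in for distinct-values.

```python
import xml.etree.ElementTree as ET

# Hypothetical data built from the placeholder values used in the results above
BIB = """
<bib>
  <book><author>efg</author><author>hkj</author><title>abc</title></book>
  <book><author>efg</author><title>klm</title></book>
</bib>
"""
bib = ET.fromstring(BIB.strip())

# distinct-values($b/book/author/text()), keeping first-seen order
distinct_authors = list(dict.fromkeys(a.text for a in bib.findall("book/author")))

# for each distinct author, the titles of the books listing that author
titles_by_author = {
    name: [
        book.findtext("title")
        for book in bib.findall("book")
        if name in [a.text for a in book.findall("author")]
    ]
    for name in distinct_authors
}

print(distinct_authors)
print(titles_by_author)
```

Without the deduplication step, efg would produce two identical answer groups, one per occurrence of the author element.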
SQL and XQuery Side-by-side
Product(pid, name, maker, price)
"Find all product names, prices"

SQL:
SELECT x.name, x.price
FROM Product x

XML data (db.xml):
<db> <Product> <row> <pid> 1234 </pid> <name> bulb </name> <maker> … </maker> … </row> … </Product> … </db>

XQuery:
for $x in document("db.xml")/db/Product/row
return <answer> { $x/name, $x/price } </answer>
XQuery's Answer

<answer> <name> abc </name> <price> 7 </price> </answer>
<answer> <name> def </name> <price> 23 </price> </answer>
. . . .

Notice: this is NOT a well-formed document! (WHY ???)
Query Producing a Well-Formed Answer
<myQuery>
  { for $x in document("db.xml")/db/Product/row
    return <row>
             { $x/name, $x/price }
           </row>
  }
</myQuery>
XQuery's Answer

<myQuery>
  <row> <name> abc </name> <price> 7 </price> </row>
  <row> <name> def </name> <price> 23 </price> </row>
  . . . .
</myQuery>

Now it is well-formed!
SQL and XQuery Side-by-side
Product(pid, name, maker, price)
Company(cid, name, city, revenues)

"Find all products made in Seattle"

SQL:
SELECT x.name
FROM Product x, Company y
WHERE x.maker = y.cid
  AND y.city = 'Seattle'

XQuery:
for $x in $db/Product/row,
    $y in $db/Company/row
where $x/maker = $y/cid
  and $y/city = "Seattle"
return $x/name

Compact XQuery:
for $y in /db/Company/row[city="Seattle"],
    $x in /db/Product/row[maker=$y/cid]
return $x/name
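The compact form of the join can be mirrored in Python as a nested comprehension: filter companies first, then look up their products. The db document below is hypothetical sample data following the row-per-tuple encoding; the product and company names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample database in the row-per-tuple encoding
DB = """
<db>
  <Product>
    <row><pid>1</pid><name>bulb</name><maker>c1</maker><price>7</price></row>
    <row><pid>2</pid><name>lamp</name><maker>c2</maker><price>23</price></row>
  </Product>
  <Company>
    <row><cid>c1</cid><name>Acme</name><city>Seattle</city></row>
    <row><cid>c2</cid><name>Bolt</name><city>Boston</city></row>
  </Company>
</db>
"""
db = ET.fromstring(DB.strip())

seattle_products = [
    x.findtext("name")                          # return $x/name
    for y in db.findall("Company/row")          # for $y in /db/Company/row
    if y.findtext("city") == "Seattle"          #   [city="Seattle"]
    for x in db.findall("Product/row")          # $x in /db/Product/row
    if x.findtext("maker") == y.findtext("cid") #   [maker=$y/cid]
]
print(seattle_products)
```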
<product>
  <row> <pid> 123 </pid> <name> abc </name> <maker> efg </maker> </row>
  <row> …. </row>
  …
</product>
<product>
  . . .
</product>
. . . .
SQL and XQuery Side-by-side
For each company with revenues < 1M, count its products over $100

SQL:
SELECT c.name, count(*)
FROM Product p, Company c
WHERE p.price > 100 AND p.maker = c.cid AND c.revenue < 1000000
GROUP BY c.cid, c.name

XQuery:
for $r in document("db.xml")/db,
    $c in $r/Company/row[revenue<1000000]
return
  <proudCompany>
    <companyName> { $c/name } </companyName>
    <numberOfExpensiveProducts>
      { count($r/Product/row[maker=$c/cid][price>100]) }
    </numberOfExpensiveProducts>
  </proudCompany>
SQL and XQuery Side-by-side
Find companies with more than 30 products, and their average price

SQL:
SELECT y.name, avg(x.price)
FROM Product x, Company y
WHERE x.maker = y.cid
GROUP BY y.cid, y.name
HAVING count(*) > 30

XQuery:
for $r in document("db.xml")/db,
    $y in $r/Company/row
let $p := $r/Product/row[maker=$y/cid]
where count($p) > 30
return
  <theCompany>
    <companyName> { $y/name } </companyName>
    <avgPrice> { avg($p/price) } </avgPrice>
  </theCompany>
An element