Database Management Systems, R. Ramakrishnan 1
Introduction to Semistructured Data and XML
Chapter 27
How the Web is Today
HTML documents
• often generated by applications
• consumed by humans only
• easy access: across platforms, across organizations
No application interoperability:
• HTML not understood by applications
• Database technology: client-server
New Universal Data Exchange Format: XML
A recommendation from the W3C
XML = data
XML generated by applications
XML consumed by applications
Easy access: across platforms, organizations
Paradigm Shift on the Web
From documents (HTML) to data (XML)
From information retrieval to data management
For databases, also a paradigm shift:
• from relational model to semistructured data
• from data processing to data/query translation
• from storage to transport
HTML
HTML is widely used for formatting and structuring Web documents.
Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
Easy to learn, but does not convey structure and meaning of data in the Web pages.
Fixed tag set.
<HTML>
<HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
<BODY>
<H1>Introduction</H1>
<IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
</BODY>
</HTML>
[Figure labels on the example: opening tag, text (PCDATA), closing tag, "bachelor" tag (a tag with no closing partner, here <IMG>), attribute name, attribute value]
Semistructured Data
1. Information integration: important new application that motivates what follows.
2. Semistructured data: a new data model designed to cope with problems of information integration.
3. XML (Extensible Markup Language) : a new Web standard that is essentially semistructured data.
4. XQuery: an emerging standard query language for XML data.
Information Integration
Problem: related data exists in many places. The sources talk about the same things, but differ in model, schema, conventions (e.g., terminology).
Example: In the real world, every bar has its own database.
Some may have relations like beer-price; others have a Microsoft Word file from which the menu is printed.
Some keep phones of manufacturers but not addresses.
Some distinguish beers and ales; others do not.
The Semistructured Data Model
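Object Exchange Model (OEM)
[Figure: an OEM graph of a bibliography. The edge Bib enters the root object &o1. Edges labeled paper and book lead from the root to the complex objects &o12, &o24 and &o29, which are linked to one another by references edges. Below them, edges labeled author, title, year, http, publisher and page lead to further objects (&o43, &96, &243, &206, &25), and edges labeled firstname, lastname, first and last end in atomic objects such as "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122 and 133. Legend: complex object, atomic object.]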
Characteristics of Semistructured Data
Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections
Self-describing, irregular data, no a priori structure
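As a rough illustration of these characteristics, semistructured records can be mimicked with nested Python dictionaries. The bar names, fields and values below are invented for this sketch (loosely following the bar example from the information integration slide); they are assumptions, not data from the course.

```python
# Hypothetical records (invented for illustration): each bar describes
# itself with its own labels, so there is no schema fixed in advance.
bars = [
    {"name": "Joe's Bar", "beer": [{"name": "Bud", "price": 2.50}]},
    {
        "name": "Sue's Pub",
        "phone": "555-1234",                      # additional attribute
        "beer": [{"name": "Export"}],             # missing attribute: price
        "ale": [{"name": "Old Ale", "price": "3 dollars"}],  # different type
    },
]


def labels(record):
    """Return the set of attribute labels a record carries."""
    return set(record.keys())


def collect(records, label):
    """Gather every value found under `label`, skipping records without it."""
    found = []
    for record in records:
        value = record.get(label)
        if value is None:
            continue
        found.extend(value if isinstance(value, list) else [value])
    return found


for bar in bars:
    print(bar["name"], sorted(labels(bar)))
print(collect(bars, "phone"))
```

Because every record carries its own labels, a consumer has to probe for a label (as `collect` does) instead of relying on a fixed column list.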
Comparison with Relational Data
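Semistructured representation:
{ row: { name: "John", phone: 3634 },
  row: { name: "Sue", phone: 6343 },
  row: { name: "Dick", phone: 6363 }
}
Relational table:
name   phone
John   3634
Sue    6343
Dick   6363
[Figure: the same data as a tree. The root has three row edges; each row node has a name edge and a phone edge leading to the atomic values "John" and 3634, "Sue" and 6343, "Dick" and 6363.]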
XML (Extensible Markup Language)
A W3C standard to complement HTML
Origins: structured text, SGML
• Large-scale electronic publishing
• Data exchange on the web
Motivation:
• HTML describes presentation
• XML describes content
From HTML to XML
HTML describes the presentation
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
XML
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML describes the content
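To make "describes the content" concrete, here is a minimal sketch of an application reading the bibliography by tag name, using Python's standard `xml.etree.ElementTree` module. The helper name `authors_by_title` is made up for this example, and the title is spelled out as in the HTML version of the bibliography.

```python
import xml.etree.ElementTree as ET

BIB = """
<bibliography>
  <book>
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>
"""


def authors_by_title(xml_text):
    """Map each book title to its list of authors, using tag names only."""
    root = ET.fromstring(xml_text)
    result = {}
    for book in root.findall("book"):
        title = book.findtext("title").strip()
        result[title] = [a.text.strip() for a in book.findall("author")]
    return result


print(authors_by_title(BIB))
```

The same extraction from the HTML version would have to guess which italic run is a title and which comma-separated words are authors.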
Why are we DB’ers interested?
It's data. That's us.
Database issues:
• How are we going to model XML? (graphs)
• How are we going to query XML? (XQuery)
• How are we going to store XML? (in a relational database? object-oriented? native?)
• How are we going to process XML efficiently? (many interesting research questions!)
XML Terminology
Tags: book, title, author, …
• start tag: <book>, end tag: </book>
Elements: <book>…</book>, <author>…</author>
• elements can be nested
• empty element: <red></red> (can be abbreviated <red/>)
XML document: has a single root element
Well-formed XML document: has matching tags
Valid XML document: conforms to a schema
Well-Formed XML
1. Declaration = <? ... ?> .
• Normal declaration is <?xml version="1.0" standalone="yes"?>
• standalone="yes" means that the document does not rely on an external DTD.
2. Root tag surrounds the entire balance of the document. <FOO> is balanced by </FOO>, as in HTML.
3. Any balanced structure of tags OK.
• Unlike HTML, where tags such as <P> may go unclosed, XML requires every tag to be balanced; the only shorthand is the empty-element tag, e.g., <FOO/>.
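The rules above can be checked mechanically: any XML parser rejects input that is not well-formed. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the helper name `is_well_formed` and the sample strings are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Declaration, one root element, balanced tags.
print(is_well_formed('<?xml version="1.0" standalone="yes"?><FOO><BAR/></FOO>'))
# BAR is opened but never closed.
print(is_well_formed("<FOO><BAR></FOO>"))
# Two root elements instead of one.
print(is_well_formed("<FOO></FOO><FOO></FOO>"))
```

This only tests well-formedness; checking validity against a DTD needs a validating parser, which the standard library module used here does not provide.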
XML: An Example
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<BOOKLIST>
  <BOOK genre="Science" format="Hardcover">
    <AUTHOR>
      <FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME>
    </AUTHOR>
    <TITLE>The Character of Physical Law</TITLE>
    <PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR>
      <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME>
    </AUTHOR>
    <TITLE>Waiting for the Mahatma</TITLE>
    <PUBLISHED>1981</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR>
      <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME>
    </AUTHOR>
    <TITLE>The English Teacher</TITLE>
    <PUBLISHED>1980</PUBLISHED>
  </BOOK>
</BOOKLIST>
XML – Elements
<BOOK genre="Science" format="Hardcover">…</BOOK>
XML is case and space sensitive
Element opening and closing tag names must be identical
Opening tags: "<" + element name + ">"
Closing tags: "</" + element name + ">"
Empty elements have no data and no closing tag:
• They begin with a "<" and end with a "/>", e.g., <BOOK/>
[Figure labels on the example: open tag, element name, attribute, attribute value, data, closing tag]
XML – Attributes
<BOOK genre="Science" format="Hardcover">…</BOOK>
Attributes provide additional information for element tags.
There can be zero or more attributes in every element; each one has the form: attribute_name='attribute_value'
- By convention there is no space between the name and the "=" (XML does permit whitespace around it)
- Attribute values must be surrounded by " or ' characters
Multiple attributes are separated by white space (one or more spaces or tabs).
[Figure labels on the example: open tag, element name, attribute, attribute value, data, closing tag]
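A short sketch of how a program sees the element name and attributes of the BOOK example, using Python's standard `xml.etree.ElementTree`. The `price` attribute is deliberately absent and is only there to show a lookup with a default.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<BOOK genre="Science" format="Hardcover">'
    "<TITLE>The Character of Physical Law</TITLE>"
    "</BOOK>"
)

print(book.tag)                       # element name
print(book.attrib)                    # all attributes as a dictionary
print(book.get("genre"))              # a single attribute value
print(book.get("price", "unknown"))   # absent attribute, default returned
```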
Elements
The segment of an XML document between an opening and a corresponding closing tag is called an element.
<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <tel> (215) 898 4321 </tel>
  <email> [email protected] </email>
</person>
[Figure labels on the example: "element"; "element, a sub-element of" the enclosing element; "not an element"]
XML – Data and Comments
<BOOK genre="Science" format="Hardcover">…</BOOK>
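XML data is any information between an opening and closing tag
XML data must not contain a raw '<' or '&' character; write &lt; and &amp; instead ('>' is commonly escaped as &gt; as well)
Comments: <!-- comment -->
[Figure labels on the example: open tag, element name, attribute, attribute value, data, closing tag]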
XML text
XML has only one "basic" type -- text.
It is bounded by tags, e.g. <title> The Big Sleep </title> <year> 1935 </year> --- 1935 is still text
XML text is called PCDATA (for parsed character data). Characters are Unicode, typically encoded as UTF-8 or UTF-16.
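A small sketch of what "1935 is still text" means for a program: the parser hands back a string, and any numeric interpretation is the application's job. This uses Python's standard `xml.etree.ElementTree`; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

year = ET.fromstring("<year> 1935 </year>")

raw = year.text                 # PCDATA arrives as a string, spaces included
as_number = int(raw.strip())    # the application chooses to read it as an int

print(repr(raw), type(raw).__name__)
print(as_number + 1)
```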
XML – Nesting & Hierarchy
XML tags can be nested in a tree hierarchy
XML documents can have only one root tag
Between an opening and closing tag you can insert:
1. Data
2. More elements
3. A combination of data and elements
<root>
  <tag1>
    Some Text
    <tag2>More</tag2>
  </tag1>
</root>
Representing relational DBs: Two ways
projects:  title, budget, managedBy
employees: name, ssn, age
Project and Employee relations in XML
<db>
  <project>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
  </project>
  <employee>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
  </employee>
  <employee>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employee>
  <project>
    <title> Auto guided vehicle </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </project>
  :
</db>
Projects and employees are intermixed
Project and Employee relations in XML (cont'd)
<db>
  <projects>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <project>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
    :
  </projects>
  <employees>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
    :
  </employees>
</db>
Employees follow projects
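The grouped layout (one wrapper element per relation, one child per tuple, one grandchild per column) can be generated mechanically. The sketch below does so with Python's standard `xml.etree.ElementTree`; the helper `table_to_xml` and the in-memory row lists are assumptions made for this example, not part of the course material.

```python
import xml.etree.ElementTree as ET

# Rows of the two relations, copied from the slides.
projects = [
    {"title": "Pattern recognition", "budget": 10000, "managedBy": "Joe"},
    {"title": "Auto guided vehicles", "budget": 70000, "managedBy": "Sandra"},
]
employees = [
    {"name": "Joe", "ssn": 344556, "age": 34},
    {"name": "Sandra", "ssn": 2234, "age": 35},
]


def table_to_xml(parent, table_name, row_name, rows):
    """Append <table_name> holding one <row_name> element per row."""
    table = ET.SubElement(parent, table_name)
    for row in rows:
        row_el = ET.SubElement(table, row_name)
        for column, value in row.items():
            # Every column value becomes text: XML has no other basic type.
            ET.SubElement(row_el, column).text = str(value)
    return table


db = ET.Element("db")
table_to_xml(db, "projects", "project", projects)
table_to_xml(db, "employees", "employee", employees)
print(ET.tostring(db, encoding="unicode"))
```

The intermixed layout would instead append each `project` and `employee` element directly under `db`, in whatever order the tuples arrive.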
More XML: Oids and References
<person id="o555">
  <name> Jane </name>
</person>
<person id="o456">
  <name> Mary </name>
  <children idref="o123 o555"/>
</person>
<person id="o123" mother="o456">
  <name> John </name>
</person>
oids and references in XML are just syntax
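Since oids and references are just syntax, it is the application that has to follow them. A minimal sketch with Python's standard `xml.etree.ElementTree`: the wrapping `<people>` root and the helper names are assumptions added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The three person elements from the slide, wrapped in a hypothetical root.
PEOPLE = """
<people>
  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person>
  <person id="o123" mother="o456"> <name> John </name> </person>
</people>
"""

root = ET.fromstring(PEOPLE)
# Index every person by its id attribute so references can be followed.
by_id = {p.get("id"): p for p in root.findall("person")}


def name_of(person):
    """Return the trimmed text of the person's name element."""
    return person.findtext("name").strip()


def children_of(person):
    """Resolve the space-separated idref list into child names."""
    ref = person.find("children")
    if ref is None:
        return []
    return [name_of(by_id[oid]) for oid in ref.get("idref").split()]


mary = by_id["o456"]
print(children_of(mary))
print(name_of(by_id[by_id["o123"].get("mother")]))
```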
XML Data Model (Graph)
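[Figure: an XML document drawn as a graph. The root node db (#0) has book children, identified as b1 and b2, and a publisher child identified as mkp; nodes are numbered #0 to #7. Each book has title and author edges ending in pcdata leaves ("Complete...", "Principles...", Chamberlin, Bernstein, Newcomer). The publisher has name and state edges ending in pcdata leaves ("Morgan...", CA). Edges labeled pub connect the books to the publisher.]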
Document Type Definitions (DTDs)
<!ELEMENT Book (title, author*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name, address, age?)>
<!ATTLIST Book id ID #REQUIRED>
<!ATTLIST Book pub IDREF #IMPLIED>
Sort of like a schema but not really.
Inherited from SGML DTD standard
BNF grammar establishing constraints on element structure and content
Definitions of entities
DTD – An Example
<?xml version='1.0'?>
<!ELEMENT Basket (Cherry+, (Apple | Orange)*)>
<!ELEMENT Cherry EMPTY>
<!ATTLIST Cherry flavor CDATA #REQUIRED>
<!ELEMENT Apple EMPTY>
<!ATTLIST Apple color CDATA #REQUIRED>
<!ELEMENT Orange EMPTY>
<!ATTLIST Orange location CDATA 'Florida'>
--------------------------------------------------------------------------------
<Basket>
  <Apple/>
  <Cherry flavor='good'/>
  <Orange/>
</Basket>
<Basket>
  <Cherry flavor='good'/>
  <Apple color='red'/>
  <Apple color='green'/>
</Basket>
DTD - !ELEMENT
<!ELEMENT Basket (Cherry+, (Apple | Orange)*) >
[Figure labels on the declaration: Name = Basket, Children = (Cherry+, (Apple | Orange)*)]
!ELEMENT declares an element name and which child elements it should have
Content types:
• Other elements
• #PCDATA (parsed character data)
• EMPTY (no content)
• ANY (no checking inside this structure)
• A regular expression
DTD - !ELEMENT (Contd.)
A regular expression has the following structure:
• exp1, exp2, exp3, …, expk: A list of regular expressions
• exp*: An optional expression with zero or more occurrences
• exp+: An expression with one or more occurrences
• exp1 | exp2 | … | expk: A disjunction of expressions
DTD - !ATTLIST
<!ATTLIST Cherry flavor CDATA #REQUIRED>
[Figure labels on the declaration: Element = Cherry, Attribute = flavor, Type = CDATA, Flag = #REQUIRED]
<!ATTLIST Orange location CDATA #REQUIRED
                 color    CDATA 'orange'>
!ATTLIST defines a list of attributes for an element
Attributes can be of different types, can be required or not required, and they can have default values.
DTD – Well-Formed and Valid
<?xml version='1.0'?>
<!ELEMENT Basket (Cherry+)>
<!ELEMENT Cherry EMPTY>
<!ATTLIST Cherry flavor CDATA #REQUIRED>
--------------------------------------------------------------------------------
Well-Formed and Valid
<Basket>
  <Cherry flavor='good'/>
</Basket>
Not Well-Formed
<basket>
  <Cherry flavor=good>
</Basket>
Well-Formed but Invalid
<Job>
  <Location>Home</Location>
</Job>
Example: An Address Book
<person>
  <name> MacNiel, John </name>
  <greet> Dr. John MacNiel </greet>
  <addr> 1234 Huron Street </addr>
  <addr> Rome, OH 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2543 </fax>
  <tel> (321) 786 2543 </tel>
  <email> [email protected] </email>
</person>
name: exactly one name
greet: at most one greeting
addr: as many address lines as needed (in order)
tel, fax: mixed telephones and faxes
email: as many as needed
Specifying the structure
name           to specify a name element
greet?         to specify an optional (0 or 1) greet element
name, greet?   to specify a name followed by an optional greet
Specifying the structure (cont)
addr*          to specify 0 or more address lines
tel | fax      a tel or a fax element
(tel | fax)*   0 or more repeats of tel or fax
email*         0 or more email elements
A DTD for the address book
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, addr*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT addr (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
DTD for the example relational DB
<!DOCTYPE db [
  <!ELEMENT db (projects, employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>
Summary of XML regular expressions
Each element name is a tag.
Its components are the tags that appear nested within, in the order specified.
A         The tag A occurs
e1,e2     The expression e1 followed by e2
e*        0 or more occurrences of e
e?        Optional -- 0 or 1 occurrences
e+        1 or more occurrences
e1 | e2   either e1 or e2
(e)       grouping
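Because a DTD content model is a regular expression over child tags, its structural part can be imitated with an ordinary regex. The sketch below checks the order of a person element's children against the model name, greet?, addr*, (tel | fax)*, email* (using the tag addr, as in the example person element). It is not a DTD validator: it ignores attributes and text, and the names `PERSON_MODEL` and `matches_person_model` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Content model: name, greet?, addr*, (tel | fax)*, email*
# Each child tag is written out followed by one space, so the sequence
# of children becomes a string the regex can match from start to end.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")


def matches_person_model(person):
    """Check order and repetition of child tags against PERSON_MODEL."""
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None


ok = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/><email/></person>"
)
bad = ET.fromstring("<person><greet/><name/></person>")  # greet before name

print(matches_person_model(ok))
print(matches_person_model(bad))
```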
XML Querying
Path Expressions:
Bib.paper
Bib.book.publisher
Bib.paper.author.lastname
Given an OEM instance, the value of a path expression p is a set of objects
Path Expressions
Examples:
DB = [Figure: the OEM bibliography graph with every node labeled by an object id. The edge Bib enters the root &o1; paper, book and references edges connect &o12, &o24 and &o29; below them lie &o43 (with children &o70, &o71), &96 (with children &243, &206), &25 and &o44 through &o52. Atomic values include "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122 and 133.]
Bib.paper = {&o12, &o29}
Bib.book.publisher = {&o51}
Bib.paper.author.lastname = {&o71, &206}
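Evaluating a path expression amounts to following edge labels from the root and collecting the objects reached. The sketch below does this over a small hand-made fragment of the graph. The object ids follow the figure, but the exact edge layout in `GRAPH` (and the artificial `root` entry that carries the Bib edge) is an assumption chosen to agree with the three results shown above.

```python
# Hypothetical fragment of the OEM graph: object id -> {edge label: [targets]}.
GRAPH = {
    "root": {"Bib": ["&o1"]},
    "&o1": {"paper": ["&o12", "&o29"], "book": ["&o24"]},
    "&o12": {"author": ["&o43"]},
    "&o29": {"author": ["&96"]},
    "&o24": {"publisher": ["&o51"]},
    "&o43": {"firstname": ["&o70"], "lastname": ["&o71"]},
    "&96": {"firstname": ["&243"], "lastname": ["&206"]},
}


def eval_path(graph, start, path):
    """Return the set of objects reached from `start` along a dotted path."""
    current = {start}
    for label in path.split("."):
        reached = set()
        for oid in current:
            # Objects without an edge for this label simply contribute nothing.
            reached.update(graph.get(oid, {}).get(label, []))
        current = reached
    return current


for p in ("Bib.paper", "Bib.book.publisher", "Bib.paper.author.lastname"):
    print(p, "=", sorted(eval_path(GRAPH, "root", p)))
```

Missing edges do not raise errors; they just yield an empty set, which suits irregular data.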
XQuery
Emerging standard for querying XML documents. Basic form:
FOR <variables ranging over sets of elements>
WHERE <condition>
RETURN <set of elements>;
Sets of elements described by paths, consisting of:
1. URL, if necessary.
2. Element names forming a path in the semistructured data graph, e.g., //BAR/NAME = "start at any BAR node and go to a NAME child."
3. Ending condition of the form [<condition about subelements, @attributes, and values>]
XQuery
Overview: FOR-LET-WHERE-ORDERBY-RETURN = FLWOR
[Figure: processing pipeline]
FOR/LET clauses
  ↓ list of tuples
WHERE clause
  ↓ list of tuples
ORDERBY/RETURN clause
  ↓ instance of XQuery data model
XQuery
FOR $x IN expr -- binds $x to each value in the list expr
LET $x := expr -- binds $x to the entire list expr
• Useful for common subexpressions and for aggregations
FOR vs. LET
FOR $x IN document("bib.xml")/bib/book
RETURN <result> $x </result>
Returns:
<result> <book>...</book> </result>
<result> <book>...</book> </result>
<result> <book>...</book> </result>
...
LET $x := document("bib.xml")/bib/book
RETURN <result> $x </result>
Returns:
<result>
  <book>...</book>
  <book>...</book>
  <book>...</book>
  ...
</result>
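The difference can be imitated in Python: FOR behaves like a loop that builds one result per binding, LET like a single assignment of the whole list. This is only an analogy using the standard `xml.etree.ElementTree`, not an XQuery engine; the three-book document and the helper `wrap` are made up for this example.

```python
import xml.etree.ElementTree as ET

BIB = "<bib><book>A</book><book>B</book><book>C</book></bib>"
books = ET.fromstring(BIB).findall("book")


def wrap(elements):
    """Build one <result> element containing copies of the given elements."""
    result = ET.Element("result")
    for el in elements:
        child = ET.SubElement(result, el.tag)
        child.text = el.text
    return result


# FOR: the variable takes each book in turn, giving one <result> per book.
for_results = [wrap([x]) for x in books]

# LET: the variable is bound once to the whole list, giving a single <result>.
let_result = wrap(books)

print(len(for_results), [len(r) for r in for_results])
print(len(let_result))
```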
XQuery
Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/book
WHERE $x/year > 1995
RETURN $x/title
Result:
<title> abc </title>
<title> def </title>
<title> ghi </title>
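For comparison, the same FOR/WHERE/RETURN shape can be written as a Python comprehension over a parsed document. This is a sketch of the query's logic with the standard `xml.etree.ElementTree`, not XQuery itself; the sample `BIB` data and the function name `titles_after` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample data standing in for bib.xml.
BIB = """
<bib>
  <book> <title> abc </title> <year> 1998 </year> </book>
  <book> <title> old </title> <year> 1990 </year> </book>
  <book> <title> def </title> <year> 1999 </year> </book>
</bib>
"""


def titles_after(xml_text, year):
    """Titles of books whose year is greater than `year`."""
    root = ET.fromstring(xml_text)                # root is the bib element
    return [
        book.findtext("title").strip()            # RETURN $x/title
        for book in root.findall("book")          # FOR $x IN .../bib/book
        if int(book.findtext("year")) > year      # WHERE $x/year > 1995
    ]


print(titles_after(BIB, 1995))
```

Note the explicit `int(...)`: the year is text in the document, so the comparison needs a conversion that XQuery performs implicitly.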
XQuery
For each author of a book by Morgan Kaufmann, list all books s/he published:
FOR $a IN distinct(document("bib.xml")/bib/book[publisher="Morgan Kaufmann"]/author)
RETURN <result>
         $a,
         FOR $t IN document("bib.xml")/bib/book[author=$a]/title
         RETURN $t
       </result>
distinct = a function that eliminates duplicates
XQuery
Result:
<result>
  <author> Jones </author>
  <title> abc </title>
  <title> def </title>
</result>
<result>
  <author> Smith </author>
  <title> ghi </title>
</result>
XQuery
count = an (aggregate) function that returns the number of elements
<big_publishers>
  FOR $p IN distinct(document("bib.xml")//publisher)
  LET $b := document("bib.xml")/bib/book[publisher = $p]
  WHERE count($b) > 100
  RETURN $p
</big_publishers>
XQuery
Find books whose price is larger than average:
LET $a := avg(document("bib.xml")/bib/book/price)
FOR $b in document("bib.xml")/bib/book
WHERE $b/price > $a
RETURN $b
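The query combines LET (one aggregate over the whole list) with FOR (one test per book). A Python sketch of the same logic follows, using the standard `xml.etree.ElementTree` and `statistics.mean`. The sample prices and the function name are invented, and the sketch returns titles where the query returns whole book elements.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Made-up sample data standing in for bib.xml.
BIB = """
<bib>
  <book> <title> abc </title> <price> 30 </price> </book>
  <book> <title> def </title> <price> 50 </price> </book>
  <book> <title> ghi </title> <price> 100 </price> </book>
</bib>
"""


def pricier_than_average(xml_text):
    """Titles of the books whose price exceeds the average price."""
    books = ET.fromstring(xml_text).findall("book")
    prices = [float(b.findtext("price")) for b in books]
    average = mean(prices)                     # LET $a := avg(.../price)
    return [
        b.findtext("title").strip()            # RETURN (title of) $b
        for b, p in zip(books, prices)         # FOR $b IN .../bib/book
        if p > average                         # WHERE $b/price > $a
    ]


print(pricier_than_average(BIB))
```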
Examples for XQuery queries
FOR $x IN doc("www.company.com/info.xml")//employee[employeeSalary gt 70000]/employeeName
RETURN <res> $x/firstName, $x/lastName </res>

FOR $x IN doc("www.company.com/info.xml")/company/employee
WHERE $x/employeeSalary gt 70000
RETURN <res> $x/employeeName/firstName, $x/employeeName/lastName </res>

FOR $x IN doc("www.company.com/info.xml")/company/project[projectNumber = 5]/projectWorker,
    $y IN doc("www.company.com/info.xml")/company/employee
WHERE $x/hours gt 20.0 AND $y/ssn = $x/ssn
RETURN <res> $y/employeeName/firstName, $y/employeeName/lastName, $x/hours </res>