1
Introduction to XML
Yanlei Diao, UMass Amherst, April 19, 2007
Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.
2
Structure in Data Representation
Relational data is highly structured
– structure is defined by the schema
– good for system design
– good for precise query semantics / answers
Structure can be limiting
– data exchange is hard: integration of different schemas
– authoring is constrained: schema-first
– querying is constrained: must know the schema
– changes to structure are not easy
3
Data Integration
1. Find all departments whose total employee salaries exceed 1% of the budget of the company.
2. Find names of employees with the top sales record last month.
[Figure: sites in the US, Europe, Asia, and Australia connected via the Internet]
4
WWW
Structured data - Databases
Unstructured Text - Documents
Semistructured Data
Integration of Text and Structured Data
5
Need for A New Data Model
Loose (and rich) structure
Integration of structured, but heterogeneous data sources
Evolving, unknown, or irregular structure
Textual data with tags and links
Combination of data models
6
XML: Universal Data Exchange Format
XML is the confluence of many factors:
– Databases needed a more flexible interchange format.
– Data needed to be generated and consumed by applications.
– The Web needed a more declarative format for data.
– Documents needed a mechanism for extended tags.
XML was originally proposed for online publishing and is becoming the wire format for data exchange.
W3C Recommendation: http://www.w3.org/TR/REC-xml/
7
From HTML to XML
HTML describes the presentation.
8
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
    Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
    Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999
9
XML
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML describes the content!
10
XML: Syntax & Typing
11
XML Syntax
Tags: book, title, author, …
– start tag: <book>
– end tag: </book>
Elements: <book>…</book>, <author>…</author>
– elements are nested
– empty element: <red></red>, abbreviated <red/>
An XML document: single root element
An XML document is well formed if it has matching tags
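The well-formedness rules above can be checked with any XML parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative and not part of the slides:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as XML: one root element, matching tags."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents.
good = "<book><title>Foundations of Databases</title><red/></book>"
mismatched = "<book><title>Foundations of Databases</book></title>"
two_roots = "<book/><book/>"

for sample in (good, mismatched, two_roots):
    print(is_well_formed(sample), sample)
```

The parser rejects both mismatched tags and a second root element, which are the two conditions named on the slide.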
12
XML Syntax
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  …
  <year> 1995 </year>
</book>
Attributes are alternative ways to represent data.
13
XML Syntax
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person>
<person id="o123" mother="o456"> <name> John </name> </person>
Oids and references in XML are just syntax.
14
XML Semantics: a Tree!
<data>
  <person id="o555">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>
[Figure: tree for the document above. Element nodes: data, person, name, address, street, no, city, phone. Text nodes: Mary, Maple, 345, Seattle, John, Thailand, 23456. Attribute node: id = o555.]
Order matters! IDREFs turn the tree into a graph.
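To make the tree view concrete, here is a minimal sketch that parses the document from this slide and lists its element, attribute, and text nodes in document order. It assumes Python's standard `xml.etree.ElementTree`; the function name `outline` is illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<data>
  <person id="o555">
    <name> Mary </name>
    <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>"""

def outline(element, depth=0):
    """List an element, then its attributes, its text, and its children in order."""
    lines = ["  " * depth + element.tag]
    for key, value in element.attrib.items():
        lines.append("  " * (depth + 1) + f"@{key} = {value}")   # attribute node
    text = (element.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + f'"{text}"')            # text node
    for child in element:                                          # children keep document order
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(doc)
print("\n".join(outline(root)))
```

The two person subtrees have different shapes (a structured versus a plain-text address), which a single relational schema would not allow.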
15
XML Data
XML is self-describing
Schema elements become part of the data
– Relational schema: persons(name,phone)
– In XML <persons>, <name>, <phone> are part of the data, and are repeated many times
Consequence: XML is much more flexible
Some real data:
http://www.cs.washington.edu/research/xmldatasets/
16
Relational Data as XML
Relation person:
name   phone
John   3634
Sue    6343
Dick   6363

XML:
<person>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</person>
[Figure: tree with root person and three row children, each with a name child and a phone child]
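The mapping from a relation to XML shown on this slide is mechanical: one element for the relation, one `row` element per tuple, one child element per column. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the function name `relation_to_xml` is illustrative.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(relation_name, columns, rows):
    """Encode a relation as <relation><row><col>value</col>...</row>...</relation>."""
    root = ET.Element(relation_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return root

person = relation_to_xml(
    "person",
    ["name", "phone"],
    [("John", 3634), ("Sue", 6343), ("Dick", 6363)],
)
print(ET.tostring(person, encoding="unicode"))
```

Note that the column names are repeated in every row, which is the "schema becomes part of the data" point made earlier.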
17
XML is Semi-structured Data
Missing attributes:
<data>
  <person> <name>John</name> <phone>1234</phone> </person>
  <person> <name>Joe</name> </person>        ← no phone!
</data>
Could represent in a table with nulls:
name   phone
John   1234
Joe    -
18
XML is Semi-structured Data
Repeated attributes:
<person>
  <name>Mary</name>
  <phone>2345</phone>
  <phone>3456</phone>        ← two phones!
</person>
Impossible in tables: nested collections (non 1NF)
name   phone
Mary   2345   3456   ???
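The two cases above (missing and repeated attributes) can be seen side by side by reading the XML into a nested structure and then flattening it to first normal form. This is only a sketch using Python's standard `xml.etree.ElementTree`; the sample document combines the slide examples, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<data>
  <person> <name>John</name> <phone>1234</phone> </person>
  <person> <name>Joe</name> </person>
  <person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>
</data>"""

def person_rows(root):
    """One (name, [phones]) pair per person: a nested, non-1NF view."""
    return [
        (person.findtext("name"), [p.text for p in person.findall("phone")])
        for person in root.findall("person")
    ]

def to_first_normal_form(rows):
    """Flatten to (name, phone) tuples: None for a missing phone, one tuple per repeated phone."""
    flat = []
    for name, phones in rows:
        for phone in phones or [None]:
            flat.append((name, phone))
    return flat

nested = person_rows(ET.fromstring(doc))
print(nested)
print(to_first_normal_form(nested))
```

Flattening works, but the name is now duplicated for Mary and a null appears for Joe, which is the price of forcing semi-structured data into a fixed table.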
19
XML is Semi-structured Data
Attributes with different types in different objects
<data>
  <person>
    <name> <first> John </first> <last> Smith </last> </name>        ← structured name!
    <phone>1234</phone>
  </person>
  <person>
    <name> M. Carey </name>        ← unstructured name!
    <phone>3456</phone>
  </person>
</data>
Mixed content:
– <db> contains both <book>s and <publisher>s
20
Data Typing in XML
Data typing in the relational model: schema
Data typing in XML
– Much more complex
– Typing restricts valid trees that can occur
  • theoretical foundation: tree languages
– Practical methods:
  • DTD (Document Type Definition)
  • XML Schema
21
Document Type Definitions (DTD)
Part of the original XML specification
To be replaced by XML Schema
– Much more complex
An XML document may have a DTD
XML document:
– well-formed = if tags are correctly closed
– valid = if it has a DTD and conforms to it
Validation is useful in data exchange
22
DTD Example
<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>
23
DTD Example
Example of valid XML document:
<company>
  <person>
    <ssn> 123456789 </ssn>
    <name> John </name>
    <office> B432 </office>
    <phone> 1234 </phone>
  </person>
  <person>
    <ssn> 987654321 </ssn>
    <name> Jim </name>
    <office> B123 </office>
  </person>
  <product> ... </product>
  ...
</company>
24
DTD: The Content Model
<!ELEMENT tag (CONTENT)>        ← CONTENT is the content model
Content model:
– Complex = a regular expression over other elements
– Text-only = #PCDATA
– Empty = EMPTY
– Any = ANY
– Mixed content = (#PCDATA | A | B | C)*
25
DTD: Regular Expressions
sequence:
  DTD: <!ELEMENT name (firstName, lastName)>
  XML: <name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName> </name>
optional:
  DTD: <!ELEMENT name (firstName?, lastName)>
Kleene star:
  DTD: <!ELEMENT person (name, phone*)>
  XML: <person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . . </person>
alternation:
  DTD: <!ELEMENT person (name, (phone|email))>
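Because a complex content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The following is a rough sketch, not a DTD validator: it assumes element-only content models (no #PCDATA, EMPTY, or ANY), checks only the direct children of one element, and uses illustrative function names.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model):
    """Translate an element-only content model into a regex over 'name ' tokens."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching that name plus a separator space.
    with_names = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        compact,
    )
    # Sequence (comma) is plain concatenation; |, ?, *, + and parentheses carry over.
    return re.compile(with_names.replace(",", ""))

def conforms(element, model):
    """True if the sequence of child tags of `element` matches the content model."""
    children = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(children) is not None

person = ET.fromstring(
    "<person><name>A</name><phone>1</phone><phone>2</phone></person>"
)
print(conforms(person, "(name, phone*)"))           # Kleene star allows repeated phones
print(conforms(person, "(name, (phone|email))"))    # alternation allows exactly one
```

Real validators also check text content, attributes, and every element in the document against its own declaration.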
26
Attributes in DTDs
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED>

<person age="25">
  <name> .... </name>
  ...
</person>
27
Attributes in DTDs
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED>

<person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
  <name> .... </name>
  ...
</person>
28
Attributes in DTDs
Types:
– CDATA = string
– ID = key
– IDREF = foreign key
– IDREFS = foreign keys separated by space
– (Monday | Wednesday | Friday) = enumeration
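The key / foreign-key reading of ID and IDREF can be checked by hand when no validating parser is available. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (which does not read the DTD, so the attribute names `id`, `manager`, and `manages` are passed in explicitly); the sample document and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: p3 refers to a manager that does not exist.
doc = """<company>
  <person id="p1" manages="p2 p3"/>
  <person id="p2" manager="p1"/>
  <person id="p3" manager="p9"/>
</company>"""

def dangling_references(root, id_attr="id", idref_attrs=("manager",), idrefs_attrs=("manages",)):
    """Return (attribute, value) pairs whose value matches no ID in the document."""
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    dangling = []
    for element in root.iter():
        for attr in idref_attrs:                      # IDREF: a single reference
            value = element.get(attr)
            if value is not None and value not in ids:
                dangling.append((attr, value))
        for attr in idrefs_attrs:                     # IDREFS: space-separated references
            for value in (element.get(attr) or "").split():
                if value not in ids:
                    dangling.append((attr, value))
    return dangling

print(dangling_references(ET.fromstring(doc)))
```

A validating parser would additionally reject duplicate ID values, which this sketch does not check.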
29
Attributes in DTDs
Kind:
– #REQUIRED
– #IMPLIED = optional
– value = default value
– #FIXED value = the only value allowed
30
Using DTDs
Must include in the XML document
Either include the entire DTD:
– <!DOCTYPE rootElement [ ....... ]>
Or include a reference to it:
– <!DOCTYPE rootElement SYSTEM "http://www.mydtd.org">
Or mix the two... (e.g. to override the external definition)
31
XML Schema
DTDs capture grammatical structure, but have some drawbacks:
– Not themselves in XML, inconvenient to build tools
– Don’t capture database datatypes’ domains
– No way of defining OO-like inheritance…
XML Schema addresses shortcomings of DTDs
– XML syntax
– Subclassing
– Domains and built-in datatypes
– min. and max. # of occurrences of elements
http://www.w3.org/XML/Schema
32
Basics of XML Schema
Need to use the XML Schema namespace (generally named xsd)
simpleTypes are a way of restricting domains on scalars
– Can define a simpleType based on integer, with values within a particular range
complexTypes are a way of defining element structures
– Basically equivalent to !ELEMENT, but more powerful
– Specify sequence, choice between child elements
– Specify minOccurs and maxOccurs (default 1)
Must associate an element/attribute with a simpleType, or an element with a complexType
33
Simple Schema Example
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <!-- CommitteeType is not defined in this excerpt -->
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <!-- attribute declarations must follow the content model -->
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>
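Because an XML Schema is itself XML, it can be read with ordinary XML tools, which is one of the advantages over DTDs listed earlier. The sketch below lists the child elements declared for ThesisType, with the minOccurs and maxOccurs defaults of 1 applied. It assumes Python's standard `xml.etree.ElementTree`, which only parses the schema and does not validate documents against it; the function name is illustrative, and the schema text is the slide's example with the sequence placed before the attribute declarations.

```python
import xml.etree.ElementTree as ET

XSD = {"xsd": "http://www.w3.org/2001/XMLSchema"}

schema_text = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>"""

def sequence_declarations(schema_root, type_name):
    """Return (name, type, minOccurs, maxOccurs) for each element in the type's sequence."""
    complex_type = schema_root.find(f"xsd:complexType[@name='{type_name}']", XSD)
    return [
        (e.get("name"), e.get("type"), e.get("minOccurs", "1"), e.get("maxOccurs", "1"))
        for e in complex_type.findall("xsd:sequence/xsd:element", XSD)
    ]

declarations = sequence_declarations(ET.fromstring(schema_text), "ThesisType")
for declaration in declarations:
    print(declaration)
```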
34
Questions
35
How the Web was Yesterday
HTML documents
• often generated by applications
• consumed by humans only
• easy access: across platforms, across organizations
No application interoperability:
• HTML not understood by applications
• Database technology: client-server
36
Application Interoperability
[Figure: Amazon sending a purchase order over the Internet to Supplier1, Supplier2, and Supplier3]
39
New Universal Data Exchange Format: XML
A recommendation from the W3C
XML = data
XML generated by applications
XML consumed by applications
Easy access: across platforms, organizations
40
XML
A W3C standard to complement HTML
Origins: Structured text SGML
• Large-scale electronic publishing
• Data exchange on the web
Motivation:
• HTML describes presentation
• XML describes content
http://www.w3.org/TR/2000/REC-xml-20001006 (second edition, 10/2000)
41
Paradigm Shift on the Web
From documents (HTML) to data (XML)
From information retrieval to data management
For databases, also a paradigm shift:
• from relational model to XML model
• from data processing to data/query translation
• from storage to transport
42
Database Issues
How are we going to model XML? (graphs)
Compared to relational model,
• XML is hierarchical
• XML allows missing or additional attributes
• XML allows multiple instances of an attribute (set-valued)
• XML allows different types in different objects
• XML integrates structure and text data
• …
How are we going to query XML? (XQuery)
How are we going to store XML? (in a relational database? object-oriented? native?)
How are we going to process XML efficiently? (many interesting research questions!)
43
Designing an XML Schema/DTD
Not as formalized as relational data design
– We can still use ER diagrams to break into entity, relationship sets
– Note that often we already have our data in relations and need to design the XML schema to export them!
Generally orient the XML tree around the “central” objects
Big decision: element vs. attribute
– Element if it has its own properties, or if you *might* have more than one of them
– Attribute if it is a single property – or perhaps not!