Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
1
1
XML andSemistructured Data
Lecture 8
2
Outline• XML (Section 17)
– XML syntax, semistructured data– Document Type Definitions (DTDs)
• XPath
2
3
Additional Readings on XML• XML
– http://www.w3.org/XML/1999/XML-in-10-points– www.zvon.org/xxl/XMLTutorial/General/book_en.html– http://db.bell-labs.com/galax/– http://www.w3.org/TR/REC-xml-names (1/99)
• Xpath– http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html
• Xquery– http://www.w3.org/TR/xmlquery-use-cases/– http://www.xmlportfolio.com/xquery.html
• Main source: www.w3.org (but hard to read)
4
XML
• eXtensible Markup Language• XML 1.0 – a recommendation from W3C, 1998• Roots: SGML (used in publishing).• After the roots: a format for sharing data
3
5
XML Data• Relational data does not have a syntax
– I can’t “give” you my relational database– Need to import it from other syntax,
like CSV (comma-separated-values)• XML = rich syntax for data
– But XML is not relational: semistructured• Usage:
– Map any data to XML– Store it in files, exchange on the Web, etc.– Even query it directly, using XPath, XQuery
6
XML Data Sharing and Exchange
application
relational data
Transform
Integrate
Warehouse
XML Data WEB (HTTP)
application
application
legacy data
object-relational
Specific data management tasks
4
7
From HTML to XML
HTML describes the layout
8
HTML
<h1> Bibliography </h1><p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995<p> <i> Data on the Web </i> Abiteoul, Buneman, Suciu <br> Morgan Kaufmann, 1999
5
9
XML<bibliography>
<book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> …
</bibliography>
XML describes the structure
10
XML Terminology• tags: book, title, author, …• start tag: <book>, end tag: </book>• elements: <book>…</book>,<author>…</author>• elements are nested• empty element: <red></red> abbrv. <red/>
well formed XML document• if it has matching tags• tags are properly nested• single root element• and more constraints, e.g. on names
6
11
More XML: Attributes<book price = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year></book>
attributes are alternative ways to represent data
12
More XML: IDs and References<person id=“o555”> <name> Jane </name> </person>
<person id=“o456”> <name> Mary </name> <children idref=“o123 o555”/></person>
<person id=“o123” mother=“o456”><name>John</name></person>
Scope of IDs and references is the document
7
13
More XML: CDATA Section
• Syntax: <![CDATA[ .....any text here...]]>
• Example:
<example> <![CDATA[ some text here </notAtag> <>]]></example>
14
More XML: Entity References
• Syntax: &entityname;• Used like macros• Example:
<element>this is less than <</element>
Unicode char&
“"
‘'
&&
>>
<<
complete list: http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html
some predefined entities
8
15
More XML: ProcessingInstructions
• Syntax: <?target argument?>• Example:
• Processed by external applications, e.g. php(bad style)
<product> <name> Alarm Clock </name> <?ringBell 20?> <price> 19.99 </price></product>
16
More XML: Comments
• Syntax <!-- .... Comment text... -->
• Yes, they are part of the data model !!!
9
17
XML Data: a Tree !<data>
<person id=“o555” ><name> Mary </name><address>
<street> Maple </street> <no> 345 </no> <city> Seattle </city>
</address></person><person>
<name> John </name><address> Thailand </address><phone> 23456 </phone>
</person></data>
data
Mary
personperson
name addressname address
street no city
Maple 345 Seattle
JohnThai
phone
23456
id
o555
Elementnode
Textnode
Attributenode
Order matters !!!
18
From Relational Data to XML Data
<persons><row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row>
</persons>
n a m e p h o n e
J o h n 3 6 3 4
S u e 6 3 4 3
D i c k 6 3 6 3
row row row
name name namephone phone phone
“John” 3634 “Sue” “Dick”6343 6363
persons XML: persons
10
19
XML Data• XML is self-describing• Schema elements become part of the data
– Relational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part of the
data, and are repeated many times• Consequence: XML is much more flexible• XML = semistructured data
20
Semi-structured Data Explained• Missing attributes:
• Could represent ina table with nulls
<person> <name> John</name> <phone>1234</phone> </person>
<person> <name>Joe</name></person> no phone !
-Joe1234Johnphonename
11
21
Semi-structured Data Explained• Repeated attributes
• Impossible in tables:
<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>
two phones !
34562345Maryphonename
???
22
Semistructured Data Explained• Attributes with different types in different objects
• Nested collections (no 1NF)• Heterogeneous collections:
– <db> contains both <book>s and <publisher>s
<person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone></person>
structured name !
12
23
Document Type DefinitionsDTD
• part of the original XML specification• an XML document may have a DTD• XML document:
well-formed = if tags are correctly closedvalid = if it has a DTD and conforms to it
• validation is useful in data exchange
24
Very Simple DTD
<!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)>]>
13
25
Very Simple DTD
<company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ...</company>
Example of valid XML document:
26
DTD: The Content Model
• Content model:– Complex = a regular expression over other elements– Text-only = #PCDATA– Empty = EMPTY– Any = ANY– Mixed content = (#PCDATA | A | B | C)*
<!ELEMENT tag (CONTENT)>
contentmodel
14
27
DTD: Regular Expressions
<!ELEMENT name (firstName, lastName))
<name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName></name>
<!ELEMENT name (firstName?, lastName))
DTD XML
<!ELEMENT person (name, phone*))
sequence
optional
<!ELEMENT person (name, (phone|email)))
star (repeated occurrence)
alternation
<person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . .</person>
<name> <lastName> . . . . . </lastName></name>
<person> <name> . . . . . </name> <email> . . . . . </email> </person>
28
DTD: Attributes• Document Type Definition
<!ELEMENT person (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED "18"
birthdate CDATA #IMPLIED nationality CDATA #FIXED "CH"
gender (male|female) "female">
• Document<person age="24" nationality="CH" gender="male">
<ssn> … </ssn> …<phone> … </phone></person>
mandatory
optional
default
enumeration
15
29
DTD: Entities
• DTD:
• Document:internal entity
external entity
30
Inclusion of DTD in Documents<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" SYSTEM "http://www.test.org/test.dtd"><test> "test" is a document element </test>
<!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [ <!ENTITY hello "hello world">]><test>&hello;</test>
<!DOCTYPE test [ <!ELEMENT test EMPTY> ]><test/>
External DTD Declaration
Internal DTD Declaration
Mixed usage
16
31
XML Namespaces• Different DTDs can use the same names!
– how to avoid conflicts when combining namesfrom different DTDs?
• XML namespace is a collection of names(markup vocabulary)– identified by a prefix (URL reference)
32
XML Namespaces
• name ::= [prefix:]localname
<book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'> <title> … </title> <number> 15 </number> <isbn:number> …. </isbn:number></book>
default name space
names belong todefault name space
17
33
<tag xmlns:mystyle = “http://…”>
…
<mystyle:title> … </mystyle:title>
<mystyle:number> …
</tag>
XML Namespaces
• syntactic: <number> , <isbn:number>• semantic: URL used as unique identifier
– URL may not exist, has no function
Belong to this namespace
34
Querying XML Data• XPath = simple navigation through the tree• XQuery = the SQL of XML• XSLT = recursive traversal
– will not discuss
• XQuery and XSLT build on XPath
18
35
Sample Data for Queries<bib>
<book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>
</bib>
36
Data Model for XPath
bib
book book
publisher author . . . .
Addison-Wesley Serge Abiteboul
The root
The root element
19
37
XPath: Simple Expressions
Result: <year> 1995 </year> <year> 1998 </year>
Result: empty (there were no papers)
/bib/book/year
/bib/paper/year
38
XPath: Restricted KleeneClosure
Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author>
Result: <first-name> Rick </first-name>
//author
/bib//first-name
20
39
XPath: Text Nodes
Result: Serge Abiteboul Jeffrey D. Ullman
Rick Hull doesn’t appear because he has firstname, lastname
Functions in XPath:– text() = matches the text value– node() = matches any node (= * or @* or text())– name() = returns the name of the current tag
/bib/book/author/text()
40
XPath: Wildcard
Result: <first-name> Rick </first-name> <last-name> Hull </last-name>
* Matches any element
//author/*
21
41
XPath: Attribute Nodes
Result: “55”
@price means that price is has to be anattribute
/bib/book/@price
42
XPath: Predicates
Result: <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
/bib/book/author[firstname]
22
43
XPath: More Predicates
Result: <lastname> … </lastname> <lastname> … </lastname>
/bib/book/author[firstname][address[.//zip][city]]/lastname
44
XPath: More Predicates
/bib/book[@price < “60”]
/bib/book[author/@age < “25”]
/bib/book[author/text()]
23
45
XPath: Summarybib matches a bib element* matches any element/ matches the root element/bib matches a bib element under rootbib/paper matches a paper in bibbib//paper matches a paper in bib, at any depth//paper matches a paper at any depthpaper|book matches a paper or a book@price matches a price attributebib/book/@price matches price attribute in book, in bibbib/book[@price<“55”]/author/lastname matches…