23
1 1 XML and Semistructured Data Lecture 8 2 Outline XML (Section 17) – XML syntax, semistructured data – Document Type Definitions (DTDs) • XPath

XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

1

1

XML andSemistructured Data

Lecture 8

2

Outline• XML (Section 17)

– XML syntax, semistructured data– Document Type Definitions (DTDs)

• XPath

Page 2: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

2

3

Additional Readings on XML• XML

– http://www.w3.org/XML/1999/XML-in-10-points– www.zvon.org/xxl/XMLTutorial/General/book_en.html– http://db.bell-labs.com/galax/– http://www.w3.org/TR/REC-xml-names (1/99)

• Xpath– http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html

• Xquery– http://www.w3.org/TR/xmlquery-use-cases/– http://www.xmlportfolio.com/xquery.html

• Main source: www.w3.org (but hard to read)

4

XML

• eXtensible Markup Language• XML 1.0 – a recommendation from W3C, 1998• Roots: SGML (used in publishing).• After the roots: a format for sharing data

Page 3: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

3

5

XML Data• Relational data does not have a syntax

– I can’t “give” you my relational database– Need to import it from other syntax,

like CSV (comma-separated-values)• XML = rich syntax for data

– But XML is not relational: semistructured• Usage:

– Map any data to XML– Store it in files, exchange on the Web, etc.– Even query it directly, using XPath, XQuery

6

XML Data Sharing and Exchange

application

relational data

Transform

Integrate

Warehouse

XML Data WEB (HTTP)

application

application

legacy data

object-relational

Specific data management tasks

Page 4: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

4

7

From HTML to XML

HTML describes the layout

8

HTML

<h1> Bibliography </h1><p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995<p> <i> Data on the Web </i> Abiteoul, Buneman, Suciu <br> Morgan Kaufmann, 1999

Page 5: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

5

9

XML<bibliography>

<book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> …

</bibliography>

XML describes the structure

10

XML Terminology• tags: book, title, author, …• start tag: <book>, end tag: </book>• elements: <book>…</book>,<author>…</author>• elements are nested• empty element: <red></red> abbrv. <red/>

well formed XML document• if it has matching tags• tags are properly nested• single root element• and more constraints, e.g. on names

Page 6: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

6

11

More XML: Attributes<book price = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year></book>

attributes are alternative ways to represent data

12

More XML: IDs and References<person id=“o555”> <name> Jane </name> </person>

<person id=“o456”> <name> Mary </name> <children idref=“o123 o555”/></person>

<person id=“o123” mother=“o456”><name>John</name></person>

Scope of IDs and references is the document

Page 7: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

7

13

More XML: CDATA Section

• Syntax: <![CDATA[ .....any text here...]]>

• Example:

<example> <![CDATA[ some text here </notAtag> <>]]></example>

14

More XML: Entity References

• Syntax: &entityname;• Used like macros• Example:

<element>this is less than &lt;</element>

Unicode char&#38;

“&quot;

‘&apos;

&&amp;

>&gt;

<&lt;

complete list: http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html

some predefined entities

Page 8: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

8

15

More XML: ProcessingInstructions

• Syntax: <?target argument?>• Example:

• Processed by external applications, e.g. php(bad style)

<product> <name> Alarm Clock </name> <?ringBell 20?> <price> 19.99 </price></product>

16

More XML: Comments

• Syntax <!-- .... Comment text... -->

• Yes, they are part of the data model !!!

Page 9: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

9

17

XML Data: a Tree !<data>

<person id=“o555” ><name> Mary </name><address>

<street> Maple </street> <no> 345 </no> <city> Seattle </city>

</address></person><person>

<name> John </name><address> Thailand </address><phone> 23456 </phone>

</person></data>

data

Mary

personperson

name addressname address

street no city

Maple 345 Seattle

JohnThai

phone

23456

id

o555

Elementnode

Textnode

Attributenode

Order matters !!!

18

From Relational Data to XML Data

<persons><row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row>

</persons>

n a m e p h o n e

J o h n 3 6 3 4

S u e 6 3 4 3

D i c k 6 3 6 3

row row row

name name namephone phone phone

“John” 3634 “Sue” “Dick”6343 6363

persons XML: persons

Page 10: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

10

19

XML Data• XML is self-describing• Schema elements become part of the data

– Relational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part of the

data, and are repeated many times• Consequence: XML is much more flexible• XML = semistructured data

20

Semi-structured Data Explained• Missing attributes:

• Could represent ina table with nulls

<person> <name> John</name> <phone>1234</phone> </person>

<person> <name>Joe</name></person> no phone !

-Joe1234Johnphonename

Page 11: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

11

21

Semi-structured Data Explained• Repeated attributes

• Impossible in tables:

<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>

two phones !

34562345Maryphonename

???

22

Semistructured Data Explained• Attributes with different types in different objects

• Nested collections (no 1NF)• Heterogeneous collections:

– <db> contains both <book>s and <publisher>s

<person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone></person>

structured name !

Page 12: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

12

23

Document Type DefinitionsDTD

• part of the original XML specification• an XML document may have a DTD• XML document:

well-formed = if tags are correctly closedvalid = if it has a DTD and conforms to it

• validation is useful in data exchange

24

Very Simple DTD

<!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)>]>

Page 13: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

13

25

Very Simple DTD

<company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ...</company>

Example of valid XML document:

26

DTD: The Content Model

• Content model:– Complex = a regular expression over other elements– Text-only = #PCDATA– Empty = EMPTY– Any = ANY– Mixed content = (#PCDATA | A | B | C)*

<!ELEMENT tag (CONTENT)>

contentmodel

Page 14: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

14

27

DTD: Regular Expressions

<!ELEMENT name (firstName, lastName))

<name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName></name>

<!ELEMENT name (firstName?, lastName))

DTD XML

<!ELEMENT person (name, phone*))

sequence

optional

<!ELEMENT person (name, (phone|email)))

star (repeated occurrence)

alternation

<person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . .</person>

<name> <lastName> . . . . . </lastName></name>

<person> <name> . . . . . </name> <email> . . . . . </email> </person>

28

DTD: Attributes• Document Type Definition

<!ELEMENT person (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED "18"

birthdate CDATA #IMPLIED nationality CDATA #FIXED "CH"

gender (male|female) "female">

• Document<person age="24" nationality="CH" gender="male">

<ssn> … </ssn> …<phone> … </phone></person>

mandatory

optional

default

enumeration

Page 15: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

15

29

DTD: Entities

• DTD:

• Document:internal entity

external entity

30

Inclusion of DTD in Documents<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" SYSTEM "http://www.test.org/test.dtd"><test> "test" is a document element </test>

<!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [ <!ENTITY hello "hello world">]><test>&hello;</test>

<!DOCTYPE test [ <!ELEMENT test EMPTY> ]><test/>

External DTD Declaration

Internal DTD Declaration

Mixed usage

Page 16: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

16

31

XML Namespaces• Different DTDs can use the same names!

– how to avoid conflicts when combining namesfrom different DTDs?

• XML namespace is a collection of names(markup vocabulary)– identified by a prefix (URL reference)

32

XML Namespaces

• name ::= [prefix:]localname

<book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'> <title> … </title> <number> 15 </number> <isbn:number> …. </isbn:number></book>

default name space

names belong todefault name space

Page 17: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

17

33

<tag xmlns:mystyle = “http://…”>

<mystyle:title> … </mystyle:title>

<mystyle:number> …

</tag>

XML Namespaces

• syntactic: <number> , <isbn:number>• semantic: URL used as unique identifier

– URL may not exist, has no function

Belong to this namespace

34

Querying XML Data• XPath = simple navigation through the tree• XQuery = the SQL of XML• XSLT = recursive traversal

– will not discuss

• XQuery and XSLT build on XPath

Page 18: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

18

35

Sample Data for Queries<bib>

<book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>

</bib>

36

Data Model for XPath

bib

book book

publisher author . . . .

Addison-Wesley Serge Abiteboul

The root

The root element

Page 19: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

19

37

XPath: Simple Expressions

Result: <year> 1995 </year> <year> 1998 </year>

Result: empty (there were no papers)

/bib/book/year

/bib/paper/year

38

XPath: Restricted KleeneClosure

Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author>

Result: <first-name> Rick </first-name>

//author

/bib//first-name

Page 20: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

20

39

XPath: Text Nodes

Result: Serge Abiteboul Jeffrey D. Ullman

Rick Hull doesn’t appear because he has firstname, lastname

Functions in XPath:– text() = matches the text value– node() = matches any node (= * or @* or text())– name() = returns the name of the current tag

/bib/book/author/text()

40

XPath: Wildcard

Result: <first-name> Rick </first-name> <last-name> Hull </last-name>

* Matches any element

//author/*

Page 21: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

21

41

XPath: Attribute Nodes

Result: “55”

@price means that price is has to be anattribute

/bib/book/@price

42

XPath: Predicates

Result: <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>

/bib/book/author[firstname]

Page 22: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

22

43

XPath: More Predicates

Result: <lastname> … </lastname> <lastname> … </lastname>

/bib/book/author[firstname][address[.//zip][city]]/lastname

44

XPath: More Predicates

/bib/book[@price < “60”]

/bib/book[author/@age < “25”]

/bib/book[author/text()]

Page 23: XML and Semistructured Datalsir · 2006-05-08 · 3 5 XML Data •Relational data does not have a syntax –I can’t “give” you my relational database –Need to import it from

23

45

XPath: Summarybib matches a bib element* matches any element/ matches the root element/bib matches a bib element under rootbib/paper matches a paper in bibbib//paper matches a paper in bib, at any depth//paper matches a paper at any depthpaper|book matches a paper or a book@price matches a price attributebib/book/@price matches price attribute in book, in bibbib/book[@price<“55”]/author/lastname matches…