48
CIS 670 Fall 2001 (LN 5) 1 XML Introduction to XML XML basics DTDs XML and semistructured data Query languages for XML XML-QL, XQL, XSL XML extensions XML-Data, XLink, XPointer

CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

Embed Size (px)

Citation preview

Page 1: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 1

XML

Introduction to XML

– XML basics

– DTDs

– XML and semistructured data

Query languages for XML

XML-QL, XQL, XSL

XML extensions

XML-Data, XLink, XPointer

Page 2: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 2

XML’s history: SGML, HTML, XML

SGML: Standard Generalized Markup Language

-- Charles Goldfarb, ISO 8879, 1986

DTDs (Document Type Declarations)

powerful and flexible tool for structuring information, but

– complete, generic implementation of SGML proved extremely difficult

– tools for working with SGML documents proved expensive

two children that have outpaced SGML:

– HTML: HyperText Markup Language (Tim Berners-Lee, 1991). Describing presentation.

– XML: eXtensible Markup Language, W3C, 1998. Describing content.

Page 3: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 3

From HTML to XML

HTML is good for presentation (human friendly), but does not help automatic data extraction by means of programs (not computer friendly).

Why? HTML tags:

predefined and fixed

describing display format, not the structure of the data.

<h3> Book </h3>

<ul>

<il><i> SGML </i> Goldfarb <br>

<b> 1986 </b>

<il><b> XML </b> W3C <br>

</ul>

Page 4: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 4

XML: a first glance

XML tags:

user defined

describing the structure of the data

<book id = “B1” >

<title> SGML </title>

<author> Goldfarb </author>

<year> 1986 </year>

</book>

<book id = “B2” >

<title> XML </title>

<author> W3C </author>

</book>

Page 5: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 5

XML vs. HTML

user defined new tags, describing structure instead of display

structures can be arbitrarily nested (even recursively defined)

optional description of its grammar (DTD) and thus validation is possible

XML presentation:

XML standard does not define how data should be displayed

Style sheet: provide browsers with a set of formatting rules to be applied to particular elements

– CSS (Cascading Style Sheets), originally for HTML

– XSL (eXtensible Style Language), for XML

Page 6: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 6

Web applications

data exchange

data transformation

data integration

data extraction

E-commerce

Page 7: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 7

XML basics (1)

XML consists of tags and text

<book id = “B2” >

<title> XML </title>

<author> W3C </author>

</book>

tags come in pairs: markups

– start tag, e.g., <book>

– end tag, e.g., </book>

tags must be properly nested

– <book> <title> … </title> </book> -- good

– <book> <title> … </book> </title> -- bad

XML has only one “basic” type: text, called PCDATA (Parsed Character DATA)

Page 8: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 8

XML basics (2)

nested tags can be used to express various structures,

e.g., “records”:

<person>

<name> Wenfei Fan </name>

<tel> (215) 204-6485 </tel>

<email> [email protected] </email>

<email> [email protected] </email>

</person>

a list: represented by using the same tags repeatedly:

<person> … </person>

<person> … </person>

...

Page 9: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 9

XML basics (3)

XML data is ordered!

How to represent sets in XML?

How to represent an unordered pair (a, b) in XML?

Can one directly represent the following in a conventional database?

– <person> … </person>

<person> … </person> …

– <person>

<name> Wenfei Fan </name>

<tel> (215) 204-6485 </tel>

<email> [email protected] </email>

<email> [email protected] </email>

</person>

Page 10: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 10

XML element (1)

Element: the segment between an start and a corresponding end tag

subelement: the relation between an element and its component elements.

<person>

<name> Wenfei Fan </name>

<tel> (215) 204-6485 </tel>

<email> [email protected] </email>

<email> [email protected] </email>

</person>

Page 11: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 11

XML elements (2)

root element: an XML document consists of a single element called the root element, e.g.,

<db>

<person> … </person>

<person> … </person> ...

</db>

empty element: special element indicating non-textual content, e.g., <giggle></giggle> or simply <giggle/>

– sound effect (to be generated by application):

<claim> Everyone taking CIS 670 can get an A <giggle/>. However, …</claim>

– its attributes carry information:

<image img=“picture.gif” />

Page 12: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 12

XML elements (3)

mixed content: an element may contain a mixture of subelements and PCDATA:

<person>

This is my daughter

<name> Grace Fan</name>

<email> [email protected] </email>

I am not too sure of the following

<hobby> crying </hobby>

<education> Ph.D </education>

Don’t even think about it: <smoking/>

</person>

Page 13: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 13

XML attributes (1)

An start tag may contain attributes describing “properties” of the element (e.g., dimension or type)

<picture>

<height dim=“cm”> 2400</height>

<width dim=“in”> 96 </width>

<data encoding=“gif”> M05-+C$ … </data>

</picture>

References (meaningful only when a DTD is present):

<person id = “989” father=“868”>

<name> Grace Fan</name>

<email> [email protected] </email>

</person>

Page 14: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 14

XML attributes (2)

XML attributes cannot be nested

XML attributes must be unique

one can’t write <person friends=“0” friends=“1”> ...

XML attributes are not ordered

<person id = “989” friends=“110 111 112 113 114”>

<name> Grace Fan</name>

</person>

is the same as

<person friends=“110 111 112 113 114” id = “989”>

<name> Grace Fan</name>

</person>

Page 15: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 15

Attributes vs. subelements

When to use attributes?

Not always clear. A research problem.

An incomplete guideline:

attributes cannot nest (flat structure)

subelements cannot represent references

subelements are more easily displayed (without the use of complex style sheets)

subelements are often easier to read

attributes are more compact

Page 16: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 16

Other XML constructs (1)

XML declaration: version information must be provided:

<?xml version= ‘1.0’?>

comments:

<!-- This is a comment. Processors will ignore me -->

CDATA: escape blocks containing characters that would otherwise be recognized as markup:

<![CDATA[ content]]>

e.g., <![CDATA[ <start> this is not an element </end>]]>

Page 17: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 17

Other XML constructs (2)

PI (Processing Instruction): for applications, not parsers

<?instruction?>

Example: associate a CSS style sheet with XML document

<?xml:stylesheet href=“book.css” type=“text/css”?>

Example: associate an XSL style sheet with XML document

<?xml:stylesheet

href=“http://www.cis.temple.edu/~fan/book.xsl”

type=“text/xsl” ?>

Page 18: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 18

Other XML constructs (3)

Entities: macros (defined in a DTD):

<!ENTITY entity_name “entity_content”>

E.g., in 670.dtd

<!ENTITY XML-flavor “structure”>

<!ENTITY HTML-flavor “display”>

In 670.xml:

XML is about &XML-flavor while HTML about &HTML-flavor

DTD (Document Type Declaration)

<!DOCTYPE 670 PUBLIC

“http://www.cis.temple.edu/670/670.dtd”>

Page 19: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 19

A complete XML document

<?xml version= ‘1.0’?>

<!DOCTYPE book PUBLIC “~fan/book.dtd”>

<?xml:stylesheet href=“book.xsl” type=“text/xsl”?>

<books> <!-- book database --><book id = “B1” >

<title> SGML </title><author> Goldfarb </author><year> 1986 </year>

</book><book id = “B2” >

<title> XML </title></book>

</books>

Page 20: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 20

Well-formed XML documents

a document is well-formed if it satisfies two constraints (when only elements and attributes are considered):

tags have to nest properly

attributes have to be unique

There are also constraints about other constructs (e.g., entities) -- XML specification.

Very weak constraints: it does little more than ensure that XML data will parse into a labeled tree

XML is tree-like: rooted directed tree (graph) with labels on vertices!

Page 21: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 21

We are here

Introduction to XML

XML basics

DTDs XML and semistructured data

Page 22: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 22

Document Type Declarations (DTDs)

A DTD imposes structure on an XML document

The DTD is a syntactic specification (grammar)

There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems (extensions)

DTDs are optional: an XML document may not come along with a DTD

DTDs are somewhat unsatisfactory, and several proposals have been made for better schema formalisms

Page 23: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 23

A DTD

<!DOCTYPE db [

<!ELEMENT db (book*)>

<!ELEMENT book (title, authors*, section*, ref*)>

<!ATTLIST book isbn ID #required>

<!ELEMENT section (text | section)*>

<!ELEMENT ref EMPTY>

<!ATTLIST ref to IDREFS #implied>

<!ELEMENT title #PCDATA>

<!ELEMENT author #PCDATA>

<!ELEMENT text #PCDATA>

]>

Page 24: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 24

Element declarations (1)

for each element type E, a declaration of the form:

<!ELEMENT E P>

where P is a regular expression, i.e.,

P ::= EMPTY | ANY | #PCDATA | E’ |

P1, P2 | P1 | P2 | P? | P+ | P*

– E’: element type

– P1 , P2: concatenation

– P1 | P2: disjunction

– P?: optional

– P+: one or more occurrences

– P*: the Kleene closure

Page 25: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 25

Element declarations (2)

Extended context free grammar: <!ELEMENT E P>

Why is it called extended?

E.g., <!ELEMENT book (title, authors*, section*, ref*)>

single root: <!DOCTYPE db [ … ] >

subelements are ordered.

The following two are different. Why?

<!ELEMENT section (text | section)*>

<!ELEMENT section (text* | section* )>

recursive definition, e.g., section, binary tree:

<!ELEMENT node (leaf | (node, node))

<!ELEMENT leaf (#PCDATA)>

Page 26: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 26

Element declarations (3)

more on recursive DTDs

<!ELEMENT person (name, father, mother)>

<!ELEMENT father (person)>

<!ELEMENT mother (person)>

What is the problem with this? How to fix it?

– Attributes

– optional (e.g., father?, mother?)

more on ordering

How to declare E to be an unordered pair (a, b)?

<!ELEMENT E ((a, b) | (b, a)) >

Page 27: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 27

Element declarations (4)

EMPTY element:

<!ELEMENT ref EMPTY>

<!ATTLIST ref to IDREFS #implied>

observe it has attributes

ANY: may contain any content -- discouraged

<!ELEMENT generic ANY>

mixed content

<!ELEMENT section (#PCDATA | section)*>

Page 28: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 28

Element declarations (5)

global definition:

<!ELEMENT person (name, ssn)>

<!ELEMENT course (name, credit, instructor)>

The type associated with an element is unique -- only one declaration for name is allowed.

To avoid name clashes, one may use two distinct tags: e.g., personname, coursename.

namespace: define two namespaces

<MYNAMESPACE xmlns:person=“~fan/person.dtd”

xmlns:course=“~fan/course.dtd”>

<person:name> … <course:name> …

</MYNAMESPACE>

Page 29: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 29

Attribute declarations (1)

General syntax:

<!ATTLIST element_name

attribute-name attribute-type default-declaration>

example:

<!ATTLIST book

isbn ID #required>

<!ATTLIST ref

to IDREFS #implied>

Note: it is OK for several element types to define an attribute of the same name, e.g.,

<!ATTLIST person name ID #required>

<!ATTLIST pet name ID #required>

Page 30: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 30

Attribute declarations (2)

<!ATTLIST element_name

attribute-name attribute-type default-declaration>

attribute types: 10

– CDATA

– ID, IDREF, IDREFS

– ENTITY, ENTITIES

– NMTOKEN, NMTOKENS

– enumerated, notation

default declarations: 4

– #required, #implied

– “default value”, #fixed “default value”

Page 31: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 31

Specifying ID and IDREF attributes

<!ATTLIST person

id ID #required

father IDREF #implied

mother IDREF #implied

children IDREFS #implied>

e.g.,

<person id=“898” father=“332” mother=“336”

children=“982 984 986”>

….

</person>

Page 32: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 32

XML reference mechanism

ID attribute: unique within the entire document.

– An element can have at most one ID attribute.

– No default (fixed default) value is allowed.

• #required: a value must be provided

• #implied: a value is optional

IDREF attribute: its value must be some other element’s ID value in the document.

IDREFS attribute: its value is a set, each element of the set is the ID value of some other element in the document.

<person id=“898” father=“332” mother=“336”

children=“982 984 986”>

Page 33: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 33

ID vs. object identifiers in OODBs

ID is unique within the whole document, like an oid

ID is not system-generated and can be changed - different from oid and somehow like keys in relational DBs

IDREF (IDREFS) are untyped -- a big problem: you point to something, but you don’t know what it is!

<student id=“01” taking=“cis670”/>

<student id=“02” taking=“cis670 01”/>

<course id=“cis670”/>

No inverse constraints, i.e., child is inverse of parent.

This makes it difficult to translate object-oriented databases into an XML encoding.

Page 34: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 34

ID/IDREF vs. key/foreign key in RDBs

keys are unique within the same relation, while IDs are unique within the whole database

a relation may have several different keys, while an element can have at most one ID

keys can be multi-valued, while IDs must be single-valued

enroll (sid: string, cid: string, grade:string)

foreign keys are typed, while IDREF (IDREFS) is not

This makes it difficult to translate RDBs into XML.

Why is it a problem? To exchange/integrate/transform data, we need to translate legacy data into an XML encoding while preserving the original semantics of the data.

Page 35: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 35

CDATA attributes

CDATA: string

<!ATTLIST snore

volume CDATA #implied>

e.g., <snore volume=“loud” />

default value: used when no value is given

<!ATTLIST snore

volume CDATA “normal”>

e.g., <snore volume=“loud” />

<snore/>

fixed default value: fixed and may not be changed

<!ATTLIST snore

volume CDATA #fixed “normal”>

e.g., <snore/>

Page 36: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 36

Enumerated types

We specify a range and don’t want its volume out of range.

<!ATTLIST snore

volume (silent | quite | normal | loud | loudest)

“normal”>

Page 37: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 37

NOTATIONS

Notations allow documents to identify the types of content they will contain:

<!NOTATION notation-id SYSTEM “notation-type”>

e.g., <!NOTATION gif SYSTEM “image/gif”>

<!NOTATION jpg SYSTEM “image/jpg”>

In attributes:

<!ATTLIST picture

source CDATA #required

type NOTATION (gif | jpg) #required>

Page 38: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 38

NMTOKEN, NMTOKENS

NMTOKEN: name token, restricted form of strings that has the same production rule as element names. No additional constraints: it does not have to match another attribute or declaration.

<!ATTLIST picture

width NMTOKEN #required>

NMTOKENS: a list of name tokens.

Page 39: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 39

ENTITY, ENTITIES

ENTITY (attribute): must match the name of an entity

ENTITIES: multiple ENTITY values, each must be the name of an entity.

Parametric entities: their use is limited to the DTD:

syntax: <!ENTITY %entity-name “entity-content”>

use: %entity-name;

e.g., <!ENTITY %basics “RDBs | OODBs | WebDBs”>

<!ATTLIST DB-user

skills %basics>

<!ATTLIST DBA

skills %basics

others CDATA #required>

Research: sub-typing and inheritance by means of entities?

Page 40: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 40

Valid XML documents

A valid XML document must have a DTD.

The document is well-formed

It conforms to the DTD:

– elements observe the grammar (nested only in the way described by the DTD)

– elements have only the attributes specified by the DTD

– ID/IDREF attributes satisfy their constraints:

• ID must be distinct

• IDREF/IDREFS values must be existing ID values

Page 41: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 41

We are here

Introduction to XML

– XML basics

– DTDs

– XML and semistructured data

Page 42: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 42

Graph representation

XML data can be presented as a rooted node-labeled directed tree, while semistructured data is a rooted edge-labeled directed graph.

If references are not considered (without DTD), XML document is usually modeled as a tree.

References (IDREF/IDREFS) can be viewed as edges and thus lead to graphs (cyclic structure) -- XML-QL model

When IDREF/IDREFS attributes are simply treated as text data, XML data is modeled as a tree - XSL and XQL model.

This model does not take validation into account.

Research: ordered graph/tree?

Page 43: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 43

Node-labeled vs. edge-labeled (1)

Consider an ssd expression {a: {b: “string”}, a: {c: “string”}}

Edge-labeled graph:

We may encode it in XML either as (with some DTD):

<a> <b id=“&12”> string </b> </a>

<a c=“&12” />

or as <a b =“&12” />

<a> <c id=“&12”> string </c> </a>

a a

cb“string”

Page 44: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 44

Node-labeled vs. edge-labeled (2)

Node-labeled graph:

<a> <b id=“&12”> string </b> </a>

<a c=“&12” />

<a b =“&12” />

<a> <c id=“&12”> string </c> </a>

a a

c

b

“string”

a a

cb

“string”

Page 45: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 45

XML vs. semistructured data

similarities:

– both described best by graphical representation

– both are schema-less, self-describing

differences:

– XML is ordered, semistructured data is not.

– XML data is modeled as a node labeled graph (tree), semistructured data as an edge labeled graph.

– XML can mix text and elements; when translating it to a semistructured data expression, we have to add some surrounding tag for the PCDATA.

– XML has other stuffs (PIs, comments).

– XML may have (optional) DTD.

Page 46: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 46

Query languages

for semistructured data:

– Lorel

– UnQL

for XML

– XML-QL

– XQL

– XSLT (transformation language)

XML data can be treated and queried as semistructured data (e.g., Lorel).

Page 47: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 47

DTDs vs. schemas (types)

By database (or programming language) standard XML DTDs are rather weak specifications.

– Only one base type -- PCDATA.

– No useful “abstractions”, e.g., sets.

– IDREFs are not typed or scoped -- you point to something, but you don’t know what!

– No constraints (e.g., inverse) or methods.

– No sub-typing or inheritance.

XML extensions to overcome the limitations.

– Type systems: XML-Data, XML-Schema, SOX, DCD

– metadata: RDF

– constraints

Page 48: CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

CIS 670 Fall 2001 (LN 5) 48

Summary

XML is a new data exchange format. Its main virtues include widespread acceptance.

DTDs provide some useful syntactic constraints on documents. As schemas they are weak.

Research topics include:

– How will we query XML data?

– How will we navigate the Web (XLink, XPointer)?

– How will we extend XML DTDs to capture the semantics of the data?

– How will we map between XML and other representations, esp. structured databases?

– How will we compress/store XML data?