2005 http://www.cs.huji.ac.il/~dbi 1

XML

eXtensible Markup Language


Introduction and Motivation

XML vs. HTML

• HTML is a HyperText Markup Language
  – Designed for a specific application, namely, presenting and linking hypertext documents
• XML describes structure and content (“semantics”)
  – The presentation is defined separately from the structure and the content

An Address Book as an XML Document

<addresses>
  <person>
    <name> Donald Duck</name>
    <tel> 04-828-1345 </tel>
    <email> [email protected] </email>
  </person>
  <person>
    <name> Miki Mouse</name>
    <tel> 03-426-1142 </tel>
    <email>[email protected]</email>
  </person>
</addresses>

Main Features of XML

• No fixed set of tags
  – New tags can be added for new applications
• An agreed upon set of tags can be used in many applications
  – Namespaces facilitate uniform and coherent descriptions of data
    • For example, a namespace for address books determines whether to use <tel> or <phone>

Main Features of XML (cont’d)

• XML has the concept of a schema
  – DTD and the more expressive XML Schema
• XML is a data model
  – Similar to the semistructured data model
• XML supports internationalization (Unicode) and platform independence (an XML file is just a character file)

XML is Self-Describing Data

• Traditionally, a data file is just a bit stream
• Only a program that reads or writes this file has the details about
  – How to break the bit stream into records
  – How to break each record into fields
  – The type of each data field
• Over the years, companies retained valuable data (e.g., on magnetic tapes), but lost the programs that have the above information
  – As a result, the data was practically lost
• This cannot happen with XML data, because the tags that describe the structure are stored with the data

XML is the Standard for Data Exchange

• Web services (e.g., e-commerce) require exchanging data between various applications that run on different platforms
• XML (augmented with namespaces) is the preferred syntax for data exchange on the Web

XML is not Alone

• XML Schemas strengthen the data-modeling capabilities of XML (in comparison to XML with only DTDs)
• XPath is a language for accessing parts of XML documents
• XLink and XPointer support cross-references
• XSLT is a language for transforming XML documents into other XML documents (including XHTML, for displaying XML files)
  – Limited styling of XML can be done with CSS alone
• XQuery is a language for querying XML documents

The Two Facets of XML

• Some XML files are just text documents with tags that denote their structure and include some metadata (e.g., an attribute that gives the name of the person who did the proofreading)
  – See an example on the next slide
  – XML is a subset of SGML (Standard Generalized Markup Language)
• Other XML documents are similar to database files (e.g., an address book)

XML can Describe the Structure of a Document

<paper>
  <title> Complexity of Computations </title>
  <author>
    <name> M. O. Rabin</name>
    <institute> Hebrew University </institute>
  </author>
  <abstract> … </abstract>
  <section> … </section>
  <section> … </section>
  <references> … </references>
</paper>


XML Syntax

W3Schools Resources on XML Syntax

The Structure of XML

• XML consists of tags and text
• Tags come in pairs <date> ... </date>
• They must be properly nested
  – good <date> ... <day> ... </day> ... </date>
  – bad <date> ... <day> ... </date> ... </day>

(You can’t do <i> ... <b> ... </i> ... </b> in HTML)

A Useful Abbreviation

Abbreviating elements with empty contents:
• <br/> for <br></br>
• <hr width="10"/> for <hr width="10"></hr>

For example:

<family>
  <person id="lisa">
    <name> Lisa Simpson </name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  ...
</family>

Note that a tag may have a set of attributes, each consisting of a name and a value
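
As a rough check of the abbreviation, the following Python sketch parses both spellings with the standard-library xml.etree.ElementTree module and compares the results. The variable names are illustrative only; the markup is taken from the slide.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<hr width="10"/>')
long_form = ET.fromstring('<hr width="10"></hr>')

# Both spellings yield an element with the same tag and attributes.
same_tag = short_form.tag == long_form.tag
same_attributes = short_form.attrib == long_form.attrib
# Neither spelling yields any sub-elements.
no_children = len(short_form) == 0 and len(long_form) == 0

# Attributes are name/value pairs, exposed here as a dictionary.
lisa = ET.fromstring('<person id="lisa"><mother idref="marge"/></person>')
mother_ref = lisa.find("mother").get("idref")
```

A parser therefore sees no difference between the two forms; the abbreviation is purely a convenience for the writer.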

XML Text

XML has only one “basic” type – text

It is bounded by tags, e.g., <title> The Big Sleep </title> <year> 1935 </year> – 1935 is still text

• XML text is called PCDATA (for parsed character data)
• It uses Unicode; any character can be written as a character reference, e.g., &#x05DE; (decimal &#1502;) for the Hebrew letter Mem
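
A small Python sketch of both points, using the standard-library xml.etree.ElementTree module. It assumes that the Hebrew letter Mem is the Unicode code point U+05DE (decimal 1502); the element name letter is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# &#x05DE; (hexadecimal) and &#1502; (decimal) name the same code point.
hex_ref = ET.fromstring("<letter>&#x05DE;</letter>").text
dec_ref = ET.fromstring("<letter>&#1502;</letter>").text

# Element content is always text; converting it is up to the application.
year_text = ET.fromstring("<year> 1935 </year>").text
year_number = int(year_text.strip())
```

The parser hands back the string " 1935 ", including the surrounding spaces; treating it as a number is a decision made by the program, not by XML.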

XML Structure

• Nesting tags can be used to express various structures, e.g., a tuple (record):

<person>
  <name> Lisa Simpson</name>
  <tel> 02-828-1234 </tel>
  <tel> 054-470-777 </tel>
  <email> [email protected] </email>
</person>

XML Structure (cont’d)

• We can represent a list by using the same tag repeatedly:

<addresses>
  <person> … </person>
  <person> … </person>
  <person> … </person>
  <person> … </person>
  …
</addresses>

XML Structure (cont’d)

<addresses>
  <person>
    <name> Donald Duck</name>
    <tel> 04-828-1345 </tel>
    <email> [email protected] </email>
  </person>
  <person>
    <name> Miki Mouse</name>
    <tel> 03-426-1142 </tel>
    <email>[email protected]</email>
  </person>
</addresses>

Terminology

The segment of an XML document between an opening and a corresponding closing tag is called an element

<person>
  <name> Bart Simpson </name>
  <tel> 02 – 444 7777 </tel>
  <tel> 051 – 011 022 </tel>
  <email> [email protected] </email>
</person>

[Figure callouts on the example: “element”; “element, a sub-element of”; “not an element”]

An XML Document is a Tree

[Figure: a tree with root person and children name, tel, tel and email, whose leaves are the text values Bart Simpson, 02 – 444 7777, 051 – 011 022 and [email protected]]

Note that semistructured data models typically put the labels on the edges, and are arbitrary graphs and not just trees

Leaves are either empty or contain PCDATA
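
The tree view can be made concrete with a short Python sketch based on the standard-library xml.etree.ElementTree module. The function outline is a hypothetical helper, and the e-mail address is a placeholder, since the original one is not legible in the source.

```python
import xml.etree.ElementTree as ET

doc = """
<person>
  <name>Bart Simpson</name>
  <tel>02 - 444 7777</tel>
  <tel>051 - 011 022</tel>
  <email>bart@example.org</email>
</person>
"""

def outline(element, depth=0):
    """Return one line per node: the tag, then its text as a child leaf."""
    lines = ["  " * depth + element.tag]
    text = (element.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one edge of the tree, with the quoted strings as PCDATA leaves.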

Mixed Content

An element may contain a mixture of sub-elements and PCDATA

<airline>
  <name> British Airways </name>
  <motto> World’s <dubious> favorite</dubious> airline </motto>
</airline>

• How many leaves are there in the corresponding tree?
• How many leaves are empty?
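
One way to explore mixed content is to look at how a parser exposes it. The sketch below uses Python's standard-library xml.etree.ElementTree, which has no separate text nodes: text before the first sub-element is stored in .text, and text that follows a sub-element is stored in that sub-element's .tail. Whitespace in the example is trimmed to keep the values easy to read.

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto>World's <dubious>favorite</dubious> airline</motto>"
)
dubious = motto.find("dubious")

text_before = motto.text      # text before the first sub-element
text_inside = dubious.text    # text inside <dubious>
text_after = dubious.tail     # text after </dubious>, still inside <motto>
```

Other APIs (e.g., DOM) model the same content as explicit text nodes, so the count of leaves depends on the tree model in use.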

The Header Tag

• <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  – The attributes must appear in this order; standalone is either "yes" or "no"
  – standalone="no" means that there is an external DTD
  – You can leave out the encoding attribute and the processor will use the UTF-8 default

Processing Instructions

<?xml version="1.0"?>
<?xml-stylesheet href="doc.xsl" type="text/xsl"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc>Hello, world!<!-- Comment 1 --></doc>
<?pi-without-data?>
<!-- Comment 2 -->
<!-- Comment 3 -->

Using CDATA

<HEAD1>
  Entering a Kennel Club Member
</HEAD1>
<DESCRIPTION>
  Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this:
</DESCRIPTION>
<EXAMPLE>
  <![CDATA[<NAME common="freddy" breed="springer-spaniel">Sir Fredrick of Ledyard's End</NAME>]]>
</EXAMPLE>

We want to see the text as is, even though it includes tags
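
The effect of a CDATA section can be observed with a short Python sketch (standard-library xml.etree.ElementTree; the variable names are illustrative). The markup inside the section is delivered as ordinary text rather than being parsed.

```python
import xml.etree.ElementTree as ET

example = ET.fromstring(
    '<EXAMPLE><![CDATA[<NAME common="freddy" breed="springer-spaniel">'
    "Sir Fredrick of Ledyard's End</NAME>]]></EXAMPLE>"
)

# The CDATA section arrives as plain text; no NAME element is created.
literal_text = example.text
child_count = len(example)
```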

A Complete XML Document

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE addresses SYSTEM "http://www.cs.huji.ac.il/~dbi/dbi-addresses.dtd">
<addresses>
  <person>
    <name>Lisa Simpson</name>
    <tel> 02-828-1234 </tel>
    <tel> 054-470-777 </tel>
    <email> [email protected] </email>
  </person>
</addresses>

Well-Formed XML Documents

• An XML document (with or without a DTD) is well-formed if
  – Tags are syntactically correct
  – Every tag has an end tag
  – Tags are properly nested
  – There is a root tag
  – A start tag does not have two occurrences of the same attribute

An XML document must be well formed
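
The rules above can be tried out with any XML parser, since a conforming parser rejects documents that are not well-formed. A minimal Python sketch, assuming the standard-library xml.etree.ElementTree module; the function name and the tiny test documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# One small document per rule; the names a, b and x are arbitrary.
violations = {
    "missing end tag": "<a><b></a>",
    "improper nesting": "<a><b></a></b>",
    "more than one root": "<a/><b/>",
    "duplicate attribute": '<a x="1" x="2"/>',
}
results = {rule: is_well_formed(doc) for rule, doc in violations.items()}
```

Every entry in results is expected to be False, whereas a document such as <a><b/></a> passes the check.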

DTD (Document Type Definition)

Imposing Structure on XML Documents

(W3Schools on DTDs)

Motivation

• A DTD adds syntactical requirements in addition to the well-formed requirement
• It helps in eliminating errors when creating or editing XML documents
• It clarifies the intended semantics
• It simplifies the processing of XML documents

An Example

• In an address book, where can a phone number appear?
  – Under <person>, under <name> or under both?
• If we have to check for all possibilities, processing takes longer and it may not be clear to whom a phone belongs
  – We would like to know that a phone number is allowed to appear under both a department and the manager of that department
  – If we don’t know that and there is only one phone number, we may not know whether it serves both the department and its manager or just one of them


Document Type Definitions

• Document Type Definitions (DTDs) impose structure on XML documents

• There is some relationship between a DTD and a schema, but it is not close – hence the need for additional “typing” systems (XML schemas)

• The DTD is a syntactic specification

Example: An Address Book

<person>
  <name> Homer Simpson </name>
  <greet> Dr. H. Simpson </greet>
  <addr>1234 Springwater Road </addr>
  <addr> Springfield USA, 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2544 </fax>
  <tel> (321) 786 2544 </tel>
  <email> [email protected] </email>
</person>

• Exactly one name
• At most one greeting
• As many address lines as needed (in order)
• Mixed telephones and faxes
• As many email elements as needed

Specifying the Structure

• name to specify a name element
• greet? to specify an optional (0 or 1) greet element
• name, greet? to specify a name followed by an optional greet


Specifying the Structure (cont’d)

• addr* to specify 0 or more address lines

• tel | fax a tel or a fax element

• (tel | fax)* 0 or more repeats of tel or fax

• email* 0 or more email elements


Specifying the Structure (cont’d)

• So the whole structure of a person entry is specified by

name, greet?, addr*, (tel | fax)*, email*

• This is known as a regular expression

• Why is it important?

Summary of Regular Expressions

• A The tag (i.e., element) A occurs
• e1,e2 The expression e1 followed by e2
• e* 0 or more occurrences of e
• e? Optional: 0 or 1 occurrences
• e+ 1 or more occurrences
• e1 | e2 either e1 or e2
• (e) grouping
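
Because content models are regular expressions over element names, they can be mapped onto an ordinary regular-expression engine. The Python sketch below is one possible encoding, not the way a real validator works: each child is written as its tag name followed by a space, and the content model is rewritten into a pattern over such strings. The function names are hypothetical, and the sketch covers only element content (no #PCDATA, EMPTY or ANY).

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model):
    """Translate a DTD-style content model into a Python regex.

    Each child is encoded as its tag name followed by one space, so the
    model is matched against a string such as "name tel fax ".
    """
    pattern = re.sub(r"\s+", "", model)                 # drop whitespace
    pattern = re.sub(r"[A-Za-z_][\w.-]*",               # wrap each name
                     lambda m: "(?:" + m.group(0) + " )", pattern)
    pattern = pattern.replace(",", "")                  # sequence = concatenation
    return re.compile(pattern)

def children_match(element, regex):
    """Check the sequence of child tags of `element` against the model."""
    encoded = "".join(child.tag + " " for child in element)
    return regex.fullmatch(encoded) is not None

person_model = content_model_to_regex("name, greet?, addr*, (tel | fax)*, email*")

ok = ET.fromstring(
    "<person><name/><addr/><addr/><tel/><fax/><tel/><email/></person>"
)
bad = ET.fromstring("<person><greet/><name/></person>")
```

Here children_match(ok, person_model) holds, while bad is rejected because greet appears before name. The operators ?, *, +, | and parentheses carry over unchanged, which is why the translation is so short.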

The Definition of an Element Consists of Exactly One of the Following

• A regular expression (as defined earlier)
• EMPTY means that the element has no content
• ANY means that content can be any mixture of PCDATA and elements defined in the DTD
• Mixed content which is defined as described on the next slide
• (#PCDATA)

The Definition of Mixed Content

• Mixed content is described by a repeatable OR group (#PCDATA | element-name | …)*
  – Inside the group, no regular expressions – just element names
  – #PCDATA must be first, followed by 0 or more element names, separated by |
  – The group can be repeated 0 or more times

An Address-Book XML Document with an Internal DTD

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>

• The name of the DTD is addressbook
• “Internal” means that the DTD and the XML document are in the same file
• The syntax of a DTD is not XML syntax

The Rest of the Address-Book XML Document

<addressbook>
  <person>
    <name> Jeff Cohen </name>
    <greet> Dr. Cohen </greet>
    <email> [email protected] </email>
  </person>
</addressbook>

Regular Expressions

• Each regular expression determines a corresponding finite-state automaton
• Let’s start with a simpler example:

name, addr*, email

[Figure: a finite-state automaton with transitions labeled name, addr and email]

This suggests a simple parsing program

A double circle denotes an accepting state
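
Such a parsing program might look like the following Python sketch for name, addr*, email. The state numbering is an assumption made here, since the original figure is not available: 0 is the start state, 1 is reached after name (and loops on addr), and 2 is the accepting state reached after email.

```python
# States: 0 = start, 1 = after name (addr may repeat), 2 = after email.
TRANSITIONS = {
    (0, "name"): 1,
    (1, "addr"): 1,
    (1, "email"): 2,
}
ACCEPTING = {2}

def accepts(tags):
    """Run the automaton over a sequence of child tag names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:       # no transition: reject immediately
            return False
    return state in ACCEPTING
```

For example, accepts(["name", "addr", "addr", "email"]) is True, while accepts(["name", "addr"]) is False because the run ends in a non-accepting state. The parser reads each sub-element once and never backtracks.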

Another Example

name, address*, (tel | fax)*, email*

[Figure: a finite-state automaton with transitions labeled name, address, tel, fax and email]

Adding in the optional greet further complicates things

Deterministic Requirement

• If element-type declarations are deterministic, parsing is easier: the next sub-element alone determines the next step, with no lookahead
• Formally, the Glushkov automaton is deterministic
• The states of this automaton are the positions of the regular expression (semantic actions)
• The transitions are based on the “follows set”

Deterministic Requirement (cont’d)

• The associated automata are succinct
• A regular language may not have an associated deterministic grammar, e.g.,

<!ELEMENT ndeter ((movie|director)*,movie,(movie|director))>

Some Things are Hard to Specify

Each employee element should contain name, age and ssn elements in some order

<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>

Suppose that there were many more fields!

Some Things are Hard to Specify (cont’d)

<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>

Suppose there were many more fields!

There are n! different orders of n elements

The size of the declaration is not even polynomial in n
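
The growth can be seen by generating the declaration mechanically. The Python sketch below builds the content model that allows a list of element names in every order; any_order_model is a hypothetical helper name.

```python
from itertools import permutations
from math import factorial

def any_order_model(names):
    """Build the DTD content model that allows `names` in every order."""
    groups = ["(" + ", ".join(order) + ")" for order in permutations(names)]
    return "( " + " | ".join(groups) + " )"

model = any_order_model(["name", "age", "ssn"])
alternatives = model.count("|") + 1
```

With three fields the model has 3! = 6 alternatives; with ten fields it would have 10! = 3,628,800.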

Specifying Attributes in the DTD

<!ELEMENT height (#PCDATA)>
<!ATTLIST height
  dimension CDATA #REQUIRED
  accuracy  CDATA #IMPLIED>

• The dimension attribute is required
• The accuracy attribute is optional

CDATA is the “type” of the attribute – it means “character data,” and may take any literal string as a value


The Format of an Attribute Definition

• <!ATTLIST element-name attr-name attr-type default-value>

• The default value is given inside quotes

Summary of Attribute Types

• CDATA
• (value | … | … ) is an enumeration of allowed values
• ID, IDREF, IDREFS
  – to be explained later
• ENTITY, ENTITIES
  – to be explained later
• NMTOKEN, NMTOKENS, NOTATION

Summary of Attribute Default Values

• #REQUIRED means that the attribute must be included in the element
• #IMPLIED means that the attribute is optional
• #FIXED “value”
  – The given value (inside quotes) is the only possible one
• “value”
  – The default value of the attribute if none is given
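
The four cases can be sketched in Python as a hand-written check. This is only an illustration: the standard-library xml.etree.ElementTree module does not validate against a DTD, so the declarations are kept in a hypothetical dictionary. The attributes dimension and accuracy follow the height example; system and source are invented to show the #FIXED and default cases.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of an <!ATTLIST ...> declaration:
# attribute name -> (kind, value), where kind mirrors the four cases above.
HEIGHT_ATTRIBUTES = {
    "dimension": ("#REQUIRED", None),
    "accuracy": ("#IMPLIED", None),
    "system": ("#FIXED", "metric"),
    "source": ("default", "measured"),
}

def apply_attribute_rules(element, declared):
    """Return the effective attributes, or raise ValueError on a violation."""
    effective = dict(element.attrib)
    for name, (kind, value) in declared.items():
        present = name in effective
        if kind == "#REQUIRED" and not present:
            raise ValueError(f"missing required attribute {name!r}")
        if kind == "#FIXED" and present and effective[name] != value:
            raise ValueError(f"attribute {name!r} must be {value!r}")
        if kind in ("#FIXED", "default") and not present:
            effective[name] = value      # the declared value is supplied
    return effective

height = ET.fromstring('<height dimension="cm"> 178 </height>')
effective = apply_attribute_rules(height, HEIGHT_ATTRIBUTES)
```

In this run the optional accuracy attribute stays absent, while system and source receive their declared values.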

Recursive DTDs

<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (
      name,
      dateOfBirth,
      person,     -- mother
      person )>   -- father
  ...
]>

What is the problem with this? A parser does not notice it!

Each person should have a father and a mother. This leads to either infinite data or a person that is a descendant of herself.

Recursive DTDs (cont’d)

<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (
      name,
      dateOfBirth,
      person?,     -- mother
      person? )>   -- father
  ...
]>

What is now the problem with this?

If a person only has a father, how can you tell that he has a father and does not have a mother?

Using ID and IDREF Attributes

<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person
    id       ID     #REQUIRED
    mother   IDREF  #IMPLIED
    father   IDREF  #IMPLIED
    children IDREFS #IMPLIED>
]>

IDs and IDREFs

• ID stands for identifier
  – No two ID attributes may have the same value (the value must be an XML name)
• IDREF stands for identifier reference
  – Every value associated with an IDREF attribute must exist as an ID attribute value
• IDREFS specifies several (0 or more) identifier references

Some Conforming Data

<family>
  <person id="lisa" mother="marge" father="homer">
    <name> Lisa Simpson </name>
  </person>
  <person id="bart" mother="marge" father="homer">
    <name> Bart Simpson </name>
  </person>
  <person id="marge" children="bart lisa">
    <name> Marge Simpson </name>
  </person>
  <person id="homer" children="bart lisa">
    <name> Homer Simpson </name>
  </person>
</family>


ID References do not Have Types

• The attributes mother and father are references to IDs of other elements

• However, those are not necessarily person elements!

• The mother attribute is not necessarily a reference to a female person

An Alternative Specification

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name, mother?, father?, children?)>
  <!ATTLIST person id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT mother EMPTY>
  <!ATTLIST mother idref IDREF #REQUIRED>
  <!ELEMENT father EMPTY>
  <!ATTLIST father idref IDREF #REQUIRED>
  <!ELEMENT children EMPTY>
  <!ATTLIST children idrefs IDREFS #REQUIRED>
]>

The Revised Data

<family>
  <person id="marge">
    <name> Marge Simpson </name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="homer">
    <name> Homer Simpson </name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="bart">
    <name> Bart Simpson </name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  <person id="lisa">
    <name> Lisa Simpson </name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
</family>

Consistency of ID and IDREF Attribute Values

• If an attribute is declared as ID
  – The associated value must be distinct, i.e., different elements (in the given document) must have different values for the ID attribute (no confusion)
    • Even if the two elements have different element names
• If an attribute is declared as IDREF
  – The associated value must exist as the value of some ID attribute (no dangling “pointers”)
• Similarly for all the values of an IDREFS attribute
• ID, IDREF and IDREFS attributes are not typed
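
The two consistency rules can be checked with a short Python sketch. It is a simplification under stated assumptions: the standard-library xml.etree.ElementTree module does not read DTD declarations, so the attribute types are listed by hand, and they are keyed by attribute name only rather than by element and attribute. The function name check_references is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute types as they would be declared in the DTD (assumed here,
# because xml.etree.ElementTree does not read DTD declarations).
ID_ATTRS = {"id"}
IDREF_ATTRS = {"mother", "father"}
IDREFS_ATTRS = {"children"}

def check_references(root):
    """Return (duplicate_ids, dangling_refs) for the whole document."""
    seen, duplicates, referenced = set(), set(), set()
    for element in root.iter():
        for name, value in element.attrib.items():
            if name in ID_ATTRS:
                if value in seen:
                    duplicates.add(value)
                seen.add(value)
            elif name in IDREF_ATTRS:
                referenced.add(value)
            elif name in IDREFS_ATTRS:
                referenced.update(value.split())
    return duplicates, referenced - seen

family = ET.fromstring("""
<family>
  <person id="lisa" mother="marge" father="homer"/>
  <person id="marge" children="bart lisa"/>
</family>
""")
duplicates, dangling = check_references(family)
```

In this deliberately incomplete family, no ID is duplicated, but homer and bart are dangling references because no element carries those IDs. Note that the check never looks at element names, which mirrors the point that references are untyped.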

Adding a DTD to the Document

• A DTD can be internal
  – The DTD is part of the document file
• or external
  – The DTD and the document are in separate files
  – An external DTD may reside
    • In the local file system (where the document is)
    • In a remote file system

Connecting a Document with its DTD

• An internal DTD:
  <?xml version="1.0"?>
  <!DOCTYPE db [<!ELEMENT ...> … ]>
  <db> ... </db>
• A DTD from the local file system:
  <!DOCTYPE db SYSTEM "schema.dtd">
• A DTD from a remote file system:
  <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">

Well-Formed XML Documents

• An XML document (with or without a DTD) is well-formed if
  – Tags are syntactically correct
  – Every tag has an end tag
  – Tags are properly nested
  – There is a root tag
  – A start tag does not have two occurrences of the same attribute

An XML document must be well formed

Valid Documents

• A well-formed XML document is valid if it conforms to its DTD, that is,
  – The document conforms to the regular-expression grammar,
  – The types of attributes are correct, and
  – The constraints on references are satisfied

DTDs are CFGs (Context-Free Grammars)

• Checking validity and parsing a document according to a DTD is in polynomial time, using a dynamic-programming algorithm
  – A <lecturer> element has the same rules regardless of whether it is under a <course> element or a <seminar> element
• Note that XML Schemas are capable of describing context-sensitive structures
  – The complexity is higher


XML Schemas

W3Schools on XML Schemas

DTDs vs. Schemas (or Types)

• DTDs are rather weak specifications by DB & programming-language standards
  – Only one base type – PCDATA
  – No useful “abstractions”, e.g., sets
  – IDREFs are untyped – the type of the object being referenced is not known
  – No constraints, e.g., child is inverse of parent
  – No methods
  – Tag definitions are global
• Some extensions of XML impose a schema or types on an XML document