195
1 e-Science e-Business e-Government and their Technologies Core XML Bryan Carpenter, Geoffrey Fox, Marlon Pierce Pervasive Technology Laboratories Indiana University Bloomington IN 47404 January 12 2004 [email protected] [email protected] [email protected] http://www.grid2004.org/spring2004

1 e-Science e-Business e-Government and their Technologies Core XML Bryan Carpenter, Geoffrey Fox, Marlon Pierce Pervasive Technology Laboratories Indiana

Embed Size (px)

Citation preview

11

e-Science e-Business e-Government and their

TechnologiesCore XML

Bryan Carpenter, Geoffrey Fox, Marlon PiercePervasive Technology Laboratories

Indiana University Bloomington IN 47404January 12 2004

[email protected]@indiana.edu

[email protected] http://www.grid2004.org/spring2004

22

What are we doing This is a semester-long course on Grids (viewed as technologies

and infrastructure) and the application – mainly to science but also to business and government

We will assume a basic knowledge of the Java language and then interweave 6 topic areas – first four cover technologies that will be used by students

1) Advanced Java: including networking, Java Server Pages and perhaps servlets

2) XML: Specification, Tools, Linkage to Java 3) Web Services: Basic Ideas, WSDL, Axis and Tomcat 4) Grid Systems: GT3/Cogkit, Gateway, XSOAP, Portlet 5) Advanced Technology Discussions: CORBA as history,

OGSA-DAI, security, Semantic Grid, Workflow 6) Applications: Bioinformatics, Particle Physics, Engineering,

Crises, Computing-on-demand Grid, Earth Science

33

Contents of this Lecture Set Intro: HTML and XML and Unicode Core XML:

• XML syntax and well-formedness, DTDs and validity.• XML namespaces.• The XML DOM with linkage to Java.• XPath basics.

XML Schema.• Validation for data-centric applications.

Later lectures may include additional information on:• XHTML, SVG, RDF.• XML style languages: XSLT and CSS.• XML Databases (Xindice, Sleepycat).• Search: advanced XPath, XQuery.

44

Motivations for XML: a Better HTML?

Limitations of HTML:• Extensibility: HTML does not allow users to specify

their own tags or attributes in order to parameterize or otherwise semantically qualify their data.

• Structure: HTML does not support the specification of deep structures needed to represent database schema or object-oriented hierarchies.

• Validation: HTML does not support the kind of language specification that allows applications to check data for structural validity when it is imported.

55

XML in the HTML world XML = eXtensible Markup Language.

• XML is a subset of SGML—Standard Generalized Markup Language, but XML is specifically designed for the web.

Specification by W3C: http://www.w3.org/XML and lots of links like http://www.xml.org

XML 1.0 in February 98.• XML 1.1 became a W3C recommendation 4 Feb, 2004!

How XML fits into the new HTML world:• XML describes the logical structure of the document.

• CSS (Cascading Style Sheets) and/or XSL describes the visual presentation of the document.

• DOM (Document Object Model) allows scripting languages like JavaScript to access and dynamically change document objects.

66

Logical vs. Visual Design Logical design of a document (content) should be

separate from its visual design (presentation).• Promotes sound typography.• Encourages better writing.• Is more flexible.• Allows the same “knowledge/information” (defined

in XML) to be displayed on PC’s, PDA’s, Braille devices etc.

XML used to define the logical design, with XSL (Extensible Style Language) or other mechanism used to define the visual layout (e.g. by mapping XML into HTML).

77

XML Design Goals1. XML shall be usable over the Internet.2. XML shall support a variety of applications.3. XML shall be compatible with SGML.4. It shall be easy to write programs that process XML

documents.5. Optional features in XML shall be kept to the absolute

minimum, ideally zero.6. XML documents should be human-legible and

reasonably clear.7. Design of XML should be prepared quickly.8. Design of XML shall be formal and concise.9. XML documents shall be easy to create.10. Terseness in XML markup is of minimal importance.

88

Document-Centric or Data-Centric? Roots of XML in document markup (HTML-like). In practice use of XML as a data format has become at

least as pervasive. Examples:• Use of XML format in configuration and deployment files of

EJB, Tomcat, …

• Uses of XML as a format for message exchange (e.g. SOAP, BEEP).

There is also an important intermediate case—XML as program text for machine interpretation. E.g.:• XSLT declarative transformation language.

• WSDL interface definition language for Web services.

• BPEL Web services workflow language.

99

Features of XML Documents are stored in plain text and thus can

be transferred and processed anywhere. Unifying principles make it easily acceptable:

• “Everything is a tree” (DOM).• Unicode for different languages.

1010

XML and Unicode All XML documents must be written using the Unicode

character set. Unicode is also the character set used by Java, C#,

ECMAScript, …, so we should know something about it.

1111

Special Topic: Unicode

1212

Unicode Unicode (http://www.unicode.org) is an international

standard character set that covers alphabets of all the World’s common written languages.• Eventually it should cover all languages, living and dead.

• Unicode helps make the Web truly “worldwide”?! Unlike, say, ASCII, which allows for 128 characters,

Unicode has space for over 1,000,000, of which around 96,000 are currently allocated.

Unicode itself assigns a unique sequence number (code point) to any character, regardless its alphabet.• Three Unicode encoding forms map these code points to

sequences of fixed size units—UTF-8, UTF-16, UTF-32.

1313

Unicode Code Points A Unicode code point is a numeric value between 0 and

10FFFF16, commonly denoted in one of the formats:

U+XXXX

U+XXXXX

U+10XXXX

where X is a hexadecimal digit. There are a total of 1,114,112 (= 17 · 164) code points, but most of

the World’s common characters are encoded in the first 65,536 points—the Basic Multilingual Plane (BMP).• 2048 code points in BMP are disallowed because their values have a

special role in UTF-16 encoding.

For each assigned character code, the Unicode standard defines a name, and “semantic” properties like case, directionality, ...

1414

Planes The space of 17 · 216 Unicode code points is

conventionally divided into 17 planes of 216 points each. Currently used planes include:

Plane # Plane Name Code Range

0 Basic Multilingual Plane 000016..FFFF16

1 Supplementary Multilingual Plane 1000016..1FFFF16

2 Supplementary Ideographic Plane 2000016..2FFFF16

• Note early versions of Unicode used a strict 16-bit encoding, and essentially contained just BMP

1515

Unicode Allocation

Layout of planes:

1616

Blocks Planes are subdivided into blocks. Blocks have variable size. Each block contains the

characters of one alphabet or a group of related alphabets.

The following slides are a random sampling of the blocks in BMP.• I have put 128 code points on each slide, but this is just what

would fit… no general significance to pages of size 128. For all blocks in the current Unicode standard see:

http://www.unicode.org/charts/

1717

“Basic Latin” (a.k.a. ASCII)

1818

“Latin 1” (supplement)

1919

“Greek and Coptic”

2020

“Arabic” (1 of 2)

2121

“Devanagari”

2222

“Hangul Jamo” (1 of 2)

2323

“CJK Unified Ideographs” (1 of 164)

2424

Unicode Allocation

Layout of Basic MultilingualPlane:

2525

Unicode Allocation

Layout of Plane 1:

2626

Encoding Forms In electronic documents or computer programs the

space of Unicode code points is normally broken down into a sequence of units, each unit having a convenient, fixed number of bits.

The Unicode standard defines 3 encoding forms. The most straightforward is UTF-32, in which the units

have size 32 bits. This unit is easily large enough to hold the integer value

of a single code point, so UTF-32 encoding is “obvious”. But for nearly all documents, UTF-32 wastes at least

half the available storage space.• Also, most programming languages work with 8 bit or 16 bit

character units.

2727

UTF-16 The UTF-16 encoding form breaks Unicode characters

into 16 bit units.• Java, for example, uses UTF-16 for chars and Strings.

One 16 bit unit is not large enough to represent all possible Unicode code points.

Code points higher than 216-1 are split over two consecutive units.• These are called surrogate pairs. The leading unit is a high-

surrogate unit; trailing is a low-surrogate unit.• There are 1024 code points reserved in the BMP for high

surrogates, and 1024 more reserved for low surrogates. This allows for 1024 · 1024 surrogate pairs representing code points

higher than 216-1, while ensuring a legal BMP code point can always be represented in a single unit, and such a unit can never be confused with a surrogate unit.

2828

UTF-16 Bit Distribution

2929

UTF-8 The UTF-8 encoding form breaks Unicode characters

into 8 bit units (i.e., individual bytes). UTF-8 is a variable-width encoding with the following

properties:• Any Unicode code point maps to 1, 2, 3, or 4 bytes.• Byte sub-sequences for individual characters can always be

recognized by local search in the encoded string.• The Basic Latin block coding points (U+0000..U+007F) map

to one byte, identical to their ASCII value.• All code points in the BMP map to at most 3 bytes.• For European texts UTF-8 will normally use 8 or 16 bits per

character (vs 16 bits for UTF-16).• For East Asian texts UTF-8 will normally use 24 bits per

character (vs 16 bits for UTF-16).

3030

UTF-8 Bit Distribution

3131

Encoding Schemes The 3 encoding forms don’t quite complete the encoding

schemes of Unicode, because they don’t address the endianness with which the UTF-32, UTF-16 numeric unit values are rendered to bytes (byte-serialization).

To allow applications to distinguish the endianness of a given document instance, Unicode allows a Byte Order Mark (BOM) as the first character of a document.• BOM is a code point (U+FEFF) for which the byte-reversed

unit value doesn’t correspond to a legal code point, so serves to determine the actual byte order.

3232

The Seven Unicode Encoding Schemes

3333

Unicode Summary Unicode is a large and important standard that is a

foundation for XML, HTML, etc. Although you are unlikely to manipulate the encodings

yourself, you should be aware of the pros and cons of UTF-16, UTF-8.• UTF-8 is backwards compatible with ASCII—Basic Latin

texts can be read by legacy applications.

• UTF-16 is better-suited for internationalization. It is the internal representation used by Java, C#, ECMAScript, …

3434

Core XML I:The XML Specification

3535

Introduction In this section we will describe core XML, as

defined by the XML specification document from W3C.

XML is a format for documents—originally documents for the Web—but its scope is wider than that.

XML is a subset of SGML—Standard Generalized Markup Language. Some features of XML exist simply for compatibility with SGML.

XML can also be viewed as a kind of generalization of HTML—presumably familiar from the Web.

3636

XML Parsers and Applications For purposes of this section an application is any

program that reads data from an XML document. Applications normally do not (and probably should

not) read the text of XML documents directly. The XML specification assumes that this text is initially

processed by a piece of software called an XML processor. We will also refer to this as an XML parser.

The parser exhaustively checks that the text is in a legal XML form, then extracts the essential data from the document, and hands that data to the application.

3737

Reading XML Data

<?xml version="1.0"?>

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN"

"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">

<svg width="500" height="500">

<g transform='rotate(45)'>

<circle cx='150' cy='50' r='25'/>

<text x='125' y='100'>A Circle</text>

</g>

</svg>

svg width500

height500

g transformrotate(45)

circle cx150

cy50

r25

text x125

y100

A Circle

XML Source

Parsed XML Data

Application

XML Parser

3838

Well-formed Documents An XML document follows a strict syntax. For

example:• An XML document contains regions of text called elements,

delimited by matching start-tags and end-tags. Elements must be correctly nested.

• Start-tags may include attribute specifications, where attribute values are strings delimited by matching quote marks.

A document that obeys the full set of these rules is called well-formed.

Every legal XML document must be well-formed, otherwise it cannot be parsed.

3939

Examples Well-formed:

<html> <body style="font-style: italic"> This is a well-formed document. </body></html>

Not well-formed:<html> <body style=font-style: italic> This is not a well-formed document. </html></body>

• The style attribute value is not in quote marks, and the html and body tags don’t nest correctly.

4040

Install Xerces The Xerces parser is a product of the Apache XML

project, http://xml.apache.org. Follow the “Xerces Java 2” project link and go to the

download area, then to the master distribution directory or a mirror directory.

Download Xerces-J-bin.2.6.2.zip, and extract it to a suitable place, e.g. C:\• When extracting, remember to select “Use folder names”!!

• This should create a folder called, e.g., C:\xerces-2_6_2\.

4141

Put Xerces on your Class Path Using the menu at

Control Panel→System→Advanced→Environment Variables

add the jar files xercesSamples.jar, xercesImpl.jar, and xml-apis.jar, to you class path.• E.g. append

…;C:\xerces2_6_2\xercesSamples.jar;C:\xerces2_6_2\xercesImpl.jar;C:\xerces-2_6_2\xml-apis.jar

to the value of your CLASSPATH variable.

4242

Example Using Xerces Copy the two HTML examples given above to files

called, say, wellformed.html and illformed.html. Then, in a new Command Prompt window, try running the commands: > java dom.Writer wellformed.html … > java dom.Writer illformed.html …• The first command should just echo the document. The

second should print a syntax error message.• dom.Writer is one of the sample applications in the Xerces

release. It simply uses the Xerces parser to convert the source file to a tree data structure (DOM), then converts the tree back to nicely formatted XML, which it prints.

4343

“Rolling Your Own” Parser? People approaching XML sometimes decide they can

write their own “lightweight” parser that handles just the bit of XML their application needs.• In general this is a bad idea!

• We will see later that even basic XML is a moderately complex specification; unless you are going to invest a lot of effort it is unlikely you can parse the full specification more efficiently than existing parsers.

• If you subset the specification you may be compromising the most crucial advantage that XML brings to your application—interoperability.

• Later in these lectures we will see how to use the Xerces parser from your own Java programs, to read XML input.

4444

Valid Documents An XML document may optionally include a Document

Type Definition (DTD).• This declares the names of all elements and attributes

appearing in the document, and how they may nest.

• The DTD also declares and defines entities that may be referenced from within document content.

A well-formed XML document that includes a DTD—and accurately follows the declarations in that DTD—is called valid.

4545

Invalid Documents It is quite possible to parse invalid (but well-formed)

documents, by using a non-validating parser. Many applications accept XML files without DTDs,

which are therefore technically invalid. Applications may exploit “validation” mechanisms

other than DTDs. An important one is XML Schema which we will discuss later.• A document validated against an XML Schema usually does

not have a DTD, so technically is invalid as far as the base XML specification is concerned.

• But of course it is valid relative to the XML Schema specification!

4646

Validation Side Effects The use of a validating parser certainly affects what

documents are treated as legal. In some cases “switching on” validation may also alter

the exact data passed from the parser to application. These effects will be considered when we discuss DTDs.

4747

Physical Entities An XML document is represented by one or more

“storage units” (typically files), called “entities”. We can enumerate five kinds:

• Document entities—root XML documents.

• Parsed external entities, which contain fragmentary XML content.

• External DTD subsets, which contain some or all of the DTD declarations needed by a document.

• External parameter entities, which also contain fragmentary DTD content.

• Unparsed external entities, which are usually complete “binary” files in some native format (not XML).

4848

Physical Structure The structure of a non-trivial XML document is

illustrated in the following figure. Every XML document must have exactly one document

entity. It may also involve zero or more external entities:

• The document entity may reference any number of external general entities. These can be parsed external entities or unparsed external entities. A parsed external entity may in turn reference other external general entities.

• The document may have at most one external DTD subset.• A DTD subset in the document entity, or an external DTD

subset, may reference any number of external parameter entities (which may in turn reference other external parameter entities).

4949

A Complex XML DocumentDocument

EntityExternal DTD

Subset

DTD

Content

External Parameter Entity

External Parameter Entity

Parsed External Entity

Parsed External Entity

UnparsedExternal Entity

Parsed External Entity

5050

Syntactic Features The following two tables summarize the “top-level”

syntax of all the constructs in XML. Full details will be given in later slides, as needed.• The first columns give an abbreviated example of the syntax,

the second columns (“what?”) describe the construct, and the third columns (“where?”) specify the places in an XML document where the construct may appear.

• In a “where?” column, Document means at the top-level of the document entity, and Content means in the kind of content allowed in an element—also called Parsed Character Data.

• A Literal is character data in quotes—exactly what can appear in a literal depends strongly on its context.

• XML Names will be discussed shortly.

5151

Syntax I: Logical StructuresExample Syntax What? Where?<Name …>Content</Name>

Element Document, Content

Name = Literal Attribute specification Element start tag

<?xml …> XML declaration/

Text declaration

Document/

External entity

<?Name …> Processing instruction Document, DTD, Content

<!-- … --> Comment Document, DTD, Content

<!DOCTYPE …> DTD Document

<!ELEMENT …> Element declaration DTD

<!ATTLIST …> Attributes declaration DTD

<!ENTITY …> Entity declaration DTD

<!NOTATION …> Notation declaration DTD

5252

Syntax II: References, Sections

Example Syntax What? Where?&#Code-point; Character reference Content, Literal

&Name; Entity reference Content, Literal

%Name; Parameter entity reference DTD

<![[ … ]]> CDATA section Content

<![IGNORE[ … ]]> Conditional section DTD

<![INCLUDE[ … ]]> Conditional section DTD

5353

Character Set Every XML document must be composed using the

Unicode character set.• The specification does not stipulate any particular encoding,

though defaults are UTF-8 or UTF-16.

• ASCII is a subset of Unicode, so you can create XML documents using your favorite, pre-Unicode, text editor.

5454

Allowed Character Ranges The allowed characters are

• white space:

U+0020 (space), U+0009 (tab), U+000A (line feed), U+000D (carriage return),

plus all Unicode characters higher than U+0020, excluding: The surrogate blocks U+D800..DFFF. U+FFFE and U+FFFF (noncharacters in Unicode. Note

FFFE16 is the BOM after byte-reversal in UTF-16)

Because some codes are forbidden, can’t consider including raw binary data in parsed XML (without additional encoding).

5555

Names and Name Tokens In XML, names are used in many places:

• An element has a name, an attribute has a name, an entity is referenced by a name, etc.

As in programming languages, there are rules about what constitutes a valid name (next slide).

In XML there is also a concept of name tokens, which are strings similar to names that can be specified as values of certain types of attribute.• They are less restricted. For example a number can be a valid

name token.

5656

What is a Name? Well-formed XML name tokens include any sequence of

the following characters:• a letter

• a digit

• a period (“.”), hyphen (“-”), underscore (“_”), or colon (“:”) Well-formed XML names include any name token that

starts with a letter, “_” or “:”.• Names that begin with “XML” (in upper, lower or mixed case)

are reserved. Case is significant in XML names: names are only

identified if they consist of identical character sequences.

5757

Examples Some well-formed names:

bryan

Big-Foot

Century_21

rated:PG

_.--._ Some illegal or reserved names:

2004

.com

#xFFFF

xml-1.0

5858

What is a Letter? In defining names and name tokens, XML 1.0 relies on

a definition of letter and digit. These are “Western-centric” notions—see Chapter 4 of the Unicode 4 standard for relevant discussion.

XML 1.0 defines a “letter” as an alphabetic or syllabic base character or an ideograph, and gives a (probably dated) recipe for extracting these from Unicode character databases.

XML 1.1 gives intentionally liberal Unicode ranges for NameStartChar and NameChar, then has a non-normative appendix suggesting what kind of characters should be preferred.

5959

Elements The most characteristic markup feature of XML is the

element. The basic syntax for an element is either

<Name>Content</Name>

or <Name/>

Here Name is an XML name—the type of element. In the first case, Content stands for further text, which

may include nested elements. The second case is called an empty element. It is

formally equivalent to<Name></Name>

6060

Examples from XHTML XHTML is an XML-compatible dialect of HTML. Examples:

• <h1>A header</h1>

This is an element of type h1 with content text “A header”.

• <body><h1>My Page</h1>Welcome.</body>

This is an element of type body. The content is a nested h1 element plus the text “Welcome.”.

• <hr/>

This is an empty element of type hr.• Examples here don’t illustrate, but it is allowed to include

white space in the tags, on either side of the element name.

6161

Attributes An element start tag (or an empty element tag) may

include one or more attribute specifications. An attribute specification has the general syntax

Name = "Value"

or Name = 'Value'

where Name is the name of the attribute and Value is some text.• This value text must not include the literal character “<”

(directly or through an entity replacement, see later).• It must not include the character (" or ') used to delimit it.• The value text may include line breaks and other white space.• Attribute specifications can have white space around the “=”.

6262

Examples from SVG SVG (Scalable Vector Graphics) is an XML notation

for representing graphical objects. Examples:

• <rect x="50" y="50" width="100" height="75"/>

An empty element representing a rectangle at position (50,50), with shape 100 × 75 pixels.

• <g transform='rotate(45)'> <circle cx='150' cy='50' r='25'/> <text x='125' y='100'>A Circle</text> </g>

A group containing a circle and some text. The group as a whole is rotated 45° about the origin, as specified by the attribute with name transform and value rotate(45).

6363

Possible Display of Examples

6464

The Document It is recommended to start a document entity with an XML

declaration, although technically this is optional.• An XML declaration must be strictly the first thing in the file—no white

space can go before an XML declaration (a single BOM is allowed). Every XML document entity must contain exactly one top-level

element called the root element.• Of course there can be any number of elements nested inside the root

element. If there is a Document Type Declaration, this must appear before

the root element. Miscellaneous other things are allowed at the document level

(anywhere between XML declaration and end of file):• White space• Comments• Processing instructions

Anything before the root element is collectively called the prolog.

6565

Document Layout

XML declaration (optional)

…optional comments and processing instructions …

Document Type Declaration (optional)

…optional comments and processing instructions …

root element (required)

…optional comments and processing instructions …

Prolog

6666

XML Declaration It is strongly recommended to start any document

entity with an XML Declaration. The XML declaration has syntax:

<?xml version=Literal …optional declarations… ?> The version specification and optional declarations have

a syntax similar to attribute specifications on elements (declared values are quoted the same way).• The value assigned to version must be 1.0 or 1.1, according

to which version of the XML specification you are following (1.0 may be prudent, for now).

• The optional declarations are the encoding declaration and the standalone declaration.

6767

Example XML Declaration Here is a complete (and completely pointless) XML

document with an XML declaration—including optional parts—and an empty root element:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<my-root/>

• In general the encoding declaration, if included, specifies the character encoding scheme.

This must be an encoding of Unicode, but it may be some encoding not defined in the Unicode standard.

• Informally a standalone declaration says whether this document stands alone. See later for details. Meanwhile, “if in doubt, leave it out”—the default is always safe.

6868

On Encoding Declarations If you are awake, you may wander what is the point of

declaring the character encoding inside a document encoded according to that scheme—apparently we can’t read the declaration unless we already know the scheme!• The XML specification gives an argument about auto-

detection of the encoding scheme allowing us to read far enough into the document to read the contents of the declaration.

• Debatable how much point there is giving an explicit encoding declaration—seems more logical to rely on auto-detection of standard encodings alone, or external metadata?

6969

Comments Comments can appear at the top level of a document, in

the DTD, or in the document content.• Specification says the parser may or may not pass the text of

the comment to the application.• XML comments are not white space—placement is more

restricted. Comments have the syntax

<!-- Text -->

where Text is any text, except it must not contain two adjacent hyphens, “--”, and it must not end with a hyphen:<!-- This--is--not--legal --><!-- Neither is this--->

7070

Processing Instructions Processing instructions can appear anywhere that

comments can—they effectively are comments so far as the XML parser is concerned.• But the specification requires the parser to pass the text of a

processing instruction to the application. The syntax is

<?Target Text ?>

where Target is an XML name, and Text is any text, except that it must not contain the string “?>”.

Processing instructions are a convenience for the application. They allow application-specific annotation of the document (example <?php … ?>). The target name may be declared as a notation (see later).

7171

The Document Type Definition

7272

Document Types In the syntax for the document entity, we saw that the

document type declaration was an optional feature. This declaration, if present, contains the document type

definition, or DTD. A validating parser will read the DTD, which should

contain (among other things) declarations of all the elements and attributes appearing in the body of the document.

The DTD is required if the parser is validating, but optional for a non-validating parser• Even a non-validating parser may read the DTD if it is

present, to look for entity declarations. These will be discussed later.

7373

Document Type Declaration The document type declaration, if present, appears

before the root element of the document. The most general form of this declaration contains

three things:1. The type (i.e. name) of the following root element

2. An identifier for an External DTD Subset

3. An Internal DTD Subset.

Items 2. and 3. are optional. The DTD itself is either given in an external file

(“external subset”), or “in line” in the document (“internal subset”), or divided between the two.

7474

General Syntax General syntax is one of:

<!DOCTYPE Name [ Declarations ] >

<!DOCTYPE Name External-ID >

<!DOCTYPE Name External-ID [ Declarations ] >

where Name is the type of the root element, External-ID is an identifier for an external entity—a separate file containing the external DTD subset—and Declarations is the internal DTD subset.

• Syntactically, the form:

<!DOCTYPE Name >

is allowed, but can never yield a valid document. (Why?)

7575

External Entity Identifiers External entity identifiers can occur in a couple of

places: they will be discussed in the next section when we discuss entity declarations.

Meanwhile the simplest form is “SYSTEM Literal”, where Literal contains a URI—a file name or URL or (in principle) a URN.

So a valid document with an external DTD might be:

<!DOCTYPE my-root SYSTEM "my-type.dtd"> <my-root> … </my-root>

where my-type.dtd is a local file name.

7676

Example DTD A DTD (subset) contains a series of DTD Declarations.

• Comments and processing instructions can be interleaved amongst the declarations.

A possible pair of declarations in an internal DTD:

<!DOCTYPE my-root [

<!ELEMENT my-root (my-leaf)>

<!ELEMENT my-leaf EMPTY> ] >

• These declare two types of element: my-root and my-leaf.• They also specify a very restricted kind of nesting. The only

valid document content would be something equivalent to:

<my-root> <my-leaf/> </my-root>

7777

Validation Using dom.Writer If I save the document type declaration and root

element in a file called “internaldtd.xml”, I can validate the file with the Xerces parser by using the dom.Writer sample application as follows:

> java dom.Writer –v internaldtd.xml

• If validation is successful, this simply prints a formatted version of the input file. If validation fails, you will see error messages early in the output.

• The –v flag is needed here: validation is “off”, by default. “v” must be lower case: “-V” forces no validation.

7878

Element Type Declarations

The general form of an element type declaration is:

<!ELEMENT Name Content-Specification >

where

• Name is an XML name, the element type, and

• Content-Specification describes the content allowed in elements of type Name (i.e. it describes the structure of any text surrounded by Name tags).

• No element type may be declared more than once.

7979

Empty Content Specification The simplest content specification that can appear in a document

type definition is “EMPTY”. This means elements of this type never have any content, not even white space!

E.g. the declaration <!ELEMENT b EMPTY>

allows elements: <b/>, <b></b>

but not: <b>Hello</b>, <b><!-- Hello --></b>, <b> </b>,

<b> </b>

• The form <b/> is preferred when the element type is declared EMPTY; <b></b> is preferred when empty content matches another declaration.

8080

Parsed Character Data The content specification “(#PCDATA)” means elements include

only “flat” text—no nested elements are allowed. PCDATA stands for “Parsed Character Data”.• Comments and processing instructions are allowed in parsed character

data.

E.g. the declaration <!ELEMENT a (#PCDATA)>

allows elements: <a>Hello, hello </a>,

<a>Hello <!-- pause --> hello</a>,

but not: <a>Hello <b/> hello</a>,

<a> <a>Hello</a> hello </a>.

8181

Content Models and Mixed Content Two sorts of content specification allow for nested

elements:• If an element may contain only nested elements, with no

interspersed character data, one gives a content model. E.g. typical content like this:

<c> <a>Hello</a> <b/> </c> c contains nested a and b elements, but no directly nested

character data.• Otherwise, if character data and elements can both appear,

the content is specified (less precisely) as mixed content. E.g. typical content like this :

<c> Hello <b/> </c> c can’t have a content model because character data

“Hello” directly nested in c.

8282

Structure of Content Models A content model is a kind of regular expression, in

which elements as whole are treated as the tokens. The content model:

(a, b, c)

means the content is a sequence of an a element, a b element, and a c element, in exactly that order.

The content model: (a | b | c)

means the content is a choice of an a element, a b element, or a c element.

8383

Composing Content Models You can compose sequences and choices with appropriate

parentheses, e.g. ((a | b), c)

means a sequence of a c or a sequence b c, and ((a, b) | c)

means a sequence of a b or a single c. In these expressions you can follow an individual element name

or a parenthesized expression by one of the modifiers ?, *, or +.• ? Means the term is optional (occurs zero or one times),

• * Means the term can be repeated zero or more times,

• + Means the term can be repeated one or more times, e.g.

(a | b | c)+

means any combination of one or more as, bs and cs

8484

Content Model Example Suppose the format of a simple “report” is: a title,

followed by any mix of paragraphs and figures, followed by an optional bibliography section.

A suitable document type might start:

<!DOCTYPE report [ <!ELEMENT report (title, (paragraph | figure)*, bibliography?) > <!ELEMENT title (#PCDATA)> <!ELEMENT paragraph (#PCDATA)> <!ELEMENT figure EMPTY> <!ELEMENT bibliography (reference)* > … ] >

8585

Example report Data<report>

<title>Early Use of XML</title>

<paragraph>Recently uncovered documents (see figure) prove XML was first used by the Incas of ancient Peru.</paragraph>

<figure source="notafake.jpg"/>

<paragraph>The author is grateful to W3C for making this research possible.</paragraph>

<bibliography> …

</bibliography>

</report>

8686

Content Model Miscellany As stated earlier, character data cannot appear in a

(valid) element described by a content model. But white space, comments, and processing instructions can be interleaved in an allowed element sequence.

The first character in the top level of the content model expression must be a left parenthesis: (a*) and (a)* are allowed as content models, but a* is not.

The specification says a content model must be deterministic. This fairly technical requirement (needed for SGML compatibility only) is outlined in the following few slides.

8787

Positions in Content Models For purposes of this discussion, we label every

appearance of an element type in a content model with some unique identifier. We call a labeled element type a position.• E.g. if the content model is:

(a, (b, c)*, (b | a))

then a labeled version could be:

(a1, (b2, c3)*, (b4 | a5))

which includes positions a1, b2, c3, b4, and a5.

8888

“Deterministic” Imagine reading some input string of elements from the

document, one element at a time, and assume that at the time each individual element is read we have no knowledge of what follows it in the document.

A content model is deterministic if, as elements are read from the document and matched against the content model, there is never more than one position in the content model that could match each inputted element.• This must be true for all possible input strings.

To give a full formal definition, need to define a follows relation between positions. Won’t do that here.

8989

Determinism Examples The content model

((b1, c2) | (b3, d4))

is not deterministic. Suppose the first element read is a b. At the time it is read, the b may match against b1 or against b3.• The content model (b, (c | d)) is equivalent and deterministic.

The content model (a1, (b2, c3)*, (b4 | a5))

is not deterministic. Suppose one a element has already been read and the next element encountered is a b. At the time b is read, it may match against b2 or b4.• In this case DTDs provide no equivalent deterministic model.

9090

Mixing Content Many XML formats allow character data and text to be

intermingled in the content of a single element.• For example XHTML allows a mixture of text and element

markup in the content of its body element (or, say, its p element).

Neither the parsed character content specification “(#PCDATA)”, nor a content model specification, will allow this mixture.

Instead one must use a mixed content specification.

9191

Mixed Content Specifications A mixed content specification looks like:

(#PCDATA | Name1 | Name2 | … | Namen)*

where Name1, Name2, …, Namen are different element types.• This may look like a content model, but it is not! Mixed

content specification must appear in the ELEMENT declaration in exactly the form above.

• You can’t replace any of the Name fields with more general expressions; the #PCDATA token must appear first in the parentheses; the right parenthesis must be followed by *.

Nevertheless the matching rules are what the syntax suggests: a valid element can contain an unrestricted mix of character data and the listed types of element.

9292

Mixed Content Example Earlier we said that an element c like :

<c> Hello <b/> </c>

cannot be described by a content model. It can be described by the declaration:

<!ELEMENT c (#PCDATA | b)* > Note however this allows any sequence of text and b

elements.• There is no way to specify there must be exactly one piece of

text preceding exactly one b element.

9393

ANY Content The last kind of content specification allowed in an

element declaration is “ANY”. This is equivalent to a mixed specification naming all

elements declared anywhere in this DTD. It does not allow appearance of elements that are

undeclared; and all nested elements must be valid according to their own declarations!

9494

Attribute-List Declarations For a valid document, besides declaring the elements

themselves, you must also declare all attributes that appear on all elements.

An ATTLIST declaration declares a list of attributes for a named element type.

For each attribute in the list, the declaration defines three things:

1. the name of the attribute,

2. the type of values it can be assigned, and

3. whether the attribute has a default value, and if so what it is.

9595

Syntax of ATTLIST Is fairly unstructured—it contains the name of the

element the attributes apply to, then simply a list of triples:

<!ATTLIST Element-Name Name1 Type1 Default1

Name2 Type2 Default2

… >

where Name1, Type1, Default1, etc are the attribute

properties mentioned on the previous slide.

9696

Attribute Types There is a longish list of allowed types for attributes

(not nearly as long as in XML Schema). The specification subdivides them into:• string type,

• tokenized types, and

• enumerated types. The simplest and most general is string type, indicated

by the keyword CDATA in the attribute declaration.• The value of an attribute is declared with this type can be any

literal string. Other types will be described shortly.

9797

Attribute Defaults The attribute type is followed by a default rule, one of:

Literal

#FIXED Literal

#REQUIRED

#IMPLIED

• Literal is a default value. The attribute is always logically present (passed to application), but optionally specified.

• #FIXED modifier means the attribute can only take its default value (trying to specify something else is invalid).

• #REQUIRED means the attribute must be specified (so no default is necessary).

• #IMPLIED means the attribute is optional (if unspecified it is absent, so no default is necessary).

9898

Attribute Default Examples Attribute list declaration:

<!ATTLIST a val CDATA "nothing" fix CDATA #FIXED "constant" req CDATA #REQUIRED opt CDATA #IMPLIED>

Instances of element a:<a val="something" fix="constant“ req="reading" opt="extra"/>

<a req="no experience"/> <!-- OK: val = “nothing”, fix = “constant”, opt absent. -->

<a fix="variable"/> <!-- Invalid! fix not “constant” and req unspecified. -->

9999

“Tokenized” Attribute Types In place of “CDATA” we may have:

• NMTOKEN: syntax of attribute value is an XML name token (defined earlier), e.g. “2004”.

• NMTOKENS: …a list of name tokens.

• ID: attribute is an identifier for this element.

• IDREF: attribute is a reference to another element in this document.

• IDREFS: …a list of references to other elements.

• ENTITY: attribute is a reference to an unparsed external. Entity (see next section).

• ENTITIES: …a list of references to unparsed external entities.

Items in a “list” are separated by white space.

100100

Element Identifiers The value given to an attribute with type ID should

follow the syntax of an XML name. This name acts as an identifier for the element instance on which it appears. For validity:• For each element type in the DTD, there should be at most

one attribute declared with type ID.

• All identifiers on all element instances in the document must be different (regardless element type).

For validity, the value given to an attribute of type IDREF should be an identifier for an element appearing somewhere in the document.

101101

Example Assume the attribute name on element type agent is

declared to have type ID, and the attribute boss is declared to have type IDREF.

<agent name="Alice" boss="Alice"/><agent name="Bob" boss="Alice"/><agent name="Carole" boss="Alice"/><agent name="Dave" boss="Bob"/>

Carole

Alice

Bob

Dave

This document captures the hierarchy illustrated on the right. Can use this technique to represent general graphs.

102102

Enumerated Attribute Types The type field in an ATTLIST declaration may have

the form:

(Token1 | Token2 | … | Tokenn)

where each Token is an XML name token. This says that the specified value of the attribute must

be one of these token values.• Example: attribute list declaration: <!ATTLIST a color (red | green | blue | white) "white">• Instances of element a:

<a color="red"/> <!-- OK. -->

<a color="black"/> <!-- Invalid! -->

103103

Notation Attribute Types The type field in an ATTLIST declaration may have the

form:

Notation (Name1 | Name2 | … | Namen)

where each Name is declared elsewhere in the DTD as a notation (see next section).

The handling of this type is similar to enumeration types.

104104

Attribute Declaration Miscellany Attributes for a single element type may all appear in a

single ATTLIST declaration, or they may be divided over several ATTLIST declarations.

It is allowed to declare the same attribute more than once, but any declarations after the first are ignored.

It is allowed (but pointless) to declare an attribute for an undeclared element type.

105105

Attribute Order Attribute specifications can appear in element tags in

any order. The same attribute cannot be specified more than once

on a single element.

106106

Entities, References, and other Processing Issues

107107

Collecting Things Together So far we have described an ideal subset of XML in

which a document contains DDT, elements, attributes, and character data, all laid out linearly.

In reality fragments of content and DTD may be defined in places other than where they ultimately appear (perhaps in other files), and portions of text may need special processing before they are made available to the application.

To complete the discussion we must cover:• Character references and entity references.• Internal and external entity declarations, and notations.• CDATA sections.• Conditional sections.

108108

Character References Character references can be viewed as an “escape”

mechanism that allows us to include specially-treated or hard-to-type characters in the XML document.

They include the Unicode code point for a character, taking either of the forms “&#dd…;” or “&#xXX…;”, where ds are decimal digits and Xs are hexadecimal digits.

For example:• &#38; or &#x26; represents “&” (ampersand).• &#60; or &#x3C; represents “<” (left angle bracket).• &#963; or &#x3A3; represents “Σ” (large Greek sigma).

One application is for including the literal characters “<” or “&” in parsed character data.

109109

CDATA Sections CDATA sections provide a way of including a section of

character data in an XML document. • The section can include “<” and “&” characters without

escaping—in a CDATA section these characters have no special significance (so “markup” syntax is ignored).

The syntax is<![CDATA[ Text ]]>

where Text is any text, except that it must not contain the string “]]>”.• One application is for including scripting in XML—e.g.

JavaScript uses < and & for operators.• You cannot include any characters generally forbidden in

XML: can’t put raw “binary” data in a CDATA section!

110110

Entity References A character reference includes a single Unicode

character in the document. An entity reference includes the content of some “entity”, which may be the contents of an external file.

The simple syntax is: &Name;

where Name is an XML name. The name of the entity, Name, must have been declared

in the document DTD. There are just five exceptions to this rule.

111111

Predefined Entities As a convenience the entities amp, lt, gt, apos, and quot

are considered predefined.• You may declare them in a DTD, but it isn’t necessary.

They must contain the single-character values:• &amp; expands to “&” (ampersand).• &lt; expands to “<” (left angle bracket, or less than).• &gt; expands to “>” (right angle bracket, or greater than).• &apos; expands to “'” (single quote, or apostrophe).• &quot; expands to “"” (double quote).

Provide a more convenient way of including reserved characters.

Note these are entity references, not character references (affects processing in some contexts).

112112

A Hofstadteresque XHTML Example<html> <body> The source of this document is: <pre>&lt;html&gt; &lt;body&gt; The source of this document is: &lt;pre&gt;&amp;lt;html&amp;gt; … &amp;lt;html&amp;gt; &lt;/pre&gt; &lt/body&gt;&lt/html&gt; </pre> </body></html>

113113

Declaring Entities Entities are defined in the DTD by an ENTITY

declaration with the syntax: <!ENTITY Name Definition >

Here Name is of course the name of the declared entity.

This general form can declare:1. Internal entities2. Parsed external entities3. Unparsed external entities

Only 1. and 2. can be included by an entity reference.• Later we will see another form of the ENTITY declaration

that defines parameter entities.

114114

Internal Entities The simplest kind of entity is an internal entity. In this

case the Definition is just a literal string, and the entity behaves like a kind of macro.

The string can contain character data and markup. E.g.:

<!ENTITY me "Bryan" ><!ENTITY lt "&#38;#60;" ><!ENTITY Sigma "&#x3A3;" ><!ENTITY flag "Stars &amp; Stripes" ><!ENTITY icon '<image xlink:href="icon.jpg" />' >

These declarations allow the shorthand forms “&me;”, “&lt;”, “&Sigma;”, “&flag;”, “&icon;” in the body of the document.

115115

Replacement Text for Internal Entities Character references (and also parameter entity

references, see later) appearing in the entity definition are expanded when the declaration is processed. So the replacement text for lt is “&#60” and for Sigma is “Σ”.

References to other entities appearing in an entity definition are not expanded at the time the declaration is processed (see next slide). The replacement text for flag is “Stars &amp; Stripes”.

116116

Expansion of General Entities When the parser encounters an entity reference in the

content of a document, the reference is expanded to the replacement text for the entity.

The parser then resumes processing, starting at the beginning of the inserted text. If the replacement text contains further entity references, these are replaced in turn, as they are encountered.

An entity reference must not expand to fragmentary markup. For example, the replacement text may include complete elements, but not isolated start tags, or isolated “<” characters.

117117

External Entities For an external entity the Definition appearing in the

ENTITY declaration is an external ID—same syntax as in DOCTYPE declarations.

There are two forms (for parsed external entities):

<!ENTITY Name SYSTEM URI-Literal >

<!ENTITY Name PUBLIC Public-ID-Literal URI-Literal >

The first form is fairly self-explanatory: URI-Literal is typically an absolute or relative URL.• If relative, it is relative to the location of the document entity

or external entity containing the ENTITY declaration.

118118

References to External Entities External entities may be referenced anywhere markup

can appear in document content, except in the literal value of an attribute.• Internal entities can be referenced in attribute specifications.

119119

Public Identifiers Where an external entity is likely to widely used one

can give a PUBLIC identifier.• This acts as a logical identifier, something like a URN.• XML standard itself doesn’t specify a syntax for public

identifiers, but XML-based standards usually follow the SGML format for Formal Public Identifiers.

For example the DOCTYPE declaration for an XHMTL 1.1 document should be:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

• When an document type declaration or an entity declaration gives a public identifier, this must be followed by a URI, which acts as a fall back for locating the entity.

120120

Format of Parsed External Entities A parsed external entity may optionally start with a text

declaration. The rest of the file should be Content, such as may appear in an XML element.• The replacement text for an external entity is the unprocessed

content (minus text declaration). An external DTD subset may optionally start with a text

declaration. The rest of the file should be Declarations, such as may appear in an internal DTD subset.• We will see that external DTD subsets allow a couple of

processing options beyond those allowed in internal subsets.

121121

Text Declaration It is recommended to start a parsed external entity with

a text declaration. The text declaration has syntax:

<?xml …optional version declaration… encoding=Literal ?>

i.e. syntax is identical to an XML declaration, except that now it is the encoding declaration that is mandatory (and no standalone declaration is allowed).• It is allowed for the entity character encoding to be different

from the document that references it.

122122

XHTML Example A document entity:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" [

<!ENTITY myfooter SYSTEM "myfooter.xml" >

]>

<html> <body> This is one page in a complicated Web site. &myfooter; </body></html>

An external entity “myfooter.xml”: <hr/>Maintained by

<a xlink:href="mailto:[email protected]"> Bryan Carpenter</a>.

123123

Combining DTD Subsets The example above is illustrative—sadly no known

Web browser will recognize this kind of DTD. In principle it shows how one can use both an external

subset and an internal subset in the same DTD.• We will see shortly that one can go further with this using

parameter entities.

• Although the internal subset appears after the reference to the external subset, the internal subset is processed first. If there are conflicting declarations of entities, the internal subset takes precedence.

• In general if an entity or attribute is declared more than once, declarations after the first are ignored.

124124

Unparsed Entities Unparsed entities are declared in a similar way to

parsed entities, but used in a completely different way. The XML parser does not include the contents of the

entity in the XML data, it just forwards a reference to the application, e.g. as a URI.

An unparsed entity must be annotated with information about what type of content it holds (or, equivalently, what kind of application can process it).• This notation is also passed to the application.

We give the syntax on the following slides—there is a quite lot of syntax with modest payoff.

125125

Notation Declarations Before using a notation to label an unparsed entity, its

notation is declared. Declaration syntax is quite free, e.g. the following are legal:<!NOTATION myformat ><!NOTATION jpeg SYSTEM "image/jpeg" ><!NOTATION perl SYSTEM "/usr/bin/perl" ><!NOTATION Name PUBLIC Public-ID >

• According to the specification, public and system identifiers may “locate a helper application capable of processing data”.

• Some authors interpret that they may be, say, MIME types.• A declared notation may also be the target of a processing

instruction, or specified in the value of an attribute of type NOTATION or NOTATIONS.

126126

Declaring and Using Unparsed Entities Declaration similar to parsed external entities, except it

ends with “NDATA Name” where Name is a notation. E.g.:

<!ENTITY notafake SYSTEM "notafake.jpg" NDATA jpeg >

The only way to reference a declared unparsed entity is in an attribute declared of type ENTITY or ENTITIES, e.g. if we have: <!ATTLIST figure source req ENTITY #REQUIRED >

then the name notafake can finally be used as follows:

<figure source="notafake"/>

127127

Parameter Entities General entity references are expanded when they

appear to in the document content. They are not useful for abstracting sections within a DTD.• The only time general entities are expanded while processing

a DTD is if they appear in the default value of an attribute. There is a separate “macro-expansion” mechanism that

is designed for use within DTDs. This uses parameter entities.

128128

Declaration and Reference Declaration of parameter entities follows a syntax

similar to parsed general entities, except the name is preceded by a “%” sign, e.g.:

<!ENTITY % my-dec '<!ENTITY me "Bryan" >' >

This declares the internal parameter entity my-dec. Its value is the declaration of the general entity me.

To actually insert the declaration of me, later in the DTD, one would use a parameter entity reference, which looks like an entity reference except it uses “%”, e.g.:

%my-dec;

129129

External Parameter Entities Like general entities, parameter entities may be

external, e.g.

<!DOCTYPE my-root [ <!ENTITY % my-type SYSTEM "my-type.dtd"> %my-type; ]>

is for all intents and purposes equivalent to:

<!DOCTYPE my-root SYSTEM "my-type.dtd">

130130

Attribute Groups Parameter entities can be used to modularize a DTD,

for example by splitting it over multiple files which may be selectively included (c.f. conditional sections, described shortly).

Another common use is to collect together lists of related attribute definitions that appear in more than one attribute list, e.g.

<!ENTITY % image-atts "source CDATA #REQUIRED height NMTOKEN #IMPLIED">…<!ATTLIST image %image-atts; ><!ATTLIST figure %image-atts; caption CDATA #IMPLIED >

131131

Restrictions Some reasonable restrictions are placed on references

to parameter references, to ensure their replacement text is a “complete” fragment of DTD syntax.• This is similar to the requirement that a parsed general entity

must expand to a “complete” fragment of markup. There is a more ad hoc requirement that in an internal

DTD subset any parameter references must expand to one or more complete DTD declarations• Example of previous slide would be illegal in internal subset

because %image-atts; isn’t a complete declaration.

• This restriction does not apply inside external parameter entities referenced from the internal subset.

132132

Parameter Entity Expansion When a parameter entity is referenced anywhere in the

DTD except in the literal text of another entity declaration, extra space characters are added at the ends of the replacement text.

E.g., this will not work:

<!ENTITY % my-prefix "MYAPP_" ><!ELEMENT %my-prefix;Root (#PCDATA) >

as element “name” expands to “ MYAPP_ Root”, with illegal space. Following is OK:

<!ENTITY % my-prefix "MYAPP_" ><!ENTITY % my-root "%my-prefix;Root" ><!ELEMENT %my-root (#PCDATA) >

133133

Conditional Sections Two special kinds of section can appear in an external

DTD subset:<![IGNORE[ Text ]]><![INCLUDE[ Declarations ]]>

These are designed for use in the following the idiom:<![%my-control;[ … conditional Declarations … ]]>

where my-control is the name of a parameter entity defined earlier with value IGNORE or INCLUDE.• For example my-control can be declared in the internal

subset, affecting what parts of an external DDT are effective.• The IGNORE section has the unusual property that other

sections can nest within it. Thus also convenient for “commenting out”.

134134

Processing Miscellany The XML parser will present character data extracted

from element bodies and attributes to the application. It does this after suitable processing. E.g.• Character references may be replaced by actual character.

• CDATA sections may be replaced by their contents.

• Entity references may be expanded. Various other normalizations take place.

• Every CR LF sequence, or CR not followed by LF, is replaced by a single LF.

• A validating parser will normalize the values of attributes according to their type (next slide).

135135

Normalization of Attribute Values For every attribute value, the following normalizations

are applied by a validating parser:• Convert every literal white space character (Line Feed, etc) to

a space character (#x20).• Expand character references• Expand entity references (and apply normalizations above to

replacement text). For attributes that have type other than CDATA, all

leading or trailing spaces are then removed. Also sequences of spaces are replaced by single spaces.• Note that these transformations generally will not be applied

by a non-validating parser (so the application will see different data).

136136

Standalone Documents Various markup declarations affect the data passed to

the application.• E.g. declarations of entities, default values of attributes, types

of attributes affecting normalization,… The XML declaration may include standalone="yes" if

all those markup declarations are in the internal DTD subset (and moreover not in parameter entities).

Note this restricts only external declarations, not external entities: somewhat mysteriously, a document that is “standalone” can have references to external entities, provided they are declared internally.

137137

xml:space and xml:lang If these attributes are used or declared, the

specification restricts what values they may take.• xml:space may only be declared with an enumerated type

restricted to one or both of “default” and “preserve”, and may only be specified with one or other of these values.

• xml:lang may only be specified with a value that is a standard IETF RFC 3066 language tag (see also ISO-639).

Intended uses are to specify handling of white space, and the natural language of the document. This handling is up to the application: a parser will likely forward these the same as other attributes, and they don’t affect its handling of text.

138138

Conclusion We have covered everything in the basic XML

specification, but there is still a long way to go with XML.

Coming up:• The Document Object Model

• Namespaces in XML

• Introductory XPath

• XML Schema

139139

The Document Object Model: Programming Interfaces

140140

The Document Object Model The Document Object Model or DOM is a set of W3C

specifications that define standard ways to access parts of an XML or HTML document from within a program.

This addresses, for example, how a Java program extracts the data from an XML document, after the document has been processed by an XML parser.

But the origins of the DOM lie in JavaScript programming for interactive and dynamic Web pages.

141141

Dynamic HTML Example<html> <head> <script language="javascript"> function appendText() { paraNode = document.getElementsByTagName("p") [0] ; textNode = paraNode.firstChild ; textNode.nodeValue += " Ouch!" ; } </script> </head> <body> <p>Hello.</p> <form> <input type="button" value="Push Me" onclick="appendText()"/> </form> </body></html>

142142

The DOM from JavaScript The DOM represents an XML (or HTML) document as

a series of nodes. In the example, the JavaScript method appendText() is

called when the user clicks on the button labeled “Push Me”.• This method identifies the DOM node representing the <p>

element, extracts the nested text node, and modifies the data (the text) associated with that text node.

• The function getElementsByTagName() is a method defined in the DOM, and firstChild and nodeValue are node properties in the DOM.

143143

A Document is a Tree Here is how

an example fragment of HTML and can be thought of as a tree.

144144

DOM Nodes In the DOM, one builds the document tree out of a set

of Node objects Each Node object has a set of capabilities (properties

and methods) and implements specific interfaces.

Node

Node Node Node Node

Node Node Node NodeNode Node

Node NodeNode Node Node Node

145145

Node Types in Core DOM DOM APIs are defined as a series of interfaces (this term

meant in a similar way to how Java uses it). Node itself is an interface, and the following interfaces

extend Node:• Document: node representing entire document, including

prolog.

• DocumentFragment: convenience node, representing an incomplete document.

• Element: node representing an element.

• Attr: node representing an attribute attached to an element.

• Text: node representing “character data”.

• Comment: node representing a comment.

146146

Further Node Types Nodes on previous slide cover requirements of HTML.

Following extra node types are needed to represent general XML:• EntityReference: node representing an entity reference.

• Entity: node representing content of a parsed external entity.

• ProcessingInstruction: node representing a processing instruction.

• CDATASection: node representing a CDATA section.

• Notation: node representing a notation declared in the DTD.

147147

The Node

Interface

Constants

Properties

Methods

148148

Remarks DOM interfaces are traditionally defined in the OMG’s

Interface Definition Language.• IDL interfaces can be mechanically converted to interface

definitions for languages like Java, ECMAScript, C++, … By design, the Node interface is sufficiently generic

that any DOM tree can be manipulated using the methods and properties of this interface alone.• Use of the more specialized derived interfaces—Element,

Attribute, etc—is at programmer option. Those interfaces provide more type-safety, and various type-specific convenience methods and properties.

149149

nodeName, nodeValue Properties Each node type has different rules for values of some of the

properties—most importantly nodeName and nodeValue. The attributes property is only relevant for an element node.

Node Type

150150

Node Children The following node types may have children, as

indicated, in the DOM tree:• Document: DocumentType, Element (maximum of one),

ProcessingInstruction, Comment.• DocumentFragment: Element, Text, CDATASection,

EntityReference, ProcessingInstruction, Comment.• Element: Element, Text, CDATASection, EntityReference,

ProcessingInstruction, Comment.• Attr: Text, EntityReference.

• EntityReference: Element, Text, CDATASection, EntityReference, ProcessingInstruction, Comment.

• Entity: Element, Text, CDATASection, EntityReference, ProcessingInstruction, Comment.

151151

The Document Interface The Document interface has some indispensable factory methods

for creating other node types:

152152

Levels of the DOM W3C recognizes 4 stages of evolution of the DOM,

called levels:• Level 0: Legacy HTML “DOM” features from Navigator 3.0

and IE 3.0. No W3C specification.

• Level 1: Specification completed 1998. Basic tree structure for HTML and XML:

153153

Level 2 DOM Specification completed 2000. Notably added support for Events.

Also XML Namespaces, DOM representation of Cascading Style Sheets, and new facilities for manipulating the tree.

154154

Level 3 DOM Specification still in progress. It will add, for example, support

for XPath search operations on a DOM tree, and formally define the mapping between DOM and XML Infoset. Load/Save standardizes parsing/serialization APIs.

155155

Using Xerces Earlier we used the dom.Writer sample application

from the Xerces project to validate XML documents. In the following slides we illustrate how to use the

Xerces parser from your own Java program.• We will use the conventional Java JAXP API to control

parsing.

• In future the Load/Save features of Level 3 DOM are likely to be the preferred API (Xerces already provides an experimental implementation).

• To control certain aspects of parsing, it may be necessary to use the org.apache.xerces.parsers.DOMParser implementation class directly.

156156

Checking Well-Formednessimport javax.xml.parsers.DocumentBuilder;import javax.xml.parsers.DocumentBuilderFactory;import org.w3c.dom.Document;

import java.io.File;

public class MyChecker {

public static void main(String [] args) throws Exception {

File source = new File(args [0]) ;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); factory.setNamespaceAware(true) ;

DocumentBuilder builder = factory.newDocumentBuilder();

Document document = builder.parse(source) ; }}

157157

Remarks In JAXP one first obtains a parser factory, then uses the

factory to create a parser instance (called a document builder in JAXP).• Behaviors of the parser (e.g. whether namespace-aware,

validating/non-validating, …) are controlled by setting properties of the factory before creating the instance.

Actual parsing is done by the parse() method, which returns a DOM instance (Document node).• Example above does nothing if document is well-formed;

prints an exception if it is not.

158158

Validating

public static void main(String [] args) throws Exception {

File source = new File(args [0]) ;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); factory.setNamespaceAware(true) ; factory.setValidating(true) ;

DocumentBuilder builder = factory.newDocumentBuilder(); builder.setErrorHandler(new MyErrorHandler()) ;

Document document = builder.parse(source) ;}

159159

Simple Error Recovery

import org.xml.sax.* ;

class MyErrorHandler implements ErrorHandler {

public void warning(SAXParseException e) throws SAXException { System.out.println(e.getMessage()); }

public void error(SAXParseException e) throws SAXException { System.out.println(e.getMessage()); }

public void fatalError(SAXParseException e) throws SAXException { System.out.println(e.getMessage()); System.exit(1); }}

160160

Processing the DOM The two previous examples ended by creating a

Document object, then did nothing with it. The following example summarizes several important

Java DOM classes and methods in one recursive method that extracts all useful information in the elements, attributes, and text nodes of a document.• Assumes a declaration like

import org.w3c.dom.* ;

is in effect.

• After parsing the document, this method could be invoked from main() by:

process(document.getDocumentElement()) ;

161161

Recursive Processing of DOMstatic void process(Node node) {

System.out.println(node.getNodeName()) ;

switch(node.getNodeType()) {

case Node.ELEMENT_NODE: NamedNodeMap attributes = node.getAttributes() ; for(int i = 0 ; i < attributes.getLength() ; i++) process(attributes.item(i)) ;

NodeList children = node.getChildNodes() ; for(int i = 0 ; i < children.getLength() ; i++) process(children.item(i)) ;

break ;

case Node.TEXT_NODE: case Node.ATTRIBUTE_NODE: System.out.println(node.getNodeValue()) ; }}

162162

Interactive XML We have been discussing use of the DOM in Java as a convenient

intermediate representation for handling XML data. This is different from the original application of the DOM,

illustrated in our introductory dynamic HTML example. In that application, JavaScript event handlers manipulate the

DOM embedded in a Web browser, dynamically altering the text displayed by the browser.

This technique can also be applied to XML documents that have a visual representation, for example to Scalable Vector Graphics documents.

Following screen captures are from www.svgarena.org.

163163

XML Chess

164164

XML Solitaire

165165

Namespaces in XML

166166

Motivation XML DTDs allow us to define a set of customized

element, attribute, and entity names to support a particular application.

But in a document of any complexity one may want to mix and match markup from different application domains.

In particular in a Web page authored using XHTML tags one might want to embed pictures represented with SVG elements; in a technical Web page one might want to embed mathematical formulae represented using MathML elements.

167167

Namespaces Typically different XML vocabularies are designed by

different committees, and it is impractical to avoid name clashes.

In general we can’t simply merge the DTDs—unclear this would be desirable anyway: loss of modularity.

The Namespaces in XML specification address this issue by allowing each application to define names in a particular namespace. A single XML document can then include markup from more than one namespace.

168168

Relation to the XML Spec The namespaces specification was released a year after

the XML 1.0 specification. It is formally 100% compatible with the XML

specification—it “merely” puts some restrictions on the document form.

Significant practical issues arise. DTDs—set in stone in the XML/SGML standards—are not an ideal match for namespaces: DTDs can be used with namespaces, but it takes some hacks.• The likely intention was to move away from DTD validation

towards XML Schema validation, which is fully namespace compatible. So far this transition is incomplete?

169169

Qualified Names The simple general idea is that names occurring in

document instances may be prefixed in a way that identifies their name space.

Here is an (imaginary!) example including SVG tags in an XHTML document:<body> <h1>What are circles really like?</h1> This is what a circle really looks like: <object> <svg:svg> <svg:circle cx='150' cy='50' r='25'/> </svg:svg> </object></body>

170170

Name Syntax In the namespaces specification use of the colon, “:”, in

XML names is reserved. Namespaces qualified names have at most one colon; if

there is a colon it separates the prefix from the local part: Prefix:Local-part

In the example above we had qualified SVG namessvg:svgsvg:circle

Here the prefixes are svg and the local parts are svg and circle respectively. HTML names had no prefix, only local parts.

171171

Namespace Names Each namespace (e.g. the XHTML namespace or the

SVG namespace) itself has a “name”.• This is neither the prefix appearing on qualified names, nor

in general a legal XML name! Instead a namespace name is an IRI.

• An IRI is an Internationalized Resource Identifier: it is a generalization of a URI allowing non-ASCII characters.

E.g. the name of the XHTML 1.0 namespace is: http://www.w3.org/1999/xhtml

and the name of the SVG 1.1 namespace is: http://www.w3.org/2000/svg

172172

Use of URIs The use of URIs (or IRIs) for namespace names is quite

confusing. Though namespace names follow the syntax of resource

identifiers, the namespace is not a resource; there need not be any resource at the location—all that is relevant is the sequence of characters in the URI. This string of characters identifies the namespace.

One day this odd situation may be resolved to allow applications to automatically find schema at namespace URIs—see for example www.rddl.org.

For now namespace names and schema locations are two independent IRIs.

173173

Defining Prefixes In principle an instance document can choose any

convenient prefix to use within that document• Though an existing DTD may restrict the choice.

A prefix is declared in an attribute named xmlns:Prefix. The scope is the element the declaration appears on, and any nested elements. Value specified for attribute must be the namespace name. E.g.: <object xmlns:svg='http://www.w3.org/2000/svg' >

<svg:svg> <svg:circle cx='150' cy='50' r='25'/> </svg:svg> </object>

174174

Default Namespaces If a “vocabulary” has been defined in a namespace, a

document instance must acknowledge this, even if prefixes aren’t needed.

Use the attribute xmlns to declare a default namespace. E.g. all XHTML documents should start:

<html xmlns="http://www.w3.org/1999/xhtml" >

<head>

</html>

175175

Scope of Declarations So far as the namespaces specification is concerned:

• Declarations of prefixes and default namespaces can appear on any elements in a document. Scope is the declaration is limited to the element it appears on.

• Different namespaces can be the default in different elements.

• The same prefix could be used to represent different namespaces in different parts of the document, or vice versa (these are bad ideas!)

DTDs may limit these possibilities. Default namespace declarations don’t affect attribute

names. Interpretation of an un-prefixed attribute name is determined by the element it is attached to.• Means attribute is in same namespace as element? No!

176176

Namespaces and DTDs DTDs are not “namespace aware”. So far as DTDs are

concerned a qualified name with a colon is just an atomic XML name.

A partial workaround is to make all names introduced in DTD declarations parameter entity references, then factor out the prefix as a single parameter entity that can be set in a document’s internal DTD subset.

This is ugly, and it doesn’t solve all problems. • E.g. DTDs still won’t recognize an equivalence between

names in the same namespace, if they have the same local name but different prefixes: so far as DTDs are concerned they are different XML names.

177177

Namespaces and XML Schema Later lectures discuss XML Schema—one of the

alternative validation frameworks to DTDs—in detail. You will probably find the issues associated with

namespaces become much more concrete once XML Schema (which are fully namespace-aware) are understood.

178178

Elementary XPath

179179

XML Path Language The XML Path Language, or XPath, is a language for

addressing nodes of an XML document.• Understand “nodes” as in the DOM (although technically there

are minor differences). XPath is an important part of XML Schema, XSLT, and

XPointer, and is used as a query language in some XML databases.

In simple cases an XPath expression looks like a UNIX path name, with nested directory names replaced by nested element names, so for example:• “/” corresponds to the document node.

• Expressions may be absolute (relative to /) or relative to some context node.

180180

Simple Examples The XPath:

/planets

represents a document element called planets. The XPath:

/planets/planet/density

represents the set of all elements named density that are directly nested in any element named planet that is directly nested in a document element planets.

The XPath: /planets/planet/*

represents all elements directly nested in an element planet directly nested in document element planets.

181181

Example

<?xml version="1.0" encoding="UTF-8"?><planets> <planet> <name>Mercury</name> <mass>0.0553</mass> <day units="days">58.65</day> <radius units="miles">1516</radius> <density>0.983</density> </planet> <planet> <name>Venus</name> <mass>0.815</mass> <day units="days">116.75</day> <radius units="miles">3716</radius> <density>0.943</density> </planet> <planet> <name>Earth</name> <mass>1</mass> <day units="days">1</day> <radius units="miles">3960</radius> <density>1</density> </planet></planets>

Document is a simplified version of an example given the in the book Inside XML, by Holzner.

Highlighted node is evaluation of

/planets

182182

Example

<?xml version="1.0" encoding="UTF-8"?><planets> <planet> <name>Mercury</name> <mass>0.0553</mass> <day units="days">58.65</day> <radius units="miles">1516</radius> <density>0.983</density> </planet> <planet> <name>Venus</name> <mass>0.815</mass> <day units="days">116.75</day> <radius units="miles">3716</radius> <density>0.943</density> </planet> <planet> <name>Earth</name> <mass>1</mass> <day units="days">1</day> <radius units="miles">3960</radius> <density>1</density> </planet></planets>

Highlighted nodes are evaluation of

/planets/planet/density

183183

Example

<?xml version="1.0" encoding="UTF-8"?><planets> <planet> <name>Mercury</name> <mass>0.0553</mass> <day units="days">58.65</day> <radius units="miles">1516</radius> <density>0.983</density> </planet> <planet> <name>Venus</name> <mass>0.815</mass> <day units="days">116.75</day> <radius units="miles">3716</radius> <density>0.943</density> </planet> <planet> <name>Earth</name> <mass>1</mass> <day units="days">1</day> <radius units="miles">3960</radius> <density>1</density> </planet></planets>

Highlighted nodes are evaluation of

/planets/planet/*

184184

Types of XPath Expressions In full generality XPath expressions can evaluate to

various kinds of thing (including numbers, Booleans, and strings) but one is usually interested in expressions that evaluate to node-sets.

A node-set is a collection of nodes in an XML document. In general he nodes that can appear in a node-set

include:• element nodes, attribute nodes,

• text nodes (plain text children of some element), root nodes, namespace nodes, processing instruction nodes and comment nodes.

We will only discuss the first two cases.

185185

Location Paths and Location Steps The most important kind of XPath expression is

the location path. In general a location path consists of a series of

location steps separated by the slash “/”.• Compare to a UNIX path, where the individual “step”

is the name of an immediately nested subdirectory or file, or a wildcard (*), or a move up into the parent directory (..), etc.

186186

Steps and Context Nodes While a UNIX path may be relative to some current

directory, an XPath expression is generally evaluated relative to some context node.

Starting from this context node of the path, the location path takes a series of steps in various possible “directions”, e.g. • into the set of child elements of the current element,

• into the set of attributes of the current element,

• into the set of siblings of the current element, etc. This “direction” is called the axis of the step. Each individual step has its own context node, determined by

preceding steps in the path.

187187

Syntax for Location Steps The commonest example of a location step—analogous

to a UNIX directory name—is an XML element name.• This should be the name of an element that is an immediate

child of the context node. Recalling this example:

/planets/planet

it has two steps, planets and planet.• Actually this is an example of what is called abbreviated syntax.

In the full, unabbreviated, syntax, that location path would be:

/child :: planets/child :: planet

child being the name of the axis. In this brief exposition we only cover abbreviated syntax.

188188

Abbreviated Syntax for Steps The location step Name selects some element nodes: the

children of the context node called Name.• “*” selects all children of the context node.

The location step “.” represents the context node. The location step “..” represents the parent node of the

context node. The location step @Name represents an attribute node

—the attribute of the context node named Name.• @* selects all attributes of the context node.

A blank location step effectively selects the context node or any descendant element, e.g. • //Name selects all Name elements in the document.• .//Name selects all Name descendants of the context node.

189189

Example

<?xml version="1.0" encoding="UTF-8"?><planets> <planet> <name>Mercury</name> <mass>0.0553</mass> <day units="days">58.65</day> <radius units="miles">1516</radius> <density>0.983</density> </planet> <planet> <name>Venus</name> <mass>0.815</mass> <day units="days">116.75</day> <radius units="miles">3716</radius> <density>0.943</density> </planet> <planet> <name>Earth</name> <mass>1</mass> <day units="days">1</day> <radius units="miles">3960</radius> <density>1</density> </planet></planets>

Highlighted nodes are evaluation of/planets/planet/radius/@units

190190

Example

<?xml version="1.0" encoding="UTF-8"?><planets> <planet> <name>Mercury</name> <mass>0.0553</mass> <day units="days">58.65</day> <radius units="miles">1516</radius> <density>0.983</density> </planet> <planet> <name>Venus</name> <mass>0.815</mass> <day units="days">116.75</day> <radius units="miles">3716</radius> <density>0.943</density> </planet> <planet> <name>Earth</name> <mass>1</mass> <day units="days">1</day> <radius units="miles">3960</radius> <density>1</density> </planet></planets>

Highlighted nodes are evaluation of

//mass

191191

Example

<?xml version="1.0" encoding="UTF-8"?><planets> <planet> <name>Mercury</name> <mass>0.0553</mass> <day units="days">58.65</day> <radius units="miles">1516</radius> <density>0.983</density> </planet> <planet> <name>Venus</name> <mass>0.815</mass> <day units="days">116.75</day> <radius units="miles">3716</radius> <density>0.943</density> </planet> <planet> <name>Earth</name> <mass>1</mass> <day units="days">1</day> <radius units="miles">3960</radius> <density>1</density> </planet></planets>

If the context node is the second <planet/> element, the Highlighted nodes are evaluation of

*/@units

192192

Unions The operator “|” forms the union of two node sets. For example:

//radius | //density

represents the set of all radius elements and all density elements in the document.

193193

Example

<?xml version="1.0" encoding="UTF-8"?><planets> <planet> <name>Mercury</name> <mass>0.0553</mass> <day units="days">58.65</day> <radius units="miles">1516</radius> <density>0.983</density> </planet> <planet> <name>Venus</name> <mass>0.815</mass> <day units="days">116.75</day> <radius units="miles">3716</radius> <density>0.943</density> </planet> <planet> <name>Earth</name> <mass>1</mass> <day units="days">1</day> <radius units="miles">3960</radius> <density>1</density> </planet></planets>

Highlighted nodes are evaluation of

//radius | //density

194194

XPath and Namespaces If the elements or attributes named an XPath

expression belong to a namespace, then in general those names must have a namespace prefix.• You cannot use a default namespace in an XPath expression.

When XPath expressions are embedded in XML documents (which is normal), the prefixes for the qualified names are scoped in the usual way—by namespace declarations (i.e. xmlns:Prefix attribute specifications) in the XML.

195195

Summary In its full generality, XPath is a fairly complex (but

powerful) language for addressing subsets of the nodes of an XML document.• It incorporates extensive computational and filtering

functionalities that we have not described here. The subset we covered in this lecture is nevertheless

useful in itself.• And happens to include all the XPath needed to understand

XML Schema.