XML Introduction

XML Introduction. Introducing XML XML stands for Extensible Markup Language. A markup language specifies the structure and content of a document. Because

Introducing XML

• XML stands for Extensible Markup Language. A markup language specifies the structure and content of a document.

• Because it is extensible, XML can be used to create a wide variety of document types.

Introducing XML

• XML is a subset of the Standard Generalized Markup Language (SGML), which was introduced in the 1980s. SGML is very complex and can be costly.

• These drawbacks led to the creation of Hypertext Markup Language (HTML), a more easily used markup language. XML can be seen as sitting between SGML and HTML: easier to learn than SGML, but more robust than HTML.

The Limits of HTML

• HTML was designed for formatting text on a Web page. It was not designed for dealing with the content of a Web page. Additional features have been added to HTML, but they do not solve data description or cataloging issues in an HTML document.

• Because HTML is not extensible, it cannot be modified to meet specific needs. Browser developers have added features making HTML more robust, but this has resulted in a confusing mix of different HTML standards.

Introducing XML

• HTML cannot be applied consistently. Different browsers support different standards, so the final document can look different in one browser than in another.

Introduction to XML Markup

• XML document (intro.xml)
  – Marks up message as XML
  – Commonly stored in text files
    • Extension .xml

<?xml version = "1.0"?>

<!-- Fig. 5.1 : intro.xml -->
<!-- Simple introduction to XML markup -->

<myMessage>
   <message>Welcome to XML!</message>
</myMessage>

• The document begins with a declaration that specifies XML version 1.0.

• Element message is a child element of root element myMessage.
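
The parent/child relationship in intro.xml can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name describe is an assumption made for illustration and is not part of the slides.

```python
import xml.etree.ElementTree as ET

# The intro.xml document from the slide, without the display line numbers.
INTRO_XML = """<?xml version = "1.0"?>
<!-- Fig. 5.1 : intro.xml -->
<!-- Simple introduction to XML markup -->
<myMessage>
   <message>Welcome to XML!</message>
</myMessage>
"""

def describe(xml_text):
    """Return the root tag and a list of (child tag, child text) pairs."""
    root = ET.fromstring(xml_text)
    children = [(child.tag, child.text) for child in root]
    return root.tag, children

root_tag, children = describe(INTRO_XML)
print(root_tag)   # myMessage
print(children)   # [('message', 'Welcome to XML!')]
```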

Introduction to XML Markup (cont.)

• XML documents
  – Must contain exactly one root element
    • Attempting to create more than one root element is erroneous
  – Elements must be nested properly
    • Incorrect: <x><y>hello</x></y>
    • Correct: <x><y>hello</y></x>
  – Must be well-formed
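
These rules can be tried out directly. The sketch below assumes Python's standard xml.etree.ElementTree parser, which raises ParseError for text that is not well-formed; the function name is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<x><y>hello</y></x>"))   # True: properly nested
print(is_well_formed("<x><y>hello</x></y>"))   # False: tags overlap
print(is_well_formed("<a/><b/>"))              # False: two root elements
```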

XML Parsers

• An XML processor (also called an XML parser) evaluates the document to make sure it conforms to all XML specifications for structure and syntax.

• XML parsers are strict. It is this rigidity built into XML that ensures XML code accepted by the parser will work the same everywhere.

XML Architecture

Structure of a Well-formed XML Document

<?xml version="1.0" ?>
<!DOCTYPE publications [
  <!ELEMENT publications (journals, conferences, books)>
  ...
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT issue (#PCDATA)>
  <!ATTLIST issue pages CDATA #REQUIRED>
  <!ENTITY JSI " <journal>Journal of Systems Integration</journal>
                 <publisher>Kluwer Academic Publishers</publisher>">
]>
<publications>
  <journals>
    ...
    &JSI;
    ...
</publications>

XML Parsers

• Microsoft’s parser is called MSXML and is built directly into IE versions 5.0 and above.

• Netscape developed its own parser as part of the Mozilla project; it is built into Netscape version 6.0 and above.

Parsers and Well-formed XML Documents (cont.)

• XML parsers support
  – Document Object Model (DOM)
    • Builds tree structure containing document data in memory
  – Simple API for XML (SAX)
    • Generates events when tags, comments, etc. are encountered
      – (Events are notifications to the application)
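
To make the event model concrete, here is a small sketch that uses Python's standard xml.sax module. The EventLogger class is an assumption for illustration; it records only element and text notifications, since the basic ContentHandler interface does not report comments.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each SAX notification as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # The parser may deliver text in several chunks; skip whitespace-only ones.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventLogger()
xml.sax.parseString(
    b"<myMessage><message>Welcome to XML!</message></myMessage>", handler
)
for event in handler.events:
    print(event)
```

Unlike a DOM parser, nothing is kept in memory here except what the handler chooses to store, which is why SAX suits large documents.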

Parsing an XML Document with MSXML

• XML document
  – Contains data
  – Does not contain formatting information
  – Load XML document into Internet Explorer 5.0
    • Document is parsed by msxml
    • Places plus (+) or minus (-) signs next to container elements
      – Plus sign indicates that all child elements are hidden
      – Clicking plus sign expands container element
        » Displays children
      – Minus sign indicates that all child elements are visible
      – Clicking minus sign collapses container element
        » Hides children
    • Error generated if document is not well formed

XML document shown in IE6.

Character Set

• XML documents may contain
  – Carriage returns
  – Line feeds
  – Unicode characters
    • Enables computers to process characters for several languages

Characters vs. Markup

• XML must differentiate between
  – Markup text
    • Enclosed in angle brackets (< and >)
      – e.g., child elements
  – Character data
    • Text between start tag and end tag
      – Welcome to XML!
  – Elements versus attributes

White Space, Entity References and Built-in Entities

• Whitespace characters
  – Spaces, tabs, line feeds and carriage returns
    • Significant (preserved by application)
    • Insignificant (not preserved by application)
      – Normalization
        » Whitespace collapsed into single whitespace character
        » Sometimes whitespace removed entirely

<markup>This    is    character    data</markup>

after normalization, becomes

<markup>This is character data</markup>
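
Normalization as described above can be sketched in a few lines of Python. This is an illustrative helper, not the behavior of any particular parser: it collapses runs of the four XML whitespace characters into one space and trims the ends.

```python
import re

# The four whitespace characters XML recognizes: space, tab, CR, LF.
XML_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")

def normalize_whitespace(text):
    """Collapse each whitespace run to a single space and trim both ends."""
    return XML_WHITESPACE_RUN.sub(" ", text).strip(" ")

raw = "This   is \t character\n   data"
print(normalize_whitespace(raw))   # This is character data
```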

White Space, Entity References and Built-in Entities (cont.)

• XML-reserved characters
  – Ampersand (&)
  – Left-angle bracket (<)
  – Right-angle bracket (>)
  – Apostrophe (')
  – Double quote (")

• Entity references
  – Allow XML-reserved characters to be used in character data
    • Begin with ampersand (&) and end with semicolon (;)
  – Prevent the parser from misinterpreting character data as markup

White Space, Entity References and Built-in Entities (cont.)

• Built-in entities
  – Ampersand (&amp;)
  – Left-angle bracket (&lt;)
  – Right-angle bracket (&gt;)
  – Apostrophe (&apos;)
  – Quotation mark (&quot;)
  – Mark up characters "<>&" in element message

<message>&lt;&gt;&amp;</message>
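
When XML is generated by a program, the replacement of reserved characters is usually delegated to a library. A minimal sketch with Python's standard xml.sax.saxutils follows; the two wrapper functions and the mapping tables are assumptions added for illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; quotes need an explicit mapping.
EXTRA = {"'": "&apos;", '"': "&quot;"}
REVERSE = {"&apos;": "'", "&quot;": '"'}

def to_character_data(text):
    """Replace the five reserved characters with built-in entity references."""
    return escape(text, EXTRA)

def from_character_data(text):
    """Turn built-in entity references back into the literal characters."""
    return unescape(text, REVERSE)

print(to_character_data("<>&"))   # &lt;&gt;&amp;
```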

Document Object Model (DOM)

• XML Document Object Model (DOM)
  – Builds tree structure in memory for XML documents
  – DOM-based parsers parse these structures
    • Exist in several languages (Java, C, C++, Python, Perl, C#, VB.NET, VB, etc.)

Document Object Model (DOM)

• DOM tree
  – Each node represents an element, attribute, etc.

<?xml version = "1.0"?>
<message from = "Paul" to = "Tim">
   <body>Hi, Tim!</body>
</message>

• Node created for element message
  – Element message has child node for body element
  – Element body has child node for text "Hi, Tim!"
  – Attributes from and to also have nodes in tree
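
The node structure just described can be inspected with a DOM implementation. The sketch below uses Python's standard xml.dom.minidom on a copy of the message example, written without whitespace between the tags so that the first child of message is the body element.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version = "1.0"?>'
    '<message from = "Paul" to = "Tim"><body>Hi, Tim!</body></message>'
)

message = doc.documentElement   # element node for message
body = message.firstChild       # element node for body
text = body.firstChild          # text node holding the character data

print(message.nodeName, body.nodeName, text.data)
print(message.getAttribute("from"), message.getAttribute("to"))

# Attributes are nodes as well, reachable through the element.
from_node = message.getAttributeNode("from")
print(from_node.nodeType == from_node.ATTRIBUTE_NODE)
```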

DOM Implementations

• DOM-based parsers
  – Microsoft’s msxml
  – Microsoft .NET System.Xml namespace
  – Sun Microsystems’ JAXP

Creating Nodes

• Create XML document at run time
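
As one possible illustration of building a document at run time, the sketch below uses Python's standard xml.dom.minidom. The function build_message and its parameters are hypothetical names chosen for this example.

```python
from xml.dom import minidom

def build_message(sender, recipient, body_text):
    """Build <message from=... to=...><body>...</body></message> in memory."""
    doc = minidom.Document()
    message = doc.createElement("message")
    message.setAttribute("from", sender)
    message.setAttribute("to", recipient)
    doc.appendChild(message)

    body = doc.createElement("body")
    body.appendChild(doc.createTextNode(body_text))
    message.appendChild(body)
    return doc

doc = build_message("Paul", "Tim", "Hi, Tim!")
print(doc.documentElement.toxml())
# <message from="Paul" to="Tim"><body>Hi, Tim!</body></message>
```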

Traversing the DOM

• Use DOM to traverse XML document
  – Output element nodes
  – Output attribute nodes
  – Output text nodes
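
A traversal of this kind might look like the following sketch, again assuming Python's xml.dom.minidom. The recursive helper walk is an illustrative name; it collects one line per element, attribute and non-blank text node.

```python
from xml.dom import minidom, Node

def walk(node, depth=0, out=None):
    """Depth-first traversal that records element, attribute and text nodes."""
    if out is None:
        out = []
    indent = "  " * depth
    if node.nodeType == Node.ELEMENT_NODE:
        out.append(f"{indent}element: {node.nodeName}")
        attrs = node.attributes
        for i in range(attrs.length):
            attr = attrs.item(i)
            out.append(f"{indent}  attribute: {attr.name} = {attr.value}")
    elif node.nodeType == Node.TEXT_NODE and node.data.strip():
        out.append(f"{indent}text: {node.data.strip()}")
    for child in node.childNodes:
        walk(child, depth + 1, out)
    return out

doc = minidom.parseString(
    '<message from="Paul" to="Tim"><body>Hi, Tim!</body></message>'
)
lines = walk(doc.documentElement)
print("\n".join(lines))
```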

DOM Components

• Manipulate XML document

XPATH

• XML Path Language (XPath)
  – Syntax for locating information in XML document
    • e.g., attribute values
  – String-based language of expressions
    • Not structural language like XML
  – Used by other XML technologies
    • XSLT

XPATH - Nodes

• XML document
  – Tree structure with nodes
  – Each node represents part of XML document

• Seven types
  – Root
  – Element
  – Attribute
  – Text
  – Comment
  – Processing instruction
  – Namespace

• Attributes and namespaces are not children of their parent node
  – They describe their parent node

XPath node types

| Node type | string-value | expanded-name | Description |
|---|---|---|---|
| root | Determined by concatenating the string-values of all text-node descendants in document order. | None. | Represents the root of an XML document. This node exists only at the top of the tree and may contain element, comment or processing-instruction children. |
| element | Determined by concatenating the string-values of all text-node descendants in document order. | The element tag, including the namespace prefix (if applicable). | Represents an XML element and may contain element, text, comment or processing-instruction children. |
| attribute | The normalized value of the attribute. | The name of the attribute, including the namespace prefix (if applicable). | Represents an attribute of an element. |
| text | The character data contained in the text node. | None. | Represents the character data content of an element. |
| comment | The content of the comment (not including <!-- and -->). | None. | Represents an XML comment. |
| processing instruction | The part of the processing instruction that follows the target and any whitespace. | The target of the processing instruction. | Represents an XML processing instruction. |
| namespace | The URI of the namespace. | The namespace prefix. | Represents an XML namespace. |

Location Paths

• Location path
  – Expression specifying how to navigate XPath tree
  – Composed of location steps
    • Each location step composed of
      – Axis
      – Node test
      – Predicate

Axes

• XPath searches are made relative to context node

• Axis
  – Indicates which nodes are included in search
    • Relative to context node
  – Dictates node ordering in set
    • Forward axes select nodes that follow context node
    • Reverse axes select nodes that precede context node

Node Tests

• Node tests
  – Refine set of nodes selected by axis
    • Rely upon axis’ principal node type
      – Corresponds to type of node axis can select

Node-set Operators and Functions (cont.)

• Location-path expressions
  – Combine node-set operators and functions
    • Select all head and body children element nodes
        head | body
    • Select last title element node in head element node
        head/title[ last() ]
    • Select third book element
        book[ position() = 3 ]
      – Or alternatively
        book[ 3 ]
    • Return total number of element-node children
        count( * )
    • Select all book element nodes in document
        //book
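
Several of these expressions can be tried with Python's standard xml.etree.ElementTree, which implements only a subset of XPath. In the sketch below, the sample document is invented for illustration, and the union operator and count() are emulated in Python because ElementTree does not provide them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; html is the context node for every query.
html = ET.fromstring(
    "<html><head><title>Draft</title><title>Final</title></head>"
    "<body><book>A</book><book>B</book><book>C</book></body></html>"
)

# head | body : no union operator in ElementTree, so concatenate two results.
head_and_body = html.findall("head") + html.findall("body")

# head/title[ last() ]
last_title = html.find("head/title[last()]")

# book[ position() = 3 ], written in its short form book[3]
third_book = html.find("body/book[3]")

# count( * ) : number of element-node children of the context node
child_count = len(list(html))

# //book : every book element below the context node
all_books = html.findall(".//book")

print([e.tag for e in head_and_body], last_title.text,
      third_book.text, child_count, len(all_books))
```

A full XPath 1.0 engine would evaluate the original expressions as written; the substitutions here only work around the limits of the standard library.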

Sample Data for Queries

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

Data Model for XPath

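[Figure: tree for the sample data. The root has a single child, the root element bib. bib has two book children; the first book has the children publisher (Addison-Wesley), author (Serge Abiteboul), and so on.]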

XPath: Simple Expressions

/bib/book/year

Result: <year> 1995 </year>
        <year> 1998 </year>

/bib/paper/year

Result: empty (there were no papers)
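
The two queries can be reproduced on the sample data with Python's standard xml.etree.ElementTree. This is a sketch under one assumption worth noting: ElementTree evaluates paths relative to the element it is called on, so with bib as the starting element the path /bib/book/year is written book/year.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>"""

bib = ET.fromstring(BIB)

# /bib/book/year, relative to the root element bib
years = [year.text.strip() for year in bib.findall("book/year")]

# /bib/paper/year, relative to the root element bib
papers = bib.findall("paper/year")

print(years)    # ['1995', '1998']
print(papers)   # []
```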

XML Document Type Definitions

• Declarations
  – Definition of elements and attributes

• Content model (regular expressions)
  – Association of attributes with elements
  – Association of elements with other elements
  – Order and cardinality constraints

Element Declarations

• Basic form
  – <!ELEMENT elementname (contentmodel)>
  – Content model determines which content is allowed in the element
  – Given by a regular expression

• Atomic contents
  – Element content
      <!ELEMENT example ( a )>
  – Text content
      <!ELEMENT example (#PCDATA)>
  – Empty element
      <!ELEMENT example EMPTY>
  – Arbitrary content
      <!ELEMENT example ANY>

Element Declarations

• Sequence
    <!ELEMENT example ( a, b )>
• Alternative
    <!ELEMENT example ( a | b )>
• Optional (zero or one)
    <!ELEMENT example ( a )?>
• Optional and repeatable (zero or more)
    <!ELEMENT example ( a )*>
• Required and repeatable (one or more)
    <!ELEMENT example ( a )+>
• Mixed content
    <!ELEMENT example (#PCDATA | a)*>
• Content model can be grouped by parentheses
• Cyclic element containment is allowed
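
Because a content model is a regular expression over child element names, its effect can be imitated with an ordinary regex engine. The sketch below is an illustration only, not a DTD validator: the patterns are translated by hand, each child name is encoded as "name,", and character data is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models; every child element becomes "name," in the input.
CONTENT_MODELS = {
    "( a, b )": r"a,b,",
    "( a | b )": r"(a,|b,)",
    "( a )?": r"(a,)?",
    "( a )*": r"(a,)*",
    "( a )+": r"(a,)+",
}

def children_match(xml_text, model):
    """Check the sequence of child element names against a content model."""
    element = ET.fromstring(xml_text)
    names = "".join(child.tag + "," for child in element)
    return re.fullmatch(CONTENT_MODELS[model], names) is not None

print(children_match("<example><a/><b/></example>", "( a, b )"))   # True
print(children_match("<example><b/><a/></example>", "( a, b )"))   # False
print(children_match("<example/>", "( a )+"))                      # False
print(children_match("<example><a/><a/></example>", "( a )*"))     # True
```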

Attribute Declarations

• Each element can be associated with an arbitrary number of attributes

• Basic form
    <!ATTLIST Elementname
        Attributename Type Default
        Attributename Type Default
        ... >

• Example:

Document Type Definition
<!ELEMENT shipTo (#PCDATA)>
<!ATTLIST shipTo
    country CDATA #REQUIRED
    state   CDATA #IMPLIED
    version CDATA #FIXED "1.0"
    payment (cash|creditCard) "cash">

Document
<shipTo country="Switzerland" version="1.0" payment="creditCard"> … </shipTo>

Attribute Declarations - Types

• CDATA
  – String
  – <!ATTLIST example HREF CDATA

• Enumeration
  – Token from given set of values, Default
  – <!ATTLIST example selection (

• Possible defaults
  – Required attribute: #REQUIRED
  – Optional attribute: #IMPLIED
  – Fixed attribute: #FIXED
  – Default for enumeration: "value"

• Other attribute types: ID, IDREF, IDREFS, ENTITY, ENTITIES, NOTATION, NMTOKEN, NMTOKENS

ID/IDREF

• ID, IDREF
  – ID is a unique identifier within the document
  – IDREF is a reference to an ID
  – Referential integrity is checked by a validating parser
  – IDs are determined by the application
  – <!ATTLIST example
        identity  ID    #IMPLIED
        reference IDREF #IMPLIED>
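
The check a validating parser performs for ID and IDREF can be approximated by hand. The sketch below assumes the attribute names identity and reference from the ATTLIST above and uses Python's non-validating ElementTree parser, so the uniqueness and reference checks are written out explicitly; check_references is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def check_references(xml_text, id_attr="identity", ref_attr="reference"):
    """Return (duplicate IDs, dangling references) found in the document."""
    root = ET.fromstring(xml_text)
    seen, duplicates = set(), set()
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.add(value)   # an ID must be unique in the document
            seen.add(value)
    dangling = {
        element.get(ref_attr)
        for element in root.iter()
        if element.get(ref_attr) is not None and element.get(ref_attr) not in seen
    }
    return duplicates, dangling

doc = """<root>
  <example identity="a1"/>
  <example identity="a2" reference="a1"/>
  <example reference="a9"/>
</root>"""
print(check_references(doc))   # (set(), {'a9'})
```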

Inclusion of XML Document Type Definitions

• External DTD declaration

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN"
    "http://www.test.org/test.dtd">
<test> "test" is a document element </test>

• Internal DTD declaration

<!DOCTYPE test [ <!ELEMENT test EMPTY> ]>
<test/>

• Mixed usage

<!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [
    <!ENTITY hello "hello world">
]>
<test>&hello;</test>
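
Entities declared in the internal subset are expanded by the parser before the application sees the text. The sketch below shows this with Python's standard xml.etree.ElementTree. It is a variation on the examples above that keeps only an internal subset, so that no external DTD has to be fetched.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE test [
  <!ELEMENT test (#PCDATA)>
  <!ENTITY hello "hello world">
]>
<test>&hello;</test>"""

root = ET.fromstring(DOC)
print(root.text)   # hello world
```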

Working with Namespaces

• Name collision occurs when elements from two or more documents share the same name.

• Name collision isn’t a problem if you are not concerned with validation. The document content only needs to be well-formed.

• However, name collision will keep a document from being validated.

Name Collision

This figure shows two documents each with a Name element

Using Namespaces to Avoid Name Collision

This figure shows how to use a namespace to avoid collision

Declaring a Namespace

• A namespace is a defined collection of element and attribute names.

• Names that belong to the same namespace must be unique. Elements can share the same name if they reside in different namespaces.

• Namespaces must be declared before they can be used.

Declaring a Namespace

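• A namespace can be declared in the prolog or as an element attribute. The syntax to declare a namespace in the prolog is:

<?xml:namespace ns="URI" prefix="prefix"?>

• Where URI is a Uniform Resource Identifier that assigns a unique name to the namespace, and prefix is a string of letters that associates each element or attribute in the document with the declared namespace.

• Note that the prolog form comes from an early draft of the namespace specification. The W3C Namespaces in XML Recommendation declares namespaces only through xmlns attributes on elements.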

Declaring a Namespace

• For example,

<?xml:namespace ns="http://uhosp/patients/ns" prefix="pat"?>

• Declares a namespace with the prefix “pat” and the URI http://uhosp/patients/ns.

• The URI is not necessarily a Web address. A URI identifies a physical or an abstract resource.

<?xml version = "1.0"?>

<!-- Fig. 5.9 : defaultnamespace.xml -->
<!-- Using Default Namespaces -->

<directory xmlns = "urn:deitel:textInfo"
           xmlns:image = "urn:deitel:imageInfo">

   <file filename = "book.xml">
      <description>A book list</description>
   </file>

   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>

</directory>
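
A namespace-aware parser keeps the two file elements apart even though their local names collide. The sketch below uses Python's standard xml.etree.ElementTree on the directory document; the prefix text in the lookup table is chosen locally for the queries and does not appear in the document itself.

```python
import xml.etree.ElementTree as ET

DIRECTORY = """<directory xmlns = "urn:deitel:textInfo"
           xmlns:image = "urn:deitel:imageInfo">
   <file filename = "book.xml">
      <description>A book list</description>
   </file>
   <image:file filename = "funny.jpg">
      <image:description>A funny picture</image:description>
      <image:size width = "200" height = "100"/>
   </image:file>
</directory>"""

root = ET.fromstring(DIRECTORY)

# Prefixes used only inside the queries below; the URIs identify the namespaces.
ns = {"text": "urn:deitel:textInfo", "image": "urn:deitel:imageInfo"}

text_files = root.findall("text:file", ns)
image_files = root.findall("image:file", ns)

print(root.tag)                                   # {urn:deitel:textInfo}directory
print([f.get("filename") for f in text_files])    # ['book.xml']
print([f.get("filename") for f in image_files])   # ['funny.jpg']
```

ElementTree reports each name in the form {URI}local-name, which shows that the namespace URI, not the prefix, is what distinguishes the elements.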

<part-catalog xmlns:nw="http://www.nutware.com/"
              xmlns="http://www.bobco.com/">
  <nw:entry nw:number="1327">
    <nw:description>torque-balancing hexnut</nw:description>
  </nw:entry>
  <part id="555">
    <name>type 4 wingnut</name>
  </part>
</part-catalog>

Schemas

• A schema is an XML document that defines the content and structure of one or more XML documents.

• To avoid confusion, the XML document containing the content is called the instance document.

• It represents a specific instance of the structure defined in the schema.

Comparing Schemas and DTDs

This figure compares schemas and DTDs

Schema Dialects

• There is no single schema form.

• Several schema “dialects” have been developed in the XML language.

• Support for a particular schema depends on the XML parser being used for validation.

Starting a Schema File

• A schema is always placed in a separate XML document that is referenced by the instance document.

Schema Types

• XML Schema recognizes two categories of element types: complex and simple.

• A complex type element has one or more attributes, or is the parent to one or more child elements.

• A simple type element contains only character data and has no attributes.

Schema Types

This figure shows types of elements

Understanding Data Types

• XML Schema supports two data types: built-in and user-derived.

• A built-in data type is part of the XML Schema specifications and is available to all XML Schema authors.

• A user-derived data type is created by the XML Schema author for specific data values in the instance document.

Understanding Data Types

• A primitive data type, also called a base type, is one of 19 fundamental data types not defined in terms of other types.

• A derived data type is one of 25 data types that the XML Schema developers created based on the 19 primitive types.

Example Document – Sequence Constructor

XML Document

<USAddress country="US">
   <name>Alice Smith</name>
   <street>123 Maple Street</street>
   <city>Mill Valley</city>
   <state>CA</state>
   <zip>90952</zip>
</USAddress>

DTD

<!ELEMENT USAddress (name, street, city, state, zip)>
<!ATTLIST USAddress country CDATA #FIXED "US">
<!ELEMENT name (#PCDATA)>
etc.

Example Document – Sequence Constructor

XML Schema

<xsd:complexType name="USAddress">
   <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city" type="xsd:string"/>
      <xsd:element name="state" type="xsd:string"/>
      <xsd:element name="zip" type="xsd:decimal"/>
   </xsd:sequence>
   <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>

Anonymous Types and User-Defined Simple Types

<xsd:complexType name="Items">
   <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
         <xsd:complexType>
            <xsd:sequence>
               <xsd:element name="productName" type="xsd:string"/>
               <xsd:element name="quantity">
                  <xsd:simpleType>
                     <xsd:restriction base="xsd:positiveInteger">
                        <xsd:maxExclusive value="100"/>
                     </xsd:restriction>
                  </xsd:simpleType>
               </xsd:element>
               <xsd:element name="USPrice" type="xsd:decimal"/>
               <xsd:element ref="comment" minOccurs="0"/>
               <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
            </xsd:sequence>
            <xsd:attribute name="partNum" type="SKU"/>
         </xsd:complexType>
      </xsd:element>
   </xsd:sequence>
</xsd:complexType>