© Michael Sonntag 2004
Basics: XML, Namespaces, DTD
XML Techniques for E-Commerce, Budapest 2004
Mag. iur. Dr. techn. Michael Sonntag
Institute for Information Processing and Microprocessor Technology (FIM)
Johannes Kepler University Linz, Austria
E-Mail: [email protected]
http://www.fim.uni-linz.ac.at/staff/sonntag.htm
Questions? Please ask immediately!
Michael Sonntag 3XML Techniques for E-Commerce: Basics
Content
Introduction
  XML: Structure and principles
  The problems of XML
XML Structure
  Elements
  Attributes
  Other concepts
DTDs
  What is it and why is it not enough?
  Defining content with DTD
Short but important: Namespaces
Introduction
What is this record?
1000101101101101011001000110101100101100111111000110110010000001000011101101110100110101010011010100111010010011010110011011101101111010101101101000000111001111
This is an order from a customer for a new laptop and an external FireWire hard disk.
Isn't this obvious???
Introduction
1000101101101101011001000110101100101100111111000110110010000001000011101101110100110101010011010100111010010011010110011011101101111010101101101000000111001111
[Figure: the bit string above with its fields labelled: order number, ordering date, customer name, items (laptop, FireWire hard disk; 1 piece), delivery date]
Usage by recipient:
  if ((orderBytes[15] & 0x3f) == 0x17) ...
  String customerName = extractBytes(orderBytes, 4, 20);
  ...
Introduction
This is NOT a very good design!
  Where is the description of the record?
  How to write the parser to extract all the data?
  What if we want to add a "comment" field?
  ...
Better:
<Order>
  <OrderNr>139</OrderNr>
  <OrderingDate>11.11.2003</OrderingDate>
  <CustomerName>Michael Sonntag</CustomerName>
  <Items>
    <Item count="1" Nr="0815">Laptop</Item>
    <Item count="1" Nr="4711">Firewire harddisk (External)</Item>
  </Items>
  <DeliveryDate>2.1.2004, 10:30</DeliveryDate>
</Order>
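A minimal sketch of how a recipient might read this order with a generic XML parser instead of hand-written byte offsets. It assumes Python's standard xml.etree.ElementTree module; the variable names and the optional "Comment" element are illustrative assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

ORDER_XML = """<Order>
  <OrderNr>139</OrderNr>
  <OrderingDate>11.11.2003</OrderingDate>
  <CustomerName>Michael Sonntag</CustomerName>
  <Items>
    <Item count="1" Nr="0815">Laptop</Item>
    <Item count="1" Nr="4711">Firewire harddisk (External)</Item>
  </Items>
  <DeliveryDate>2.1.2004, 10:30</DeliveryDate>
</Order>"""

order = ET.fromstring(ORDER_XML)

# Fields are addressed by name, not by byte position.
customer_name = order.findtext("CustomerName")
items = [(item.get("Nr"), int(item.get("count")), item.text)
         for item in order.iter("Item")]

# A later "Comment" element (hypothetical) would not break this reader;
# if it is absent we simply get the default.
comment = order.findtext("Comment", default="")
```

The same generic parser works for any other XML document; only the element names looked up change.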
Reasons & Goals for XML (1)
Format for storage of data
  Independent from presentation (XSLT for this)
Extensibility
  To accommodate all the different needs without formal process
Exact definition
  Strict set of rules without alternatives for the same, but still complete
  Possibility to check correctness on different levels
    » E. g. without knowing about the content
Platform independence
  Very simple basic character set (others optional; see below)
  Everything is text (no binary data which can lead to problems)
Full character set
  XML supports Unicode
Reasons & Goals for XML (2)
Open definition
  Not proprietary (would allow revocation/changes at any time)
  Can be standardized
Simplicity
  Should be easy to understand
    » The document can partly be used even without knowing the definition
    » Human readable: No special tools needed for basic information
  Easier to implement
Application independence
  Not only for webpages but also for any application
Terseness unimportant
  Rely on text compression programs for this
Reasons & Goals for XML (3)
History: Binary formats
  Pro: Very small, rather simple to create
  Con: Unintelligible without the format description, complicated to parse, special parser for every message/format, etc., works for only one type of computer (or only with large difficulties)
XML: Textual format
  ONE parser for all documents
  Description of structure can be contained
    » Or can be easily exchanged
  Automatic checking whether document conforms to structure
  Human readable, easily understandable
  Completely portable (systems, character sets, etc.)
XML vs. HTML
HTML: Describes the visual representation
XML: Describes the structure of the data
Example: A poem
  HTML: Linebreak defines when a new line on the screen starts
    » Might be off if the line is long and the screen small
    What's in a name? that which we call a rose<br>
    By any other name would smell as sweet;<br>
  XML: Linebreak defines when one line of the poem ends
    » Whether a single line of the poem is printed on one or two lines, the second perhaps indented, … is NOT specified!
    <line>What's in a name? that which we call a rose</line>
    <line>By any other name would smell as sweet;</line>
  http://the-tech.mit.edu/Shakespeare/romeo_juliet/romeo_juliet.2.2.html
What XML is not
No programming language
  XML is for data; programs can be stored as data, but not executed
No successor/replacement of HTML
  XML + ..... (e. g. stylesheets) could replace HTML
  This is often, but not always a good idea!
No database
  File format: A lot of data can be stored, but there is e. g. no (efficient) query language
    » XPath and XQuery work, but currently performance is nowhere near SQL
  Efficiency as a database is very bad (slow, no transactions, ...)
Not reserved for special applications ("Only" Web, EDI, ...)
  Universal, can be used everywhere
No "jack of all trades"
  Suitable for many applications, but obviously not for all!
Of the origin of XML (1)
(Grand)parent: SGML (ISO 8879; 15.10.1986)
  Rather complicated; easy implementation was NOT a design goal!
Initial idea: SGML Editorial Review Board (1996)
  Participation from the SGML Working Group
  All in the scope of the W3C
Name changed to XML Working Group
  Participation from XML Special Interest Group
Now: XML Core Working Group
First Version: 10.2.1998 (W3C Recommendation)
  Extensible Markup Language (XML) 1.0
Second Version: 6.10.2000 (W3C Recommendation)
  Extensible Markup Language (XML) 1.0 (Second Edition)
    » Not a new version, just includes all the errata since 1998!
Timeline: GML (1969), SGML (1986), HTML (1993), XML (1998)
Of the origin of XML (2)
Third Version: 4.2.2004 (W3C Recommendation)
  Extensible Markup Language (XML) 1.0 (Third Edition)
    » Not a new version, just includes all the errata since 2000!
New Version: 4.2.2004 (W3C Recommendation)
  Extensible Markup Language (XML) 1.1
  Official encouragement to use 1.0, if new features are not required!
  Changes/new features:
    » New Unicode characters can now also be used in names
      – In content text already possible!
    » Names are more "loose": everything not forbidden is allowed
      – 1.0: Everything not allowed is forbidden
    » Additional line termination characters (important for mainframes only)
      – XML files are then plain text (instead of binaries) also on these computers
    » Normalization: Allows binary comparison even for Unicode characters
      – Uses the "Unicode Normalization Form C"
Some XML technologies
[Diagram: XML in the centre, surrounded by related technologies: HTML, XML FO, XML Schema, XML Namespace, XSLT, XPath, Java, and ebXML, SOAP, Security, Metadata, ...]
Some XML technologies
[Diagram as on the previous slide, with the technologies additionally marked as either "vital" or "optional"]
Basic XML
Structure of XML: Elements
A "tag" is a name within angle brackets ("<", ">")
Each "element" consists of a start and an end tag
  Empty elements are "fused" together (modified end tag alone)
  Between start and end tag there may be some content
  Start or empty tags may contain attributes (see later)
Restrictions for tag names:
  Case-sensitive (Unicode character number; glyphs don't matter)
  May not start with xml (or XML, xML, xMl, …)
  Name must start with letter, "_" or ":"
  Within a name: Letter, digit, ".", "-", "_", ":" + some other Unicode chars
    » If namespaces are used, ":" is NOT allowed anymore!
Examples:
  <address> … </address>; <surname> … </surname>
  <address/>
Structure of XML: Well-formedness
Name of start tag must exactly match name of end tag
No "interleaving"
  <a>…<b>…</a>…</b>: INVALID!
  <a>…<b>…</b>…</a>: VALID!
  <a>…<b/>…</a>: VALID!
  <a>…</a>…<b>…</b>: VALID! (as siblings inside an enclosing element; not as a complete document)
At the top level there may be only a single element
  The "document element" (its tag name is irrelevant, however!)
Any attribute may occur only once in a single tag
+ Several rules for entities and entity references
  See specification for details!
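The rules above can be tried out with any XML parser, since every parser must reject documents that are not well-formed. The following sketch assumes Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

results = {
    "interleaved": is_well_formed("<a>x<b>y</a>z</b>"),
    "nested": is_well_formed("<a>x<b>y</b>z</a>"),
    "empty_child": is_well_formed("<a>x<b/>z</a>"),
    # Two elements at the top level: not a complete document ...
    "two_top_level": is_well_formed("<a>x</a><b>y</b>"),
    # ... but fine as siblings inside a single document element.
    "siblings_in_root": is_well_formed("<r><a>x</a><b>y</b></r>"),
    "duplicate_attribute": is_well_formed('<a id="1" id="2"/>'),
}
```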
Structure of XML: Attributes
Attribute: name "=" value
Value MUST be quoted
  Contrast to HTML!
  Quotes either single or double
    » "…" or '…'
Restrictions:
  No attributes without a value allowed
    » Value can be the empty string
  Must be in start tag or empty tag
  Order within the element is unimportant
Examples:
  <ring OwnerID="#1">, <Time daylight="">
  <cake type="honey" expires="11.11.2002">
General structure / Data model
Hierarchical ordering of elements in a tree
  Only a single root node (tree, not a forest)
Attributes for each element possible
  Text cannot contain attributes!
Elements and attributes are both "nodes"
Each element may possess (OR)
  (Child) elements
  Textual content
Good design: Child elements XOR text!
  (One or the other, but not both; mixing both is bad!)
Structure of XML: General layout
There may be an XML declaration at the start of the file
  Example: <?xml version="1.0"?>
May be followed by a reference to a DTD (see later)
  Example: <!DOCTYPE order SYSTEM "Frying_Pan_Orders.dtd">
After that the single document element follows
At the end only PIs, comments and whitespace may follow
Examples (each a complete XML file):
  <?xml version="1.0"?> <location>The Waters</location>
  <CompanyCount>12</CompanyCount>
Structure of XML: Example
<?xml version="1.0" encoding="UTF-8"?>
<message>
  <Recipient>
    <email>[email protected]</email>
  </Recipient>
  <Subject>Ãœbungsabgabe</Subject>
  <Sender>
    <email>[email protected]</email>
  </Sender>
  <CC>
    <email>[email protected]</email>
  </CC>
  <BCC/>
  <Body>Bis wann ist die neue Ãœbung abzugeben?</Body>
</message>
Message.xml
Unicode encoding (UTF-8): ASCII characters take 8 bits, special characters such as Ü take 16 bits
  Ãœ = Ü (the two UTF-8 bytes of Ü as shown by an editor that is not Unicode-aware)
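Why "Ü" shows up as "Ãœ" can be reproduced in a few lines. This sketch assumes the viewing program interprets the UTF-8 bytes as Windows-1252 (cp1252), which is one plausible explanation for the display in the example; the variable names are illustrative.

```python
text = "Übungsabgabe"

# UTF-8 stores ASCII letters in one byte each and "Ü" in two bytes.
utf8_bytes = text.encode("utf-8")

# A program that wrongly assumes a one-byte-per-character encoding
# shows each of the two bytes of "Ü" as a separate character.
misread = utf8_bytes.decode("cp1252")

# Nothing is lost: decoding with the right encoding restores the text.
restored = misread.encode("cp1252").decode("utf-8")
```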
Structure of XML: Example
Create a well-formed XML file according to the following specification
  The file should contain your contact information
  Your student number as well as other study-related information should be contained
Important: Think about the structure before writing!
  How should it be split up?
    » E. g. street + house number together or two separate elements?
  What is an element, what an attribute?
    » Is the ZIP code an attribute of the city or a separate element?
  Which order/containment?
    » E. g. split up the "address" into separate parts below it or are they children of the document element?
Persondata.xml
Structure of XML: Characters
Within and between tags there may be text
Text consists of any Unicode characters
  Small exception: Surrogate blocks, 0xFFFE (byte-swapped BOM) and 0xFFFF
What it means depends on the definition
  Usually characters only within the innermost elements
    » Example: <a><b>chars1</b><b>chars2</b></a>
    » But not: <a><b>chars1</b>chars3<b>chars2</b></a> (although this is allowed by the specification!)
May never contain < or & (and > is best escaped as well)
    » Must be represented as "&lt;", "&gt;" and "&amp;" (or numerically)
      – No other special characters defined by default (unlike HTML!)
Whitespace: Space, carriage return, line feed, tab
  Special processing possible (e. g. automatically removed)
Characters: Examples
E-Mail address:
  "Michael Sonntag" <[email protected]>
    » "Michael Sonntag" &lt;[email protected]&gt;
    » Angle brackets cannot be contained in XML text content!
    » The quotes could be contained, unless this is an attribute!
Copyright notice:
  &copy; 2003 by Michael Sonntag for FIM
    » Web browsers know about such entities by default and can display them!
  © 2003 by Michael Sonntag for FIM
    » Difficult to insert manually unless a Unicode editor with a character table for graphically picking characters is available!
Company name:
  Acme GesmbH &amp; Co KG
  Results in: "Acme GesmbH & Co KG"
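Escaping by hand is error-prone, so programs usually leave it to a library. The sketch below assumes Python's standard xml.sax.saxutils helpers; the address someone@example.org and the element names are placeholders, not taken from the slides.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = '"Michael Sonntag" <someone@example.org> & Co'

# escape() replaces &, < and > for use as element content.
escaped = escape(raw)
from_content = ET.fromstring("<contact>" + escaped + "</contact>").text

# quoteattr() additionally picks suitable quotes (and escapes where needed),
# because quotes matter inside attribute values.
from_attribute = ET.fromstring("<contact name=" + quoteattr(raw) + "/>").get("name")
```

After parsing, both variants give back the original string unchanged.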
Characters: Example
List of commonly used entities:
  http://www.w3.org/TR/html401/sgml/entities.html
  Also understood by browsers!
Should this encoding be done extensively (wherever possible), or only if there is no other chance?
  Extensively:
    » Text gets harder to read
    » References require no special editor
    » Makes you think whether this is really needed
  Sparingly:
    » Text can be read everywhere at least to some degree
    » No table of entities needed
    » Files are shorter (a bit)
Structure of XML: Whitespace handling
Spaces, tabs, blank lines: Whitespace
  Often useful for visual layout (especially indentation)
  Usually unimportant for content / not intended for actual handling
  Sometimes however important
    » Poetry, source code, encoded content, ...
All characters must be passed to the application
  This includes the whitespace mentioned above
  Validating processor: Must additionally inform the application which of these characters are whitespace in element content
  Regardless of the value of xml:space (see next slide)!
Structure of XML: Whitespace handling
Special attribute as signal for application: "xml:space"
  Applies to the element it is an attribute of and to all contained elements, unless specified there separately
Can be used always and everywhere
  For valid documents it must be declared, however (see later)!
Values: "default" and "preserve"
  Apply to all contained elements unless overridden
  default: The application's default whitespace handling applies (typically: insignificant whitespace is ignored)
  preserve: The application should consider whitespace as important and preserve it
Structure of XML: Linebreak handling
To ease use, all linebreaks in special parts must be unified
  #xD #xA (CR LF) and #xD (CR) are converted to #xA (LF)
    » XML 1.1: Additionally #xD #x85, #x85 and #x2028 are converted
      – Special linebreaks from Unicode
  Happens before parsing: Application sees only #xA ('\n')
Applies only to "external parsed entities" and the XML file itself (the document entity), nothing else
  See later: Entities (DTD)!
  This is almost all content
  Except those encoded explicitly
    » E. g. by using '&#xD;'
      – This will stay the same and not be unified!
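The unification of linebreaks can be observed directly in the text a parser hands to the application. A small sketch, assuming Python's standard xml.etree.ElementTree (which uses the XML 1.0 parser Expat underneath); the sample content is made up.

```python
import xml.etree.ElementTree as ET

# Literal CR LF and a literal lone CR in the file, plus one CR that is
# written explicitly as the character reference &#13;.
document = "<text>line1\r\nline2\rline3&#13;\nline4</text>"

parsed = ET.fromstring(document).text

# Literal linebreaks arrive as "\n"; the explicitly encoded CR survives.
expected = "line1\nline2\nline3\r\nline4"
```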
Structure of XML: Language identification
Special attribute for identifying the natural language used: "xml:lang"
  Applies to all attributes and all subelements of the element it is defined in (unless there is a new declaration further down!)
Can be used always and everywhere
  For valid documents it must be declared, however (see later)!
Values are according to RFC 1766
  "en", "en-us", "en-gb", "de", "de-at", "de-de", ...
Examples:
  <p xml:lang="en-us">of golden color</p>
  <p xml:lang="en-gb">of golden colour</p>
  <title xml:lang="de" desc="short">Hin und wieder zurück</title>
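The parser only reports "xml:lang" where it is written; working out which language applies to a nested element is left to the application. A possible sketch with Python's standard xml.etree.ElementTree follows. The function name languages and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the reserved "xml:" prefix with its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages(element, inherited=None):
    """Yield (tag, effective language) for element and all descendants."""
    lang = element.get(XML_LANG, inherited)  # own declaration wins
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

book = ET.fromstring(
    '<book xml:lang="en-us">'
    '<p>of golden color</p>'
    '<title xml:lang="de">Hin und wieder zurück</title>'
    '<note><p>unmarked</p></note>'
    '</book>'
)
effective = list(languages(book))
```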
Structure of XML: CDATA sections
Allows including any text within an XML file
  Start: "<![CDATA[", End: "]]>"
  Allowed in between: Anything excluding "]]>" (the end marker)
    » No nesting of CDATA sections possible!
Tags within are treated as simple text, not as tags!
  During parsing the start and end markers are removed and only the contained data is returned as text
Example: "<![CDATA[<spider></spiders>]]>" is well-formed
  Parsing: Returns "<spider></spiders>" as text; NOT as elements or tags; and NO error (note the 's', which would cause an error if interpreted as a tag)!
CData sections: Example
Embedded HTML (unparsed, because it is not valid XML!)
  <![CDATA[<p>A short (and incorrect) HTML snippet<br><!-- The ending p-tag is missing here for example -->You can also do any other uncommon (&) and strange (<>) things! ]]>
  Result is pure text:
  <p>A short (and incorrect) HTML snippet<br><!-- The ending p-tag is missing here for example -->You can also do any other uncommon (&) and strange (<>) things!
Avoiding replacement by multiple character references:
  <![CDATA[int a = b <<< 1; val&=0x7F; if(b>=0) b=b>>1;]]>
  Instead of (alternate version of above with character references):
  int a = b &#60;&#60;&#60; 1; val&#38;=0x7F; if(b&#62;=0) b=b&#62;&#62;1;
  Result is pure text (exactly the same for both!):
  int a = b <<< 1; val&=0x7F; if(b>=0) b=b>>1;
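That a CDATA section and the escaped spelling give the application exactly the same text can be checked with a parser. The sketch assumes Python's standard xml.etree.ElementTree; the wrapper elements "code" and "web" are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

with_cdata = "<code><![CDATA[val&=0x7F; if(b>=0) b=b>>1;]]></code>"
with_references = "<code>val&#38;=0x7F; if(b&#62;=0) b=b&#62;&#62;1;</code>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_references = ET.fromstring(with_references).text

# Markup inside CDATA is plain text: no child elements are created and the
# mismatched "tags" cause no error.
web = ET.fromstring("<web><![CDATA[<spider></spiders>]]></web>")
```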
CData sections: Example
Integrating a complete mini XML file (header, document-element, an element with an attribute) into the file as ordinary text (therefore unparsed!)
Content: A very brief visiting card in XML format!
  Create the data to be inserted as a separate file first
  Embed this into the file as a CDATA section in the second step
VisitingCard.xml, Persondata_2.xml
Structure of XML: Comments
May appear anywhere outside of tags
Starts with "<!--" and ends with "-->"
  The string "--" may not occur within the comment
May be suppressed by a parser
  So perhaps not available to the application!
  Only for humans directly reading the XML file
Example: "<!-- Prints: 'Arrival date: <Insert birthday>' -->"
Comments: Example
Extend the personal data file by some comments:
  A few elements
  File header with name, E-Mail address, date of creation, ...
Persondata_3.xml
Structure of XML: Character & Entity references
Character reference: Represents a certain character
  To be used e. g. if the storage does not support Unicode
    » E. g. in ASCII text files
  Syntax: "&#" decimal number ";" or "&#x" hex number ";"
Entity reference: Allows substitution
  Syntax: "&" name ";" (general entities, used in the document content) or "%" name ";" (parameter entities, used only within the DTD)
  No recursion allowed (an entity may not contain a reference to itself)
  Declaration of entities and different types: See DTD!
Example: "Weight is &gt; &value-in-gold; g pure gold."
  Might return: "Weight is > 351,7 g pure gold."
    » With the definition: <!ENTITY value-in-gold "351,7">
Example: &#160; is the same as &#xA0; (no-break space)
Character & Entity references : Example
In anticipation of DTDs (entity definition; see later):
  <!ENTITY copyright "&#169; Michael Sonntag 2003">
Usage:
  <footer>&copyright;</footer>
Results in the same as:
  <footer>&#169; Michael Sonntag 2003</footer>
Results in the same as:
  <footer>© Michael Sonntag 2003</footer>
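A short sketch of the three equivalent spellings, assuming Python's standard xml.etree.ElementTree, which expands general entities declared in an internal DTD. The entity is referenced in the content with "&copyright;" and is assumed to be declared with a numeric character reference for the copyright sign.

```python
import xml.etree.ElementTree as ET

with_entity = """<?xml version="1.0"?>
<!DOCTYPE footer [
  <!ENTITY copyright "&#169; Michael Sonntag 2003">
]>
<footer>&copyright;</footer>"""

with_character_reference = "<footer>&#169; Michael Sonntag 2003</footer>"
with_literal_character = "<footer>\u00a9 Michael Sonntag 2003</footer>"

texts = [ET.fromstring(doc).text
         for doc in (with_entity, with_character_reference, with_literal_character)]

# Decimal and hexadecimal character references name the same character.
no_break_spaces = ET.fromstring("<s>&#160;&#xA0;</s>").text
```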
Structure of XML: Processing instructions
Processing instructions (PI) allow documents to contain instructions for applications/parsers
Syntax: "<?" name parameters "?>"
  name: Any name excluding "xml", "XmL", …
  parameters: Any text not containing "?>"
PIs are just passed on by the parser to the application; it should (but is not required to) know about them
  Content is NOT returned as character data!
Example: "<?php mysql_query("SELECT * FROM recipes WHERE duration<30",$id);?>"
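How a PI reaches the application can be seen with a DOM parser, which keeps PIs as separate nodes. The sketch assumes Python's standard xml.dom.minidom; the PI name "render-hint" and its parameters are invented for the example.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?render-hint columns="2"?>'
    '<list><item/></list>'
)

# The PI is delivered as (name, parameters), not as character data.
# The XML declaration looks similar but is not reported as a PI.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
root_name = document.documentElement.tagName
```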
Problems of XML: Is it enough?
Now we can write content, but we still cannot define it!
  Use DTDs; they're part of the XML specification (see immediately)
Entities can still be complicated
  Parsed/unparsed ones, which are allowed in which position, …
Attributes vs. elements: When to use which?
  Highly debated
    » Result: Designer's choice!
  Elements can be refined, attributes not
    » Use attributes sparingly!
Assumes everything should (and can) be expressed as a tree
  Structure of tags within tags
  Everything else must be plain text or be expressed as a (sub-)tree
No viewer: Must always be processed
Syntax - Semantics - Pragmatics (- Politics?)
Syntax:
  How the individual elements may be put together
Semantics:
  What this special combination of elements means
Pragmatics:
  What is the result of this semantic (application to problems)
  Debate: Can exist only in humans or also in computers?
Politics: Where do we want to get to?
XML (up to here) does not even possess explicit syntax
DTDs provide some syntax
Schemata provide extensive syntax
RDF, OWL, ebXML, ... provide some semantics
Pragmatics: User / Parser / Application (programs) (or not at all)
DTD - The idea behind it
DTD = Document Type Definition
Defines the elements to be used in XML documents
Lists attributes allowed (or required) for elements
Describes allowed structural relationships between elements
  Which elements may be children of an element
  How often they may/must occur
Specifies sequence (if any) in which elements must appear
  In which sequence the children may appear
Can be included in the document or reside externally (*.dtd)
  Including an external DTD (by means of a document type declaration):
    » <!DOCTYPE message SYSTEM "message.dtd">
Specifies the grammar (=syntax) for a certain application
  Documents must follow this grammar exactly to be valid
General structure of DTD
The DTD must appear before the first element
  I. e. immediately after the XML header and before the document element
Two kinds possible:
  External: <!DOCTYPE message SYSTEM "message.dtd">
  Internal: <!DOCTYPE message [ <!ELEMENT message ANY> ]>
  External+Internal: Also possible; better avoid it!
    » Internal DTD takes precedence over the external one
The name of the DTD (e. g. "message") must be identical with the name of the document element!
Hello.xml (trivial; with internal DTD)
Validity of XML
Only applicable if a DTD or schema is used
  Well-formed documents need not match their/some specification of the structure
  If they do, they are also "valid"
Schemas: Many additional rules
  E. g. textual content must match the specified datatype
Checking validity verifies the syntax of the document on a higher level of abstraction
  "Basic" syntax: Well-formedness (correct form & naming & containment of tags)
  "Extended" syntax: Validity (correct name & correct content)
    » Schemas: Correct datatype, ...
Defining elements
"<!ELEMENT" name ("EMPTY"|"ANY"|mixed|children) ">"
EMPTY: No content allowed
  Must always be an empty element
    » Example: The <br> element from HTML
ANY: Any content is allowed
  Text and elements in any order and combination
mixed: Any sequence of "#PCDATA" and the listed elements
  #PCDATA: Parsed character data, i. e. any text
    » Text only; no elements allowed!
  Order or number of occurrences of children CANNOT be defined, only their type!
    » Similar to HTML (text with arbitrary tags/elements in between)
  Syntax: (#PCDATA | ... | ... | ...)*
    » #PCDATA must be the first one!
    » The trailing "*" is important and required!
Defining elements
children: May use parentheses for grouping
  Choice of children ("|")
  Sequence of children (",")
  May NOT contain PCDATA!
    » Only for elements (and perhaps whitespace in between them)
Qualifier for children:
  "?": Optional (child may occur 0 or 1 times)
  "+": Child may occur 1 or more times
  "*": Child may occur 0 or more times
  None: Must occur exactly once
Special version of mixed content: PCDATA only
  "<!ELEMENT" name "(#PCDATA)>"
  The "*" is optional only here!
Defining elements: Examples
<!ELEMENT barrel (volume,content?,label*)>
  volume: exactly 1; content?: 0 or 1; label*: 0 to N
  First volume, then content (or not) and at the end perhaps several labels
  Correct:
    » <barrel><volume/></barrel>
    » <barrel><volume/><content/></barrel>
    » <barrel><volume/><label/><label/></barrel>
  Incorrect:
    » <barrel><content/><volume/></barrel>: Wrong sequence
    » <barrel><content/><label/></barrel>: "volume" is missing
<!ELEMENT content (#PCDATA|goodsID)*>
  Correct:
    » <content>A lot of garbage</content>
    » <content>Garbage<goodsID/>Oil<goodsID/><goodsID/></content>
  Incorrect:
    » <content><goodsDesc/></content>: "goodsDesc" is a tag and not text
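A "children" content model is essentially a regular expression over the names of the child elements. The following sketch makes this visible by translating a model into a Python regex and matching the sequence of child names against it. It is an illustration only, not how validating parsers are implemented: it ignores #PCDATA, EMPTY and ANY, and the function names are invented. The child element is assumed to be called "label", as in the sample documents.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate a DTD 'children' content model into a regex over child names."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model):
        if token == ",":
            continue                      # sequence = plain concatenation
        elif token == "(":
            parts.append("(?:")
        elif token in (")", "|", "?", "*", "+"):
            parts.append(token)           # same meaning as in a regex
        else:
            # An element name; the trailing blank separates the names.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))

def children_match(model, element_markup):
    """Check the direct children of the given element against the model."""
    element = ET.fromstring(element_markup)
    child_names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(child_names) is not None

BARREL = "(volume,content?,label*)"
checks = [
    children_match(BARREL, "<barrel><volume/></barrel>"),
    children_match(BARREL, "<barrel><volume/><content/></barrel>"),
    children_match(BARREL, "<barrel><volume/><label/><label/></barrel>"),
    children_match(BARREL, "<barrel><content/><volume/></barrel>"),
    children_match(BARREL, "<barrel><content/><label/></barrel>"),
]
```

The first three documents match the model, the last two do not (wrong sequence, missing volume).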
Defining elements Example
Define an internal DTD for the visiting card from the previous example
If you used attributes, ignore them for now!
VisitingCard_2.xml
Defining attributes
"<!ATTLIST" name attributes ">"
  name: To which element it belongs
  attributes: name type default-value
    » name: Name of this attribute
    » type: String, enumerated or token (see next page)
      – String: "CDATA": Character data (unparsed)
      – Enumerated: List of allowed strings
    » default-value: "#REQUIRED", "#FIXED" + value, "#IMPLIED" or just a value
      – #REQUIRED: Attribute must be present; no default value
      – #FIXED: Attribute always has exactly the given value (it may be omitted; if present, it must have this value)
      – #IMPLIED: Attribute is optional; no default value
      – None: Attribute is optional; default value must be provided
If more than one attribute list exists for an element:
  Attributes are merged (all attributes of all lists allowed in any order)
Types of attributes (Tokens)
ID: Defines a unique (within the document) id for this element
  An element may only have one ID (must be #IMPLIED or #REQUIRED)
IDREF: Reference to another ID in this document
  This ID must exist (but may occur later!)
IDREFS: Several valid IDREF may be specified
ENTITY: Must be the name of an unparsed entity
  More or less a reference to something external (XML, text, …)
ENTITIES: Several valid ENTITY may be specified
NMTOKEN: For specifying a valid name
  More restrictive than CDATA, which might be anything
NMTOKENS: Several valid NMTOKEN may be specified
Attribute value normalization
Before being passed to the application, all attribute values must be normalized by the parser
  Character references (see entities below) are taken "as-is"
    » Replaced, but no normalization takes place on their replacement text
  Entity references are replaced and then normalized
  Whitespace characters are replaced by a space
    » #xD (CR), #xA (LF), #x9 (TAB)
  If attribute type is CDATA (or none available), the result is finished
  Otherwise, any leading and trailing blanks are removed and all consecutive blanks are compressed to a single blank
Example (in the results the characters are written out and separated by blanks for readability; #x20 is the space character):
  a=" &#xD;&#xD;A&#xA;&#xA; B  &#xD;&#xA; "
    » a is CDATA: "#x20 #xD #xD A #xA #xA #x20 B #x20 #x20 #xD #xA #x20"
    » a is NMTOKEN: "#xD #xD A #xA #xA #x20 B #x20 #xD #xA"
  Attention: Well-formed but invalid as a NMTOKEN!
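The difference between CDATA and tokenized attributes can be observed with a parser that reads the internal DTD. The sketch below assumes Python's standard xml.parsers.expat (a non-validating parser, so the invalid NMTOKEN value is not rejected). The attribute value is an assumed spelling of the example with explicit character references; the element name "e" and the helper function are invented.

```python
import xml.parsers.expat

def normalized_value(attribute_type):
    """Parse the same attribute value under the given declared type."""
    document = (
        f"<!DOCTYPE e [<!ATTLIST e a {attribute_type} #IMPLIED>]>"
        '<e a=" &#xD;&#xD;A&#xA;&#xA; B  &#xD;&#xA; "/>'
    )
    seen = {}
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attributes: seen.update(attributes)
    parser.Parse(document, True)
    return seen["a"]

# CDATA: literal blanks stay, referenced CR/LF are taken as-is.
cdata_value = normalized_value("CDATA")

# Tokenized type: leading/trailing blanks are removed and runs of blanks
# are collapsed; the referenced CR/LF characters are still untouched.
token_value = normalized_value("NMTOKEN")
```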
Attribute examples
<!ATTLIST instrument
  attuned (yes|no) "yes"      (possible values; default value)
  type    CDATA    #REQUIRED  (must be specified)
  owner   IDREF    #IMPLIED>  (can be specified; optional)
Correct:
  » <instrument type="saxophone"/>: Is considered attuned (default!)
  » <instrument owner="p12" type="horn"/>
  » <instrument type="flute" attuned="no" owner="Me"/>
Incorrect:
  » <instrument/>: type is missing
  » <instrument type="guitar" attuned/>: attuned must have some value
  » <instrument type="" attuned="maybe"/>: attuned has illegal value
  » <instrument type="&lt;to_be_determined&gt;" owner="p14"/>
    – Only if no ID="p14" exists somewhere!
Note: ID and IDREF values must be XML names, so they cannot start with a digit
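Default values from the DTD are supplied by the parser, so the application cannot tell whether "attuned" was written in the document or not. A small sketch, assuming Python's standard xml.parsers.expat, which reads the internal DTD but does not validate; the helper name parsed_attributes is invented.

```python
import xml.parsers.expat

DTD = """<!DOCTYPE instrument [
  <!ATTLIST instrument
    attuned (yes|no) "yes"
    type    CDATA    #REQUIRED
    owner   IDREF    #IMPLIED>
]>"""

def parsed_attributes(element_markup):
    """Return the attributes the parser reports for the document element."""
    seen = {}
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attributes: seen.update(attributes)
    parser.Parse(DTD + element_markup, True)
    return seen

saxophone = parsed_attributes('<instrument type="saxophone"/>')
flute = parsed_attributes('<instrument type="flute" attuned="no" owner="Me"/>')

# A non-validating parser does not complain about the missing #REQUIRED
# attribute; it still adds the default for "attuned".
without_type = parsed_attributes("<instrument/>")
```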
Attribute examples
Extend your visiting card by a version number and the date of the last change
  Model both as an attribute!
  Version number: Optional, default value "1.0"
  Last change: Obligatory
Afterwards extend the DTD by those two attributes!
  And all others you did use before
VisitingCard_3.xml
DTD example
Create a DTD for the message example
  Include it as an external DTD
    » Same file as before!
Compare it with an automatically created DTD:
  E. g. XMLSpy
  Questions:
    » Empty elements?
    » Multiplicity?
    » Optional elements which are currently missing?
  Therefore: Suited as a beginning, but exact checking afterwards needed!
Message.dtd, Message.xml
DTD example
Create a complete internal DTD for your personal data file
  Don't forget to also update the contained visiting card!
    » How to define the visiting card? CDATA does not exist
      – Hint: Think about what datatype this is, respectively why we use "CDATA" at all!
    » Attention: Do you have to include the already created DTD for the visiting card into the DTD for the personal data? Or somewhere else? Or not at all?
  You can assume that at least one course of study is selected by any student, but that some pursue several
  The visiting card is optional
  Add optional pager information
Persondata_4.xml
Conditional sections
Conditional sections allow in-/excluding parts of the DTD
  May only be part of the DTD (more precisely: of the external DTD subset), not the document content!
"<![INCLUDE[" definitions-to-include "]]>"
  Parse the contained part
"<![IGNORE[" definitions-to-ignore "]]>"
  Ignore till the end of the conditional section, may be nested
    » Contained INCLUDE conditional sections are still ignored!
Example: Used together with parameter entities
  <!ENTITY % draft 'INCLUDE' >
  <!ENTITY % final 'IGNORE' >
  <![%draft;[ <!ELEMENT book (comments*, title, body, supplements?)> ]]>
  <![%final;[ <!ELEMENT book (title, body, supplements?)> ]]>
Used (and to be used) rarely if at all!
Entities (1)
Three kinds exist:
  Internal entities
    » For defining commonly used text in a single location
    » Always a parsed entity
  External entities
    » For including other XML files
    » For defining valid unparsed external components
    » Almost always a parsed entity (unless a notation is present)
  Parameter entities
    » Are also expanded within the DTD and within other entities
    » Used in conditional sections
    » Can be internal or external (but CANNOT contain a notation!)
    » Always a parsed entity
Entities (2)
Two types:
  Parsed entities
    » Will be handled as an included XML file
    » Can possess a different character encoding (UTF-16, …)
    » If external, should begin with a literal text declaration: "<?xml …. ?>"
    » Must be well-formed
      – No including the start tag from one file and the end tag from another!
      – External parameter and internal entities are well-formed by definition
  Unparsed entities
    » Are always just a reference to something external
    » Will NOT be included or parsed; no linebreak handling
    » E. g. for referencing images
    » Can only be used as references in ENTITY or ENTITIES attributes
    » Notations and more on this are not explained here!
Defining entities
"<!ENTITY" name definition ">"
  Parameter entities: "<!ENTITY % " name definition ">"
    » Special replacement rules (e. g. replaced within "normal" entities)
Definition: A string value or an external reference
External references: Must include a system identifier (SYSTEM) and optionally a public identifier (PUBLIC)
    » Public: Used for generating other URIs (need not itself be a URI!)
    » See examples!
  Contains a URI to some external information
  Is a kind of include: May contain text and other elements
    » Declarations are NOT allowed
    » Referenced content MUST be well-formed XML
      – OR contain a reference to a declared notation
Entity examples
External entities
  <!ENTITY map SYSTEM "http://www.mountain.org/private/map.xml">
  <!ENTITY map PUBLIC "-//MOUNTAIN//MAP Special map of location" "http://www.mountain.org/private/map.xml">
    » If PUBLIC exists, the system identifier is the second one and has NO keyword!
  <!ENTITY door-pic SYSTEM "../imgs/OpenDoor.gif" NDATA gif>
Parameter and internal entities
  <!ENTITY % YN '"Yes"' >
  <!ENTITY WhatHeSaid "He said %YN;" >
  …
  &WhatHeSaid;
    » Result: "He said "Yes""
  Parameter entities are replaced within other entities! (Such a reference inside a declaration is only allowed in the external DTD subset.)
Predefined entities
Several characters pose problems when used in text:
  <, >, &, ', "
Both entity and character references can be used for escaping: Then they are treated as text
All XML processors MUST know them
  But they SHOULD be declared anyway
  <!ENTITY lt "&#38;#60;">
  <!ENTITY gt "&#62;">
  <!ENTITY amp "&#38;#38;">
  <!ENTITY apos "&#39;">
  <!ENTITY quot "&#34;">
"Double escaping": '&#38;' = '&', therefore '&#38;#60;' = '&#60;' = '<'
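A minimal sketch of escaping in practice, assuming Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree) as the tooling; the sample text and the element names "code" and "t" are made up for illustration. It shows that escaped text survives a round trip, and that entity references and character references are interchangeable.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "a < b && c > d"

# escape() replaces &, < and > with the predefined entity references.
escaped = escape(raw)

# After parsing, the references are treated as plain text again.
parsed = ET.fromstring("<code>" + escaped + "</code>").text

# Entity references and (decimal or hexadecimal) character references
# are equivalent ways of escaping the same character.
same = ET.fromstring("<t>&lt;&#60;&#x3C; &amp;&#38;</t>").text

print(escaped)
print(parsed)
print(same)
```

Quotes (' and ") only need escaping inside attribute values, which is why escape() leaves them alone by default.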
Predefined entities: Double escaping
<!ENTITY lt "&#38;#60;">
  After processing the declaration, 'lt' references the string '&#60;'
  After replacing '&lt;' in the text, the text contains the string '&#60;'
  This returns '<' on parsing the text
    » Entity declarations are parsed twice: Once on their definition, again when their content is encountered in the text
<!ENTITY lt "&#60;">
  After processing the declaration, 'lt' references the string '<'
  After replacing '&lt;' in the text, the text contains '<'
  On parsing the text a tag would be expected to start!
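The difference between single and double escaping can be tried out with a small sketch. It assumes Python's xml.etree.ElementTree (a non-validating parser based on expat) and uses a hypothetical entity name "less" instead of "lt", because a parser may handle redeclarations of the predefined entities specially.

```python
import xml.etree.ElementTree as ET


def parse_with_entity(value):
    """Declare the hypothetical entity 'less' with the given literal value
    and reference it once in the document content."""
    doc = (
        '<!DOCTYPE doc [ <!ENTITY less "' + value + '"> ]>'
        "<doc>1 &less; 2</doc>"
    )
    return ET.fromstring(doc).text


# Double escaped: the replacement text is the five characters "&#60;".
# They become a plain "<" character only when the reference is expanded.
ok = parse_with_entity("&#38;#60;")

# Single escaped: the replacement text is already "<", so expanding the
# reference makes the parser expect the start of a tag.
try:
    parse_with_entity("&#60;")
    single_escape_error = None
except ET.ParseError as exc:
    single_escape_error = exc

print(ok)
print(single_escape_error)
```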
Complicated entity example
<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped numerically (&#38;#38;#38;) or with a general entity (&amp;amp;).</p>" >
After parsing, "example" references the following string:
  "<p>An ampersand (&#38;) may be escaped numerically (&#38;#38;) or with a general entity (&amp;amp;).</p>"
Referencing this in the document using "&example;" results in a "p" element containing the following text:
  "An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;)."
Entity declaration: Character references are replaced, but entity references are not (see "&#38;#38;" converted to "&#38;", but "&amp;amp;" remaining "&amp;amp;")
Entity usage: Character and entity references are replaced
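The two processing steps can be checked with a short sketch, assuming Python's xml.etree.ElementTree as the parser. The declaration is written as in the corresponding example of the XML 1.0 specification; the surrounding "doc" element is an assumption added only to have a place for the reference.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE doc [
<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped numerically (&#38;#38;#38;) or with a general entity (&amp;amp;).</p>" >
]>
<doc>&example;</doc>"""

# The reference &example; is expanded while parsing, which produces
# a "p" child element inside "doc".
paragraph = ET.fromstring(doc).find("p")
print(paragraph.text)
```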
Validating and Non-Validating Processors
Validating processors check the document's syntax (as defined by the DTD)
  Must read and process the entire DTD
  Must read and process all external parsed entities
  If not found, not readable, ...: Error!
Non-validating processors have limitations:
  Obviously: No errors on violations of this syntax!
  But also:
    » Information returned may vary: Parameter or external entities may or may not have been read
    » Therefore avoid this!
  Complicated definition of how far they must read and process an internal DTD
Both must report errors in well-formedness
  Non-validating: Only as far as they have read external parts!
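As an illustration, Python's xml.etree.ElementTree behaves as a non-validating processor. The sketch below uses a made-up "note" document: a violation of the DTD goes unnoticed, while a well-formedness error is reported.

```python
import xml.etree.ElementTree as ET

# The internal DTD demands exactly one <to> child, but the document
# contains <from> instead: well-formed, but not valid.
invalid_but_well_formed = """<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Michael</from></note>"""

# A non-validating parser accepts this without any error.
root = ET.fromstring(invalid_but_well_formed)
children = [child.tag for child in root]

# Well-formedness errors (here: a missing end tag) must always be reported.
not_well_formed = "<note><to>Michael</note>"
try:
    ET.fromstring(not_well_formed)
    reported = False
except ET.ParseError:
    reported = True

print(children, reported)
```

Checking validity requires a validating processor or a separate validation tool.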
DTD Example: Offline Converter configuration
Contains complete DTD
  PCDATA, sequence, optional elements
  Attribute lists
Contains complete data
Missing:
  Entities (external, internal)
  Character references
File: converter_config.xml
  Previous version of a program for converting IMS manifests into static webpages.
  New version uses schema!
XML … yes, nice! But this looks like ….!
Pure XML can be viewed with a web browser, but doesn't look too nice and isn't easy to read
Remedy: CSS (small solution) or XSLT (large solution)
  Both are complex standards, especially XSLT
  Including a stylesheet results in a better display
Files: Message.dtd, Message.css, Message_CSS.xml
XML … yes, nice! But this looks like ….!
[Figure: two screenshots of the document in the browser, labelled "Before" and "Afterwards"]
Example: Create XML for DTD and CSS
Create an XML file for the DTD below and check its well-formedness and validity with a tool
  Now you can easily write dramas like Shakespeare!
    » All you need is the DTD (and perhaps a little bit of inspiration!)
Test the presentation with the stylesheet provided!
Files: Book.dtd, Book.css, Book.xml
Is DTD the DDT for getting rid of all bugs?
DTD does not support different namespaces
  Two companies define the same tag with different meaning
    » There can never be a combined document!
Very weak datatype system
  Applies also only to attributes
DTD only provides syntax
DTD is itself not written in XML
  The description of an XML document should be written in XML
  Then it can be used by parsers without requiring special handling
Very simple constraints
  Only: 0, 1, 0..n, 1..n
  Impossible: 3, 5-9, …
  No choice possible: A or B or C, but only one (or two) of them
DTD should not really be used any more!
Because of the weaknesses mentioned before, DTDs should only be supported for backwards compatibility
  Biggest problem: DTD is not written in (well-formed) XML itself
    » Difficult to handle for parsers, programs, etc.
      – See part on programming later: Writing internal DTDs or manipulating them rarely works as expected or isn't supported at all!
When designing new systems: Use schemas!
  More complicated to learn
  Much more powerful and extensive language
  Can be easily handled by parsers
Some parts are however still important:
  Especially (external) entities
    » If such things are needed!
XML Namespaces
[Figure: overview of XML and related technologies; labels: HTML, XML, FO, Schema, XSLT, XPath, ebXML, SOAP, Security, Metadata, Java, XML Namespace, ...]
One more reason for XML… (1)
Suppose data exchange in binary format works…
  Then you want to add additional information to it
  Each and every parser using this data MUST be changed
Alternative: Extensive versioning support must be built in right from the beginning
  Extension fields/codes, etc.
    » Keeping those codes unique: Registry needed, ...
  Rather difficult!
Alternative: XML
  Additional data can always be added without any problems
  If the parser doesn't know it, it will just be ignored (or a warning is issued)
  Only those programs needing this information must be changed, while the other programs and all parsers stay exactly the same!
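A minimal sketch of this extensibility, assuming Python's xml.etree.ElementTree and the order document from the introduction. The function read_order and the added "Comment" element are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_order(xml_text):
    """Extract only the fields this (hypothetical) program needs."""
    order = ET.fromstring(xml_text)
    return {
        "number": order.findtext("OrderNr"),
        "customer": order.findtext("CustomerName"),
        "items": [item.text for item in order.iter("Item")],
    }


old_format = """<Order>
  <OrderNr>139</OrderNr>
  <CustomerName>Michael Sonntag</CustomerName>
  <Items><Item count="1" Nr="0815">Laptop</Item></Items>
</Order>"""

# The sender later adds a <Comment> element; read_order() stays unchanged.
new_format = old_format.replace(
    "</Order>", "  <Comment>Deliver to the back door</Comment>\n</Order>"
)

print(read_order(old_format))
print(read_order(new_format))
```

Both calls return the same data. Only a program that actually needs the comment would have to be extended.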
One more reason for XML… (2)
However, there might still be one problem with extending XML documents…
  What if an additional element should be introduced where an element with exactly this name already exists?
    » Example: Merging customer data and order data
      – E.g. title of the person + title of the book = ????
For this we need different "regions of naming"!
  These are called "namespaces" in XML
  Practically all XML standards require them for exactly this reason!
  Each element is additionally "qualified" by a unique name
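The title collision can be sketched as follows, assuming Python's xml.etree.ElementTree. The element names and both namespace URIs are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are made up for this illustration.
doc = """<order xmlns:cust="http://example.org/customer"
       xmlns:book="http://example.org/book">
  <cust:title>Dr.</cust:title>
  <book:title>Hamlet</book:title>
</order>"""

root = ET.fromstring(doc)

# ElementTree writes qualified names as "{namespace-URI}local-name",
# so the two "title" elements can be told apart.
person_title = root.findtext("{http://example.org/customer}title")
book_title = root.findtext("{http://example.org/book}title")

print(person_title, book_title)
```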
Namespace "example"
What is the context of ID?
  "The ID is alphanumeric"
  "I thought, the ID is only numbers"
  "The ID is the name and a number"
Namespace "invoice": invoice:IDNr
Namespace "customer": customer:IDNr
Namespace "order": order:IDNr
A brief interlude: XML Namespaces
Intention: Reusing markup structure ("vocabulary")
Problems:
  Recognition: How do we know which namespace it should be in?
  Collision: What if tags are named the same in different namespaces?
Namespace = Collection of names identified by a URI
  Content may be used as element types and attribute names
Qualified name (= includes namespace; also called 'QName'): namespace-prefix ":" local-name
  namespace-prefix: Mapped to the URI of this namespace
    » Interpretation by parser according to URI, not the prefix itself!
      – The prefix is only an abbreviation: conceptually, the complete URI stands in its place everywhere!
  local-name (= like "ordinary" name): See elements above!
    » Excluding the character ":", obviously!
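That the URI, not the prefix, identifies the name can be seen in a short sketch, assuming Python's xml.etree.ElementTree. The URIs and prefixes are made up.

```python
import xml.etree.ElementTree as ET

# The URIs and prefixes below are invented for this illustration.
a = ET.fromstring('<inv:IDNr xmlns:inv="http://example.org/invoice">4711</inv:IDNr>')
b = ET.fromstring('<i:IDNr xmlns:i="http://example.org/invoice">4711</i:IDNr>')
c = ET.fromstring('<inv:IDNr xmlns:inv="http://example.org/order">4711</inv:IDNr>')

same_name = a.tag == b.tag        # different prefix, same URI
different_name = a.tag != c.tag   # same prefix, different URI

print(a.tag, same_name, different_name)
```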
URI, URL, URN
Namespace names are strings
  They must be URIs, but usually they are URLs
URI = Uniform Resource Identifier
  For unique identification of objects
  URL and URN are applications (two subtypes) of URIs
URN = Uniform Resource Name
  Unique naming of objects, independent of location
  Need not be retrievable in any way!
  Example: urn:www-fim-jku-at/Converter/etc/ConfigFile.xsd
URL = Uniform Resource Locator
  For addressing resources in the Internet
  Need not be unique (several URLs for one resource, several resources for one URL), specifies a "location" (= for retrieval)
  Example: http://www.fim.uni-linz.ac.at/Converter/ConfigFile.xsd
XML Namespaces: Defining a namespace
Namespace declarations are attributes of elements
Two versions are possible:
  Unnamed (default namespace): "xmlns='" URI "'"
    » Scope: This element and its content; see also below (defaulting)
  Named: "xmlns:" name "='" URI "'"
    » Scope: This element and its content; applies only where the name is used as a prefix
    » The name itself has no meaning and can be chosen arbitrarily
  Scope ≠ Where it applies! Scope = Where it can be used!
URI: Defines the namespace; need not be retrievable
  Can therefore be a URN
name: May not start with "xml", "XML", "xMl", …
Example: <party:x xmlns:party="http://organizations.org/company">…</party:x>
  The "x" element is part of the namespace "http://organizations.org/company"; contained elements belong to it only if they use the "party" prefix too
Are we now qualified?
Qualification is possible for both elements and attributes:
  Qualified name replaces the "ordinary" name in both cases
  Attributes: Name may be the same IF the namespace is different!
Examples:
  <g xmlns:party="http://organizations.org/company">
    <party:leader title="Mr.">... </party:leader>
    <business party:registerNo="exempt">Headhunter</business>
  </g>
    » "leader" and "registerNo" are in NS "http://organizations.org/company"
    » "g" is NOT in this NS
    » "business", "title": See "defaulting" below!
  <tree biology:type="fir" age:type="young" state:type="weak"/>
    » Three times the same attribute ("type")
    » But each time in a different namespace; therefore valid!
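The two examples can be inspected with a short sketch, assuming Python's xml.etree.ElementTree. The slide omits the declarations for the prefixes "biology", "age" and "state", so the three URIs used for them here are invented.

```python
import xml.etree.ElementTree as ET

NS = "http://organizations.org/company"

doc = """<g xmlns:party="http://organizations.org/company">
  <party:leader title="Mr.">...</party:leader>
  <business party:registerNo="exempt">Headhunter</business>
</g>"""
g = ET.fromstring(doc)
leader, business = list(g)

# Three attributes with the same local name "type"; the namespace URIs
# are assumptions made for this sketch.
tree = ET.fromstring(
    '<tree xmlns:biology="http://example.org/biology" '
    'xmlns:age="http://example.org/age" '
    'xmlns:state="http://example.org/state" '
    'biology:type="fir" age:type="young" state:type="weak"/>'
)

print(g.tag, leader.tag, business.attrib)
print(sorted(tree.attrib))
```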
Uniqueness of namespaces
Namespace names must be unique (worldwide)
  Otherwise they bring no advantage at all!
Therefore usually constructed using domain names
  These are worldwide unique because of ICANN and registrars
  Below (after) the domain (name), the company is itself responsible for unique names
Examples:
  http://www.fim.uni-linz.ac.at/Konverter/ConfigFile.xsd
  http://www.fim.uni-linz.ac.at/Emerald/2003/1.0Alpha
    » Domain name: Uniqueness guaranteed by ICANN & registrar
    » Part after the domain name: Uniqueness guaranteed by FIM (hopefully!)
Defaulting: For the lazy ones!
A default namespace (xmlns="...") applies to the element it is declared on and ALL elements without a prefix within it (unless overridden by another declaration!)
  Previous example: "business" is not part of namespace party, because it has no prefix and no default namespace is declared on it or on its ancestor "g"
  <g xmlns="http://organizations.org/company"><business ...>...</business></g>
    » "business" is within the namespace, because the default namespace declared on its ancestor "g" also applies to it
Specifying an empty namespace (xmlns="") removes the default one for this and all its child elements
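How defaulting works under the Namespaces in XML rules can be sketched with Python's xml.etree.ElementTree (an assumption about the tooling). The "note" and "text" elements are made up to show the effect of an empty namespace declaration.

```python
import xml.etree.ElementTree as ET

doc = """<g xmlns="http://organizations.org/company">
  <business>Headhunter</business>
  <note xmlns=""><text>not in any namespace</text></note>
</g>"""
g = ET.fromstring(doc)

# Document order: g, business, note, text
tags = [element.tag for element in g.iter()]

# For comparison: a declaration with a prefix does not default anything;
# the child without a prefix stays outside the namespace.
prefixed = ET.fromstring(
    '<party:g xmlns:party="http://organizations.org/company"><business/></party:g>'
)
child_tag = prefixed[0].tag

print(tags)
print(prefixed.tag, child_tag)
```

Only the unnamed declaration (xmlns="...") is inherited by elements without a prefix. A named declaration merely makes the prefix available.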
Defaulting: Attributes
Default NS do NOT apply to attributes, only to elements!
  <g xmlns:party="http://organizations.org/company">
    <party:leader title="Mr.">... </party:leader>
    <business party:registerNo="exempt">Headhunter</business>
  </g>
  "registerNo" must explicitly specify its NS to be within it!
  "title" is NOT in namespace party!
Different attributes of one element can use different NS
  No defaulting for attributes, therefore always explicit specification!
  Possible: Same NS, different local name
  Possible: Different NS, same local name
  NOT possible: Same NS AND same local name!
    » Similar to "ordinary" XML: No two identical attributes
    » See tree example above!
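Both rules can be tried out in a small sketch, assuming Python's xml.etree.ElementTree with its namespace-aware expat parser. The second document and its prefixes "a" and "b" are made up to provoke the forbidden combination.

```python
import xml.etree.ElementTree as ET

# A default namespace and a prefix bound to the same URI on one element.
doc = """<leader xmlns="http://organizations.org/company"
        xmlns:party="http://organizations.org/company"
        title="Mr." party:registerNo="exempt"/>"""
leader = ET.fromstring(doc)

# The element is in the default namespace; the attribute without a prefix
# ("title") is not, only the explicitly qualified one is.
attribute_names = sorted(leader.attrib)

# Two prefixes bound to the same URI: both attributes would have the same
# namespace AND the same local name, which is not allowed.
duplicate = """<tree xmlns:a="http://example.org/biology"
      xmlns:b="http://example.org/biology"
      a:type="fir" b:type="oak"/>"""
try:
    ET.fromstring(duplicate)
    rejected = False
except ET.ParseError:
    rejected = True

print(leader.tag, attribute_names, rejected)
```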
XML+Namespace Example: IMS Manifest
Describes the organization of an online course
  Actually, only contains all possible metadata, but no real course!
No DTD – Uses schemata (see later)
See especially:
  Namespace declaration (document element)
  schema vs. lom: Namespace use
  title – langstring: xml:lang attribute
  Strange characters, e.g. keyword: UTF-8 encoding of "Umlaute"
  "validity" element: Alternate namespace
    » But "datetime" within: again in "imsmd"!
  vcard: Structured data within
    » Could also be modelled in XML, but is according to another standard
File: imsmanifest.xml
Literature: Specifications
XML 1.0 Specification (Version 3): http://www.w3.org/TR/REC-xml
XML 1.1 Specification: http://www.w3.org/TR/xml11
Tim Bray: Commented XML specification: http://www.xml.com/axml/testaxml.htm
  For original 1.0 specification only!
XML Namespaces Specification: http://www.w3.org/TR/REC-xml-names/
Markup Languages Cover Pages: http://www.oasis-open.org/cover/
Literature: Other
Knobloch/Knopp: Web-Design mit XML. Heidelberg: dpunkt 2001
Stanek: XML Pocket Consultant. Redmond: MS Press 2002
Harold: The XML Bible 2: http://www.ibiblio.org/xml/books/bible2/
Microsoft XML information: http://msdn.microsoft.com/xml/
Lots of XML information:
  http://www.xml.org/
  http://www.xml.com/
  http://www.devx.com/xml/
  http://xmlfiles.com/
W3 Schools: http://www.w3schools.com/