XML Validation I: DTDs
Robin Burke
ECT 360
Winter 2004
Outline
History
Grammars / Regular expressions
DTDs
  elements
  attributes
  entities
Declarations
Validation
Why bother?
The idea
Language consists of terminals
  a, b, c
Set of productions beginning with non-terminals
  A, B, C
  rules specifying how to generate sequences of terminals
Example
A → aB
A → aBA
B → b
generates strings
  ab, abab, ababab, etc.
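The derivation process can be sketched in a few lines of Python. This is only an illustration of the three productions above: the `PRODUCTIONS` table and the `derive` helper are names made up for this sketch, and the length bound exists only to keep the enumeration finite.

```python
from typing import Iterator

# Productions from the slide: A -> aB | aBA, B -> b.
# Upper-case letters are non-terminals, lower-case letters are terminals.
PRODUCTIONS = {
    "A": ["aB", "aBA"],
    "B": ["b"],
}

def derive(form: str, max_len: int) -> Iterator[str]:
    """Yield every terminal string of at most max_len symbols derivable from form."""
    if len(form) > max_len:
        return  # each symbol produces at least one terminal, so prune here
    for i, symbol in enumerate(form):
        if symbol in PRODUCTIONS:
            # Expand the leftmost non-terminal with each of its productions.
            for rhs in PRODUCTIONS[symbol]:
                yield from derive(form[:i] + rhs + form[i + 1:], max_len)
            return
    yield form  # no non-terminals left: a finished string

strings = sorted(set(derive("A", 6)), key=len)
print(strings)
```

Raising the bound from 6 simply adds longer repetitions of "ab".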
Grammar
Can be used to efficiently parse a language
  basis of all modern programming language parsing since Algol-60
  the Java Language Specification defines the language's syntax with an EBNF-style grammar
Grammar
XML
  grammar-based syntax
  adheres to EBNF
SGML
  had a more complex language definition syntax
  HTML is defined the SGML way
Regular expressions
Language for expressing patterns
Basic components
  pattern elements
  optional element = ?
  repetition (1 or more) = +
  repetition (0 or more) = *
  choice = |
  grouping = ( )
  sequence = ,
Examples
(a, b)*
  all strings "ab" "abab" etc. (and the empty string)
(a | b | c)+, q, (b | c)*
  aaqb
  bq
  bqcccccccc
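These patterns can be tried out with Python's `re` module. The sketch below assumes every symbol is a single character, so the DTD sequence operator `,` can simply be dropped; `dtd_to_regex` and `matches` are helper names invented here. It also shows why the choice of operator in the tail matters: a tail of `(b | c)*` accepts any mix of b and c, while a tail of `(b, c)*` only accepts whole "bc" pairs.

```python
import re

def dtd_to_regex(model: str) -> re.Pattern:
    # "," (sequence) is plain concatenation in Python's re syntax; the other
    # operators (? + * | and parentheses) mean the same thing in both.
    # Assumption: every symbol in the model is a single character.
    return re.compile(model.replace(",", "").replace(" ", ""))

def matches(model: str, text: str) -> bool:
    return dtd_to_regex(model).fullmatch(text) is not None

samples = ["aaqb", "bq", "bqcccccccc"]

print(matches("(a, b)*", "abab"))
print([matches("(a | b | c)+, q, (b | c)*", s) for s in samples])
print([matches("(a | b | c)+, q, (b, c)*", s) for s in samples])
```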
Note
Regular expressions are different in different applications
  Perl
  JavaScript
  XML Schemas
DTDs only support ? + * | , ( )
EBNF
EBNF is a more compact version of BNF
  it uses regular expressions to simplify grammar expression
A → aB
A → aBA
turns into
A → aB(A)?
only one production per non-terminal allowed
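With one production per non-terminal, each rule maps naturally onto one function. The following is a minimal recursive-descent recognizer for A → aB(A)? and B → b; the function names and the "return -1 on failure" convention are choices made for this sketch only.

```python
def parse_B(text: str, pos: int) -> int:
    """B -> b. Return the position after B, or -1 if B does not match."""
    return pos + 1 if text.startswith("b", pos) else -1

def parse_A(text: str, pos: int) -> int:
    """A -> a B (A)?. Return the position after A, or -1 on failure."""
    if not text.startswith("a", pos):
        return -1
    pos = parse_B(text, pos + 1)
    if pos == -1:
        return -1
    # Optional trailing A: take it when it matches, otherwise stop here.
    rest = parse_A(text, pos)
    return rest if rest != -1 else pos

def accepts(text: str) -> bool:
    """True when the whole text is one A."""
    return parse_A(text, 0) == len(text)

print([s for s in ["ab", "abab", "aba", "abb", ""] if accepts(s)])
```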
DTDs
Use EBNF to specify structure of XML documents
Plus
  attributes
  entities
Syntax
  holdover from SGML
  ugly
DTD Syntax
<!ELEMENT element-name content_model>
Content model
  contains the RHS of the production rule
Example
<!ELEMENT name (firstName, lastName)>
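Because a content model is a regular expression over child element names, a rough check can be written with `re` and `xml.etree.ElementTree`. This is a sketch, not a validating parser: `model_to_regex` and `is_valid` are hypothetical helpers, and they only handle element-only models (no #PCDATA, ANY or EMPTY).

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model: str) -> re.Pattern:
    """Turn an element-only content model into a regex over 'name;' tokens."""
    # Each element name becomes a group matching that name plus ";".
    pattern = re.sub(r"[\w.-]+",
                     lambda m: f"(?:{re.escape(m.group())};)",
                     model)
    # The sequence operator "," is plain concatenation in a regex.
    return re.compile(re.sub(r"[\s,]+", "", pattern))

def is_valid(element: ET.Element, model: str) -> bool:
    """Check the element's children (in order) against the content model."""
    children = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(children) is not None

MODEL = "(firstName, lastName)"
good = ET.fromstring("<name><firstName>Jane</firstName><lastName>Doe</lastName></name>")
swapped = ET.fromstring("<name><lastName>Doe</lastName><firstName>Jane</firstName></name>")

print(is_valid(good, MODEL), is_valid(swapped, MODEL))
```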
DTD Syntax cont'd
Not XML
  <! begins a declaration
  no "content"
  empty elements not indicated with />
Simple content models
Content can be any text
  #PCDATA
Content can be anything at all (useful for debugging)
  ANY
Element has no content
  EMPTY
Example
<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
</grades>
Example
<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
  <grade>
    <student>Wayne Doe</student>
    <assigned-grade>I</assigned-grade>
    <reason>Alien abduction</reason>
  </grade>
</grades>
Mixed content
Legal to have a content model with text and element data
<story category="national" byline="Karen Wheatley">
  <headline>President Meets with Congress</headline>
  <![CDATA[ The President met with Congressional leaders today in an
  effort to jump-start faltering budget negotiations. Sources described the mood
  of the meeting as "cordial". ]]>
  <full_text ref="news801" />
  <image src="img2071.jpg" />
  <image src="img2072.jpg" />
  <image src="img2073.jpg" />
</story>
CDATA?
Forgot to mention last week
Content that appears here will not be parsed
  can include arbitrary text including <, &, etc.
Only restriction
  cannot contain the termination sequence ]]>
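A quick way to see the effect of a CDATA section is to parse one with Python's standard `xml.etree.ElementTree`. The `<note>` element and its text are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "<" and "&" are ordinary characters, not markup.
doc = "<note><![CDATA[if a < b && b < c then <b>bold</b> is not a tag]]></note>"
note = ET.fromstring(doc)
print(note.text)   # the raw text, markup characters included
print(len(note))   # number of child elements

# The same characters outside CDATA make the document ill-formed.
try:
    ET.fromstring("<note>a < b</note>")
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```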
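Mixed content, cont'd
<!ELEMENT story (#PCDATA | headline | full_text | image)*>
Mixed content
  must be declared as (#PCDATA | name | ...)*
  #PCDATA comes first; the order and number of child elements cannot be constrained
  makes handling XML complex
  necessary for many applications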
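One reason mixed content is harder to handle: text and elements interleave, so an application has to track both. In Python's `xml.etree.ElementTree`, text before the first child is in `.text` and text after each child is in that child's `.tail`. The shortened `<story>` document below is made up for this sketch.

```python
import xml.etree.ElementTree as ET

doc = (
    "<story><headline>President Meets with Congress</headline>"
    "The President met with leaders today."
    '<image src="img2071.jpg" /> More text.</story>'
)
story = ET.fromstring(doc)

# Walk the content in document order, keeping text and elements apart.
pieces = []
if story.text:
    pieces.append(("text", story.text))
for child in story:
    pieces.append(("element", child.tag))
    if child.tail:
        pieces.append(("text", child.tail))

for kind, value in pieces:
    print(kind, repr(value))
```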
Recursion
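Unlike grammars
  recursive formulation ≠ repetition
Difference between
<!ELEMENT students (student+)>
<!ELEMENT students (student, students?)>
  the first gives a flat list of student children
  the second nests a new students element inside each students element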
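The two declarations describe differently shaped documents even though both hold "one or more students". The sketch below builds one document of each shape by hand (no DTD is read) and compares them; the `depth` helper is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Shape allowed by (student+): a flat list.
flat = ET.fromstring("<students><student/><student/><student/></students>")

# Shape allowed by (student, students?): each list nests inside the previous one.
nested = ET.fromstring(
    "<students><student/>"
    "<students><student/>"
    "<students><student/></students>"
    "</students></students>"
)

def depth(element: ET.Element) -> int:
    """Number of element levels from this element down to its deepest leaf."""
    return 1 + max((depth(child) for child in element), default=0)

for name, root in [("flat", flat), ("nested", nested)]:
    total = len(list(root.iter("student")))
    print(name, "students:", total, "direct children:", len(root), "depth:", depth(root))
```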
Restriction
The grammar cannot be ambiguous
  A → (a, b) | (a, c)
  this makes the parser implementation difficult
Usually easy to make non-ambiguous
  A → a, (b | c)
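The rewrite does not change the language, only how early the parser knows which branch it is in: after reading "a", the first form still has two candidate branches, the second has one. A small check that both forms accept the same strings, using Python regexes as a stand-in for content models (single-character symbols assumed):

```python
import re
from itertools import product

ambiguous = re.compile(r"(ab)|(ac)")     # A -> (a, b) | (a, c)
deterministic = re.compile(r"a(b|c)")    # A -> a, (b | c)

# Every string over {a, b, c} of length 0 to 3.
candidates = ["".join(p) for n in range(4) for p in product("abc", repeat=n)]

accepted_1 = [s for s in candidates if ambiguous.fullmatch(s)]
accepted_2 = [s for s in candidates if deterministic.fullmatch(s)]
print(accepted_1, accepted_2)
```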
Attribute lists
Declared separately from elements
  can be anywhere in the DTD
Specification includes
  name of the element
  name of the attribute
  attribute type
  default
Attribute types
Character data
  CDATA
  different from XML CDATA section!
Enumerated
  (yes|no)
ID
  must be unique in the document
IDREF
  must refer to an id in the document
NMTOKEN
  a restriction of CDATA to single "word"
Also IDREFS and NMTOKENS
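The ID and IDREF rules are what a validating parser checks on your behalf. A hand-rolled version of the same two checks might look like the sketch below. It does not read a DTD, so the attribute names `id` and `ref` are assumptions standing in for whatever the ATTLIST would declare as ID and IDREF, and `check_ids` is a made-up helper.

```python
import xml.etree.ElementTree as ET

def check_ids(root: ET.Element, id_attr: str = "id", idref_attr: str = "ref") -> list:
    """Return a list of ID/IDREF problems found in the tree under root."""
    problems = []
    seen = set()
    # Rule 1: every ID value must be unique in the document.
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    # Rule 2: every IDREF must name an ID that exists somewhere in the document.
    for element in root.iter():
        target = element.get(idref_attr)
        if target is not None and target not in seen:
            problems.append(f"IDREF {target!r} has no matching ID")
    return problems

ok = ET.fromstring('<doc><item id="a1"/><link ref="a1"/></doc>')
bad = ET.fromstring('<doc><item id="a1"/><item id="a1"/><link ref="zz"/></doc>')
print(check_ids(ok))
print(check_ids(bad))
```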
Default declaration
#REQUIRED
#IMPLIED
  means optional
Value
  this becomes the default
#FIXED
  value provided
Examples
<!ATTLIST img
src CDATA #REQUIRED
alt CDATA #REQUIRED
align (left|right|center) "left"
id ID #IMPLIED
>
<!ATTLIST timestamp
time-zone NMTOKEN #IMPLIED>
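Default values are visible even to a non-validating parser when the ATTLIST sits in the document's internal DTD subset. The sketch below uses Python's `xml.etree.ElementTree` (built on expat, which does not validate, so #REQUIRED is not enforced here); the `<img>` instance document is invented for the illustration.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE img [
  <!ATTLIST img
    src   CDATA               #REQUIRED
    alt   CDATA               #REQUIRED
    align (left|right|center) "left"
    id    ID                  #IMPLIED>
]>
<img src="logo.png" alt="Logo"/>"""

img = ET.fromstring(doc)
print(img.get("align"))  # not written in the document; filled in from the DTD default
print(img.get("id"))     # #IMPLIED: optional, no default, so absent
```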
Entities
Like macros
  content to be inserted
  indicated with &name;
Predefined general entities
  &amp; &lt;
  essential part of XML
User-defined general entities
  &disclaimer;
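Entity expansion can be observed with Python's `xml.etree.ElementTree`. This is a minimal sketch: the `<p>`, `<book>` and `<notice>` elements are invented, and the user-defined entity is declared in the document's internal DTD subset with `<!ENTITY name "content">`.

```python
import xml.etree.ElementTree as ET

# Predefined entities need no declaration.
predefined = ET.fromstring("<p>a &lt; b &amp; c</p>").text
print(predefined)

# A user-defined general entity is declared once and expanded wherever it is used.
doc = """<!DOCTYPE book [
  <!ENTITY disclaimer "This is a work of fiction.">
]>
<book><notice>&disclaimer; Enjoy.</notice></book>"""

notice = ET.fromstring(doc).find("notice").text
print(notice)
```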
Entities, cont'd
Parameter entities
  can also be used to simplify DTD creation
  or to combine DTDs
  indicated with a %
More on this next week
Defining general entities
<!ENTITY name content>
Example
<!ENTITY disclaimer
  "This is a work of fiction. Any resemblance to persons living or dead is unintentional.">
Unparsed data
What about non-text data?
  images, audio files
In XML we define a notation
  create a name and associate an application
  suggestion to the application how to interpret the unparsed data
  not part of parsing operation
Using Notation
<!NOTATION name SYSTEM url>
Example
<!NOTATION jpeg SYSTEM "IExplore.exe">
  declares the jpeg notation
Example
<!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
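The parser does not open the image; it only reports the declarations and leaves the rest to the application. A sketch with Python's `xml.parsers.expat`, which exposes both declarations through handlers: the `<album>` root element and the two dictionaries are made up for this illustration, and the entity name is written without quotes, as the XML syntax requires.

```python
import xml.parsers.expat

doc = """<!DOCTYPE album [
  <!NOTATION jpeg SYSTEM "IExplore.exe">
  <!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
]>
<album/>"""

notations = {}   # notation name -> system identifier
unparsed = {}    # entity name -> (system identifier, notation name)

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    unparsed[name] = (system_id, notation_name)

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(doc, True)

print(notations)
print(unparsed)
```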
Notation, cont'd
Note that the content is defined in the DTD
  not the document
  binary data embedded in XML document
Not that useful in practice
  more likely to use URLs
Typical Example
<story category="national" byline="Karen Wheatley">
...
<full_text ref="news801" />
<image src="img2071.jpg" />
<image src="img2072.jpg" />
<image src="img2073.jpg" />
</story>
Now it is up to the application to do something appropriate with the src attribute
A better solution
Use XLink
  we'll talk about this later
DTD limitations
Not in XML
  need a special parser for the DTD
No content type restrictions
  #PCDATA can be anything
Element names must be globally unique
  cannot reuse a common term at different places in the document
  course-name
  professor-name
DTD benefits
Relatively easy to write and understand
  wait until you see XML Schema!
Possible to modularize and combine DTDs
  more next week
Next week
More DTDs
  Modularization and parameterization
  on-line reading
Beginning Schemas
  4.1-4.30
Lab