What Are Real DTDs Like

Group Members: Xijie Zeng, Peiyu Cai

Presenter: Xijie Zeng

Outline

Overview
Introduction
Local properties
Global properties

Overview

XML is widely used in a variety of areas

DTDs with different structures define XML with different usages

A survey based on a number of real-world DTDs

Introduction

DTDs are from the XML.org DTD repository

Three DTD categories:

app: Describe objects interchanged between programs/applications

data: Describe data stored in database

meta: Describe the structure of document markup

60 DTDs: 7 are app, 13 are data, 40 are meta

Introduction (cont.)

A DTD can be described as a collection of element declarations of the form e → α, where e is the element name and α is the content model.

The content model: α ::= ε | pcdata | e | (α, α) | (α | α) | α* | α+ | α?

Introduction (cont.)

Email DTD

<!ELEMENT email (head, body)>
<!ELEMENT head (from, to+, cc*, subject)>
<!ELEMENT from EMPTY>
<!ATTLIST from name CDATA #IMPLIED
               address CDATA #REQUIRED>
<!ELEMENT to EMPTY>
<!ATTLIST to name CDATA #IMPLIED
             address CDATA #REQUIRED>
<!ELEMENT cc EMPTY>
<!ATTLIST cc name CDATA #IMPLIED
             address CDATA #REQUIRED>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT body (text, attachment*)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment EMPTY>
<!ATTLIST attachment encoding (mime|binhex) "mime"
                     file CDATA #REQUIRED>

Abstracted form:

email → (head, body)
head → (from, to+, cc*, subject)
from → (ε)
to → (ε)
cc → (ε)
subject → (pcdata)
body → (text, attachment*)
text → (pcdata)
attachment → (ε)
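The step from DTD text to the abstracted element declarations can be sketched in Python. This is a minimal sketch, not the survey's tooling. The name `element_declarations` and the regular expression are assumptions made for illustration, and only plain `<!ELEMENT ...>` declarations are handled (no parameter entities, no comments).

```python
import re

# Matches one <!ELEMENT name content-model> declaration (illustrative pattern).
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)\s*>", re.S)


def element_declarations(dtd_text):
    """Map each element name to its content model, ignoring ATTLIST declarations."""
    decls = {}
    for name, model in ELEMENT_RE.findall(dtd_text):
        model = model.replace("#PCDATA", "pcdata")
        decls[name] = "ε" if model == "EMPTY" else model
    return decls


SAMPLE = """
<!ELEMENT email (head, body)>
<!ELEMENT from EMPTY>
<!ATTLIST from name CDATA #IMPLIED
               address CDATA #REQUIRED>
<!ELEMENT subject (#PCDATA)>
"""

for name, model in element_declarations(SAMPLE).items():
    print(name, "→", model)
```

Attribute lists are dropped on purpose, which mirrors the abstraction used in the survey.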

Introduction (cont.)

Local properties: Describe content models in individual element declarations

Global properties: Describe the graph-theoretic structure of the whole DTD

Local properties

Content model classification

(1) pcdata
(2) ε
(3) any: No restriction on subelements
(4) Mixed content
(5) “|” only but not mixed content
(6) “,” only
(7) Complex content: Contains both “|” and “,”
    directory → (dirname, dirinfo?, dirdesc?, (file | directory)*)
(8) List: α* or α+
(9) Single: α?

Examples:
body → (text, attachment*)
text → (pcdata)
body1 → (pcdata, attachment*)

Local properties (cont.)

Content model classification

Local properties (cont.)

Syntactic complexity

depth(ε) = 0
depth(e) = depth(pcdata) = 1
depth(α*) = depth(α+) = depth(α?) = depth(α) + 1
depth(α1, α2, …, αn) = depth(α1 | α2 | … | αn) = max(depth(αi)) + 1

Local properties (cont.)

An example

head → (from, to+, cc*, subject)

depth(from, to+, cc*, subject)
= depth(cc*) + 1
= depth(cc) + 1 + 1
= 1 + 1 + 1 = 3
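The depth rules translate directly into a recursive function. The sketch below assumes a nested-tuple encoding of content models (`("elem", name)`, `("star", α)`, `("seq", α1, ..., αn)` and so on). That encoding is an illustration and does not come from the slides.

```python
# Illustrative encoding of content models as nested tuples:
#   ("eps",), ("pcdata",), ("elem", name),
#   ("star", a), ("plus", a), ("opt", a),
#   ("seq", a1, ..., an), ("alt", a1, ..., an)

def depth(model):
    """Syntactic depth of a content model, following the rules above."""
    kind = model[0]
    if kind == "eps":
        return 0
    if kind in ("pcdata", "elem"):
        return 1
    if kind in ("star", "plus", "opt"):
        return depth(model[1]) + 1
    if kind in ("seq", "alt"):
        return max(depth(m) for m in model[1:]) + 1
    raise ValueError(f"unknown content model node: {kind!r}")


# head → (from, to+, cc*, subject)
head = ("seq",
        ("elem", "from"),
        ("plus", ("elem", "to")),
        ("star", ("elem", "cc")),
        ("elem", "subject"))

print(depth(head))  # 3
```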

Local properties (cont.)

Determinism

If a content model does not require lookahead when parsing, it is a deterministic content model.

Non-deterministic content model: (a, b) | (a, c)

Deterministic content model: a, (b | c)

Result

The survey detects 5 non-deterministic content models in 4 DTDs.
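One standard way to test for determinism is the position (Glushkov) construction: number every symbol occurrence, compute the first and follow sets, and reject the model if two different occurrences of the same name compete at any point. The sketch below assumes this notion matches the one used in the survey. The tuple encoding and the names `analyse` and `is_deterministic` are illustrative.

```python
def analyse(model, symbols, follow):
    """Return (nullable, first, last) of a content model.

    Positions are indices into `symbols`; `follow[p]` collects the
    positions that may come directly after position p.
    """
    kind = model[0]
    if kind == "eps":
        return True, set(), set()
    if kind in ("elem", "pcdata"):
        pos = len(symbols)
        symbols.append(model[1] if kind == "elem" else "#PCDATA")
        follow.append(set())
        return False, {pos}, {pos}
    if kind in ("star", "plus", "opt"):
        nullable, first, last = analyse(model[1], symbols, follow)
        if kind != "opt":  # repetition: the end may loop back to the start
            for p in last:
                follow[p] |= first
        return nullable or kind != "plus", first, last
    if kind not in ("seq", "alt"):
        raise ValueError(f"unknown content model node: {kind!r}")
    parts = [analyse(m, symbols, follow) for m in model[1:]]
    if kind == "alt":
        return (any(n for n, _, _ in parts),
                set().union(*(f for _, f, _ in parts)),
                set().union(*(l for _, _, l in parts)))
    nullable, first, last = True, set(), set()
    for n, f, l in parts:  # sequence, processed left to right
        for p in last:
            follow[p] |= f
        if nullable:
            first |= f
        last = (last | l) if n else set(l)
        nullable = nullable and n
    return nullable, first, last


def is_deterministic(model):
    """True if no state offers two occurrences of the same name."""
    symbols, follow = [], []
    _, first, _ = analyse(model, symbols, follow)
    for candidates in [first] + follow:
        names = [symbols[p] for p in candidates]
        if len(names) != len(set(names)):
            return False
    return True


a, b, c = ("elem", "a"), ("elem", "b"), ("elem", "c")
print(is_deterministic(("alt", ("seq", a, b), ("seq", a, c))))  # False
print(is_deterministic(("seq", a, ("alt", b, c))))              # True
```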

Local properties (cont.)

Ambiguity

Definition: An expression R is ambiguous if and only if there exists some string s in the language of R that can be parsed in more than one distinct way.

An example:
partner → (name?, onetime?, partnrid?, partnrtype?, syncind?, name*, parentid?, partnridx?, partnrratg*)

Here name occurs both as name? and as name*, so a single name element can be matched by either occurrence.

Result

The survey detects 2 ambiguous content models.
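Ambiguity can be made concrete by counting the parses of one particular string. The sketch below is a brute-force illustration, not a decision procedure and not the survey's method. The tuple encoding and the name `count_parses` are assumptions, the partner model is shortened to three items, and iterations of `*` and `+` are required to consume at least one token so that the count stays finite.

```python
from functools import lru_cache


def count_parses(model, tokens):
    """Count the distinct ways `tokens` can be parsed by `model`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(m, i, j):
        kind = m[0]
        if kind == "eps":
            return int(i == j)
        if kind == "elem":
            return int(j == i + 1 and tokens[i] == m[1])
        if kind == "opt":
            return int(i == j) + count(m[1], i, j)
        if kind == "alt":
            return sum(count(a, i, j) for a in m[1:])
        if kind == "seq":
            if len(m) == 1:  # empty sequence matches only the empty span
                return int(i == j)
            rest = ("seq",) + m[2:]
            return sum(count(m[1], i, k) * count(rest, k, j)
                       for k in range(i, j + 1))
        if kind in ("star", "plus"):
            ways = int(i == j and kind == "star")
            # first iteration takes tokens[i:k], the rest is a star
            ways += sum(count(m[1], i, k) * count(("star", m[1]), k, j)
                        for k in range(i + 1, j + 1))
            return ways
        raise ValueError(f"unknown content model node: {kind!r}")

    return count(model, 0, len(tokens))


# Shortened version of the partner model: (name?, onetime?, name*)
partner = ("seq",
           ("opt", ("elem", "name")),
           ("opt", ("elem", "onetime")),
           ("star", ("elem", "name")))

print(count_parses(partner, ["name"]))  # 2: matched by name? or by name*
```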

Global properties

Reachability

Definition: An element name e' is reachable from e, denoted by e ⇒ e', if either e → α and e' occurs in α, or e ⇒ e'' and e'' ⇒ e'.

Definition: An element name e is reachable if r ⇒ e, where r is the name of the root element. Otherwise, e is called unreachable or useless.

An example:
email → (head, body)
head → (from, to+, cc*, subject)

email ⇒ head
email ⇒ subject
head ⇒ subject
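Reachability is a plain graph traversal once each element name is mapped to the names occurring in its content model. The sketch below assumes that mapping is given as a dictionary. The names `reachable_from` and `unreachable`, and the `orphan` element added for the demonstration, are illustrative.

```python
from collections import deque

# Element name -> element names occurring in its content model.
email_dtd = {
    "email": ["head", "body"],
    "head": ["from", "to", "cc", "subject"],
    "from": [], "to": [], "cc": [], "subject": [],
    "body": ["text", "attachment"],
    "text": [], "attachment": [],
}


def reachable_from(dtd, start):
    """All element names reachable from `start` in one or more steps."""
    seen, queue = set(), deque([start])
    while queue:
        element = queue.popleft()
        for child in dtd.get(element, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def unreachable(dtd, root):
    """Declared element names that cannot be reached from the root."""
    return set(dtd) - reachable_from(dtd, root) - {root}


print("subject" in reachable_from(email_dtd, "email"))   # True
print(unreachable(email_dtd, "email"))                   # set()
print(unreachable({**email_dtd, "orphan": []}, "email")) # {'orphan'}
```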

Global properties (cont.) Reachability

Unreachable element names in DTDs

Global properties (cont.)

Recursions

Definition: A content model α is derivable from an element name e, denoted by e ⇝ α, if either e → α, or e ⇝ α', e' → α'', and α = α'[e'/α''], where α'[e'/α''] denotes the content model obtained by substituting α'' for all occurrences of e' in α'.

An example:
email → (head, body)
head → (from, to+, cc*, subject)
email ⇝ (from, to+, cc*, subject, body)

Definition: A DTD is recursive if and only if it has an element name e such that e ⇒ e (e is reachable from itself) and e is reachable.
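The recursion test can be sketched as a cycle check restricted to reachable element names. The dictionary encoding and the names `reachable_from` and `is_recursive` are assumptions for illustration. The sketch does not distinguish linear from non-linear recursion.

```python
def reachable_from(children, start):
    """All element names reachable from `start` in one or more steps."""
    seen, stack = set(), [start]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_recursive(children, root):
    """True if some reachable element name is reachable from itself."""
    return any(e in reachable_from(children, e)
               for e in reachable_from(children, root))


email_dtd = {
    "email": ["head", "body"],
    "head": ["from", "to", "cc", "subject"],
    "body": ["text", "attachment"],
}

# directory → (dirname, dirinfo?, dirdesc?, (file | directory)*)
directory_dtd = {
    "directory": ["dirname", "dirinfo", "dirdesc", "file", "directory"],
}

print(is_recursive(email_dtd, "email"))          # False
print(is_recursive(directory_dtd, "directory"))  # True
```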

Global properties (cont.)

Recursions

Definition: A DTD is linear recursive if and only if it is recursive and, for any reachable element name e and any content model α derivable from e, e occurs at most once in α and the occurrence is not enclosed in “*” or “+”. A DTD is said to be non-linear recursive if it is recursive but is not linear recursive.

An example of non-linear recursive:
directory → (dirname, dirinfo?, dirdesc?, (file | directory)*)

An example of linear recursive:
e → (pcdata | e)

Result

No linear recursive DTD is found in the sample DTDs. There are 7, 2 and 26 non-linear recursive DTDs in the app, data and meta categories respectively.

Global properties (cont.)

Chain of stars

An example:
entity → (name*, contact*, location*, phone*, fax*)
location → (city*, otherinfo?)

There is a chain of 2 stars: location* in entity, followed by city* in location.
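The slides give no formal definition, so the sketch below assumes that a chain of stars is a path of element names in which every step goes to a child that occurs under “*” or “+”, and that its length is the number of such steps. The name `longest_star_chain` and the dictionary encoding are illustrative.

```python
# Element name -> children that occur under "*" or "+" in its content model.
starred_children = {
    "entity": {"name", "contact", "location", "phone", "fax"},
    "location": {"city"},  # otherinfo? is optional, not starred
}


def longest_star_chain(starred, element, on_path=frozenset()):
    """Length of the longest chain of starred steps starting at `element`."""
    on_path = on_path | {element}  # guards against cycles in recursive DTDs
    best = 0
    for child in starred.get(element, ()):
        if child not in on_path:
            best = max(best, 1 + longest_star_chain(starred, child, on_path))
    return best


print(longest_star_chain(starred_children, "entity"))  # 2
```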

Global properties (cont.) Chain of stars

Global properties (cont.)

Hubs

Definition: The fan-in of an element name e is the cardinality of the set {e' | e' → α and e occurs in α}. An element name with a large fan-in value is called a hub.

An example:
email → (head, body)
head → (from, to+, cc*, subject)
from → (ε)
to → (ε)
cc → (ε)
subject → (pcdata)
body → (text, attachment*)
text → (pcdata)
attachment → (ε)

The fan-in value of the email element is 0, and the fan-in value of all other elements in this DTD is 1.
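Fan-in can be computed by counting, for each element name, the distinct element names whose content model mentions it. The sketch below assumes the DTD is given as a dictionary from element name to child names. The name `fan_in` is illustrative.

```python
email_dtd = {
    "email": ["head", "body"],
    "head": ["from", "to", "cc", "subject"],
    "from": [], "to": [], "cc": [], "subject": [],
    "body": ["text", "attachment"],
    "text": [], "attachment": [],
}


def fan_in(dtd):
    """Number of distinct parents for every element name in the DTD."""
    counts = {element: 0 for element in dtd}
    for parent, children in dtd.items():
        for child in set(children):  # a parent counts once per child
            counts[child] = counts.get(child, 0) + 1
    return counts


for element, value in fan_in(email_dtd).items():
    print(element, value)
```

Sorting the result by value in descending order would list hub candidates first.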

Global properties (cont.)

Result:

Fan-in of elements in data DTDs

Fan-in of elements in meta DTDs

Summary

Local properties: Content model classification, Syntactic complexity, Determinism, Ambiguity

Global properties: Reachability, Recursions, Chain of stars, Hubs

One drawback of this survey: It does not study any properties of attributes
