XML Data Management Deterministic DTDs and Schemas Werner Nutt


How Expressive can a Schema Be?

<xsd:element name="A" type="oneB"/>

<xsd:complexType name="onlyAs">
  <xsd:choice>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
    <xsd:element name="A" type="xsd:string"/>
  </xsd:choice>
</xsd:complexType>

<xsd:complexType name="oneB">
  <xsd:choice>
    <xsd:element name="B" type="xsd:string"/>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="oneB"/>
    </xsd:sequence>
    <xsd:sequence>
      <xsd:element name="A" type="oneB"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
  </xsd:choice>
</xsd:complexType>

Arbitrarily deep binary tree with A elements and a single B element

What would documents look like that satisfy this schema?

How would one check validity? What would be the cost?

What are the pros and cons of allowing such schemas?

This schema is a frequent example in teaching material on XML Schema
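One possible way to check documents against the intended meaning of this schema is a single bottom-up pass that assigns each element at most one of the types. The sketch below is only an illustration under stated assumptions: the function names `classify` and `is_valid` are hypothetical, text content and whitespace are ignored, and an A element without element children is treated as having type xsd:string.

```python
import xml.etree.ElementTree as ET

def classify(node):
    """Return 'string', 'onlyAs', 'oneB' or 'B' for a node, or None if no type fits."""
    kinds = [classify(child) for child in node]
    if node.tag == "B":
        return "B" if not kinds else None
    if node.tag != "A":
        return None
    if not kinds:
        return "string"            # an A leaf of type xsd:string
    if kinds == ["string"]:
        return "onlyAs"            # second branch of the choice in onlyAs
    if kinds == ["B"]:
        return "oneB"              # first branch of the choice in oneB
    if kinds == ["onlyAs", "onlyAs"]:
        return "onlyAs"
    if len(kinds) == 2 and set(kinds) == {"onlyAs", "oneB"}:
        return "oneB"              # either order of the two sequences in oneB
    return None

def is_valid(root):
    """The root element A is declared with type oneB."""
    return classify(root) == "oneB"

doc = ET.fromstring("<A><A><A>x</A></A><A><B>y</B></A></A>")
print(is_valid(doc))
```

Each node is visited once, so the cost is linear in the size of the document, but the type of an A element is only known after its whole subtree has been inspected.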


Let’s see what SAXON says …


• cos-element-consistent: Error for type 'oneB'. Multiple elements with name 'A', with different types, appear in the model group.

• cos-element-consistent: Error for type 'onlyAs'. Multiple elements with name 'A', with different types, appear in the model group.

• cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

• cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

Here is the Full Error Message from Eclipse

I.e., in a given context, elements with the same name must have the same content. Easy to check!

That’s more subtle ...


The Country Example in XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.example.org/country"
            xmlns="http://www.example.org/country"
            elementFormDefault="qualified">

  <xsd:element name="country">
    <xsd:complexType>
      <xsd:choice>
        <xsd:element name="king" type="xsd:string"></xsd:element>
        <xsd:element name="queen" type="xsd:string"></xsd:element>
        <xsd:sequence>
          <xsd:element name="king" type="xsd:string"></xsd:element>
          <xsd:element name="queen" type="xsd:string"></xsd:element>
        </xsd:sequence>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

As DTD: <!ELEMENT country (king | queen | (king,queen))>


This schema is not accepted either …

• cos-nonambig: king and king (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

Let’s check what this means!


What the W3C Standard Explains …

Schema Component Constraint: Unique Particle Attribution

A content model must be formed such that during ·validation· of an element information item sequence, the particle contained directly, indirectly or ·implicitly· therein with which to attempt to ·validate· each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig


Questions and Ideas

Questions:
• How can one make the standard formal?
• How can a validator implement the standard?

Ideas:
• Content models are specified by regular expressions
• A regular expression E can be translated into a finite state automaton A (Glushkov automaton) that checks which strings satisfy E

Construct A from E and check whether A is deterministic


Formalization

• Alphabet Σ (i.e., set of symbols): the element names occurring in the content model

• Regular expressions over Σ are generated with the rule

e, f → a | (ef) | (e|f) | (e)+ | (e)*

where e, f are expressions and a ∈ Σ

• Language L(e) of an expression e (inductively defined)

• Exercise: Which of the following are in the language defined by a* (b | c) a+ ?
– aba
– abca
– aab
– aaacaaa

In the following, we denote concatenation by a dot, no longer by a comma.
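Membership in the language of such an expression can be tried out with any regular expression library. The following is a minimal sketch using Python's `re` module; the helper name `in_language` is an assumption made for illustration, and each element name is written as a single letter.

```python
import re

# The content model a* (b | c) a+ in Python's regular expression syntax.
CONTENT_MODEL = re.compile(r"a*(b|c)a+")

def in_language(word):
    """Return True if the whole word (not just a prefix) matches the content model."""
    return CONTENT_MODEL.fullmatch(word) is not None

for word in ["aba", "abca", "aab", "aaacaaa"]:
    print(word, in_language(word))
```

`fullmatch` is used rather than `match` because a word belongs to L(e) only if the expression consumes it completely.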


Regular Expressions and DTDs

These are formalizations of DTDs and validation:

A DTD is a pair (d, s) where
• s is the start symbol
• d maps every Σ-symbol to a regular expression over Σ

A document tree t satisfies d (t is valid wrt d) iff
• the root of t is labeled s
• for every node n in t, with symbol a, the string formed by the names of the children of n satisfies d(a)

Validation is checking whether a string satisfies a regexp
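The definition above can be turned into a small validator almost literally. The sketch below is an illustration only: the DTD `D`, the start symbol `START`, and the function `is_valid` are hypothetical, and element names are assumed to be single letters so that the names of the children can be concatenated into a string.

```python
import re
import xml.etree.ElementTree as ET

# d maps every element name to a regular expression over element names.
# Hypothetical example: r has content model a*(b|c)a+, the other elements are empty.
D = {"r": "a*(b|c)a+", "a": "", "b": "", "c": ""}
START = "r"

def is_valid(root, d=D, start=START):
    """Check that the root is labeled with the start symbol and every node satisfies d."""
    if root.tag != start:
        return False
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag not in d:
            return False
        child_names = "".join(child.tag for child in node)
        if re.fullmatch(d[node.tag], child_names) is None:
            return False
        stack.extend(list(node))
    return True

print(is_valid(ET.fromstring("<r><a/><b/><a/></r>")))
```

Here the regular expression engine is a general backtracking matcher; the point of the following slides is that deterministic content models allow a much simpler matching procedure.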


Markings

Distinguish between the different occurrences of a symbol in a regexp by using numbers: markings of regexps

Examples:

• a1* (b2 | c3) a4+ is a marking of a* (b | c) a+

• king1 | queen2 | king3 queen4 is a marking of king | queen | king queen

Definition

A marking e′ of a regular expression e is an assignment of numbers to every symbol in e.
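A marking like the ones in the examples can be produced mechanically by numbering the symbols from left to right. This is a minimal sketch; the function name `mark` is an assumption, and symbols are taken to be runs of letters.

```python
import re

def mark(expr):
    """Append a running number to every symbol occurrence in a regular expression."""
    counter = 0

    def number(match):
        nonlocal counter
        counter += 1
        return f"{match.group(0)}{counter}"

    # Every maximal run of letters counts as one symbol occurrence.
    return re.sub(r"[A-Za-z]+", number, expr)

print(mark("a* (b | c) a+"))
print(mark("king | queen | king queen"))
```

Other numbering schemes are equally valid markings, as long as different occurrences of the same symbol receive different numbers.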


Unmarked Version

Consider a regular expression e and a marking e′ of e

Definition:

For w ∈ L(e′), we denote by w# the corresponding unmarked string in L(e).

Example:

If w = b2a1a3, then w# = baa
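Unmarking simply drops the numbers again. A short sketch, assuming that symbols themselves contain no digits (the name `unmark` is hypothetical):

```python
import re

def unmark(marked_word):
    """Remove all subscripts (digits) from a marked word."""
    return re.sub(r"\d+", "", marked_word)

print(unmark("b2a1a3"))
```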


“Unique Particle Attribution”: Formalization (Brüggemann-Klein/Wood [1998])

Definition: A regular expression r is deterministic iff there are no strings uxv, uyw ∈ L(r′) with
• |x| = |y| = 1
• x ≠ y (x and y are different marked symbols)
• x# = y# (their unmarking is the same).

Example: (a | b)* a is not deterministic because there are
• the marking (a1 | b2)* a3
• the strings b2 a1 a3 (u = b2, x = a1, v = a3) and b2 a3 (u = b2, y = a3, w empty)

How can we check whether a regular expression is deterministic?


Finite State Automata

• Regular languages can also be defined using automata
• A finite state automaton (FSA) consists of:
– a set of states Q
– an alphabet Σ (i.e., a set of symbols)
– a transition function δ, which maps every pair (q,a) to a set of states q′
– an initial state q0
– a set of accepting states F

• A word a1…an is in the language defined by an automaton if there is a path from q0 to a state in F with edges labeled a1,…,an

The automaton is deterministic if every pair (q,a) is mapped to at most one state
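The definition can be illustrated with a few lines of code. In this sketch the transition function is a dictionary from (state, symbol) pairs to sets of states; the names `accepts`, `is_deterministic`, and the example automaton `DELTA` are assumptions made for illustration and do not come from the slides.

```python
def accepts(delta, start, accepting, word):
    """Follow all possible paths at once by tracking the set of reachable states."""
    current = {start}
    for symbol in word:
        current = {t for q in current for t in delta.get((q, symbol), set())}
    return bool(current & accepting)

def is_deterministic(delta):
    """Deterministic: no pair (q, a) is mapped to more than one state."""
    return all(len(targets) <= 1 for targets in delta.values())

# Hypothetical example automaton for the words over {a, b} that end in a.
DELTA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}

print(accepts(DELTA, "q0", {"q1"}, "ba"))
print(is_deterministic(DELTA))
```

Tracking a set of states makes the same function work for deterministic and non-deterministic automata; for a deterministic automaton the set never holds more than one state.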


Which Language Does this FSA Define?

[Figure: state diagram of an FSA with states q0, q1, q2, q3 and edges labeled a, b, and c]


Non-Deterministic Automata

• An automaton is non-deterministic if there is a state q and a letter a such that there are at least two transitions from q via edges labeled with a

What words are in the language of a non-deterministic automaton?

• We now create a Glushkov automaton from a regular expression


Creating a Glushkov Automaton from a Regular Expression

a*(b|c)a+

Step 1: Create a marking of the expression

a1*(b1|c1)a2+


Step 2: Create a state q0 and create a state for each subscripted letter

a1*(b1|c1)a2+

States: q0, a1, b1, c1, a2

Step 3: Choose as accepting states all subscripted letters with which it is possible to end a word

How do we find these states?


Step 4: Create a transition from a state li to a state kj if there is a word in which kj follows li. Also create a transition from q0 to every state kj with which a word can begin.

Label the transition with k

a1*(b1|c1)a2+

How do we find these transitions?
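For the running example, the result of the four steps can be written down by hand and tried out. The sketch below lists the transitions one obtains for a1*(b1|c1)a2+, including the transitions from q0 to the letters that can start a word; the dictionary layout and the names `EDGES`, `label`, and `accepts` are assumptions chosen for illustration.

```python
# Glushkov automaton of a*(b|c)a+ for the marking a1*(b1|c1)a2+, written out by hand.
# A transition into a state is labeled with the unmarked letter of that state.
START = "q0"
ACCEPTING = {"a2"}          # only a2 can end a word
EDGES = {
    "q0": ["a1", "b1", "c1"],   # letters that can start a word
    "a1": ["a1", "b1", "c1"],   # letters that can follow a1
    "b1": ["a2"],
    "c1": ["a2"],
    "a2": ["a2"],
}

def label(state):
    """Strip the subscript from a marked letter, e.g. 'a2' becomes 'a'."""
    return state.rstrip("0123456789")

def accepts(word):
    """Run the automaton on a word given as a string of single-letter symbols."""
    current = {START}
    for letter in word:
        current = {t for q in current for t in EDGES[q] if label(t) == letter}
    return bool(current & ACCEPTING)

print(accepts("aba"), accepts("aab"))
```

No state has two outgoing transitions with the same label, so this particular automaton is deterministic.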


Exercises

What are the Glushkov automata of

• a* b (a b)*

• (a | b)* a (a | b)

• (a | b)*a ?


Recognizing Deterministic Regular Expressions

Theorem (Book et al. 1971; Brüggemann-Klein and Wood 1998)

A regular expression is deterministic (one-unambiguous) iff its Glushkov automaton is deterministic.
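By the theorem, checking an expression amounts to inspecting the transitions of its Glushkov automaton. The following sketch applies this to two hand-written automata: the one for the country content model king1 | queen2 | king3 queen4 and the one for a1*(b1|c1)a2+. The dictionary representation and the name `is_deterministic` are illustrative assumptions.

```python
def is_deterministic(edges):
    """edges maps each state to the marked symbols reachable in one step.

    The automaton is deterministic if no state has two outgoing
    transitions whose targets carry the same unmarked symbol.
    """
    for targets in edges.values():
        labels = [t.rstrip("0123456789") for t in targets]
        if len(labels) != len(set(labels)):
            return False
    return True

# king1 | queen2 | king3 queen4
COUNTRY = {
    "q0": ["king1", "queen2", "king3"],
    "king1": [],
    "queen2": [],
    "king3": ["queen4"],
    "queen4": [],
}

# a1*(b1|c1)a2+
EXAMPLE = {
    "q0": ["a1", "b1", "c1"],
    "a1": ["a1", "b1", "c1"],
    "b1": ["a2"],
    "c1": ["a2"],
    "a2": ["a2"],
}

print(is_deterministic(COUNTRY), is_deterministic(EXAMPLE))
```

In the country automaton, q0 has two transitions labeled king, which corresponds to the ambiguity between the two king particles reported in the error message.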


Construction of the Glushkov Automaton

For an arbitrary alphabet Σ and a language L ⊆ Σ* we define two sets

first(L) = { a | ∃u ∈ Σ*. au ∈ L }
last(L) = { a | ∃u ∈ Σ*. ua ∈ L }

and the function

follow(L,a) = { b | ∃u,v ∈ Σ*. uabv ∈ L }.
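For a finite set of words, the three definitions can be evaluated directly. The sketch below does this for a small sample of words; it is only meant to make the definitions concrete, since the languages of interest are usually infinite. Words are strings of single-letter symbols, and the sample is a hypothetical choice of words from the language of a*(b|c)a+.

```python
def first(language):
    """Symbols with which some word of the language begins."""
    return {w[0] for w in language if w}

def last(language):
    """Symbols with which some word of the language ends."""
    return {w[-1] for w in language if w}

def follow(language, a):
    """Symbols that occur immediately after a in some word of the language."""
    return {w[i + 1] for w in language for i in range(len(w) - 1) if w[i] == a}

SAMPLE = {"ba", "ca", "aba", "baa"}
print(first(SAMPLE), last(SAMPLE), follow(SAMPLE, "a"))
```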

Consider an expression e and its marking e′

We can construct the Glushkov automaton for e if we know the sets first(L(e′)), last(L(e′)), the function follow(L(e′), ·), and whether ε ∈ L(e′) (ε is the empty word).

Why?


Where do we get this info?

If e = a1, then
• first(L(e)) = {a1}
• last(L(e)) = {a1}
• follow(L(e), li) is empty for every li (nothing can follow a1)
Also, ε ∉ L(e)

If e = (f | g), then
• first(L(e)) = first(L(f)) ∪ first(L(g))
• last(L(e)) = last(L(f)) ∪ last(L(g))
• follow(L(e), li) is follow(L(f), li) if li occurs in f and follow(L(g), li) if li occurs in g
Also, ε ∈ L(e) if ε ∈ L(f) or ε ∈ L(g)

For e = f*, f+, fg: exercise!
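One possible solution to the exercise, together with the cases above, can be sketched as a recursive function over the syntax tree of a marked expression. This is an illustration under stated assumptions, not the algorithm of the cited papers: expressions are encoded as nested tuples, marked symbols are strings such as "a1", and the names `analyse`, `unmark`, and `is_deterministic` are hypothetical.

```python
# Marked expressions as nested tuples:
#   ("sym", "a1")   ("alt", f, g)   ("cat", f, g)   ("star", f)   ("plus", f)

def analyse(e):
    """Return (has_empty_word, first, last, follow) for a marked expression."""
    kind = e[0]
    if kind == "sym":
        return False, {e[1]}, {e[1]}, {e[1]: set()}
    if kind in ("alt", "cat"):
        n1, f1, l1, fo1 = analyse(e[1])
        n2, f2, l2, fo2 = analyse(e[2])
        follow = {**fo1, **fo2}          # marked symbols of f and g are disjoint
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2, follow
        for s in l1:                     # after the end of f comes the start of g
            follow[s] = follow[s] | f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last, follow
    if kind in ("star", "plus"):
        n, f, l, fo = analyse(e[1])
        follow = dict(fo)
        for s in l:                      # iteration: the end of f loops back to its start
            follow[s] = follow[s] | f
        return kind == "star" or n, f, l, follow
    raise ValueError(f"unknown operator: {kind}")

def unmark(symbol):
    return symbol.rstrip("0123456789")

def is_deterministic(e):
    """Check that no state of the Glushkov automaton has two equally labeled successors."""
    _, first, _, follow = analyse(e)
    for targets in [first, *follow.values()]:
        labels = [unmark(t) for t in targets]
        if len(labels) != len(set(labels)):
            return False
    return True

sym = lambda s: ("sym", s)
EXAMPLE = ("cat", ("cat", ("star", sym("a1")), ("alt", sym("b1"), sym("c1"))), ("plus", sym("a2")))
ENDS_IN_A = ("cat", ("star", ("alt", sym("a1"), sym("b2"))), sym("a3"))
print(is_deterministic(EXAMPLE), is_deterministic(ENDS_IN_A))
```

The set `first` gives the transitions out of q0, `follow` gives the transitions out of every other state, and `last` (plus q0 if the empty word is in the language) gives the accepting states, which is why these four pieces of information suffice to build the automaton.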


Recognizing Deterministic Regular Expressions

Observation:
• For each operator, first, last, and follow can be computed in quadratic time.

This yields an O(n³) algorithm.

Theorem (Brüggemann-Klein, Wood, 1998)

• There is an O(n²) algorithm to check whether a regexp is deterministic.


More Results

Theorems (Brüggemann-Klein, Wood, 1998)

• Not every regular language can be denoted by a deterministic regular expression.

E.g., the language of (a | b)* a (a | b)

• Deterministic regular languages are not closed under union, concatenation, or Kleene-star.

I.e., there is no easy syntactic characterization

• If it exists, an equivalent deterministic regular expression can be constructed in exponential time.

It is possible to help users, but that is costly


Theory for XML Schema

XML Schema allows schemas where
• the same element appears with different types

However,
• it is illegal to have two elements of the same name, but different types in one content model.

Also, content models must be deterministic.

Consequence:

Documents can be validated in a deterministic top-down pass


References

This material draws upon slides by
• Sara Cohen
• Frank Neven,

notes by
• Leonid Libkin

and the papers by A. Brüggemann-Klein and D. Wood