XML Data Management 10. Deterministic DTDs and Schemas Werner Nutt


  • XML Data Management

    10. Deterministic DTDs and Schemas

    Werner Nutt

  • How Expressive can a Schema Be?

    Arbitrarily deep binary tree with A elements, and a single B element

    What would documents look like that satisfy this schema?
    How would one check validity? What would be the cost?
    What are the pros and cons of allowing such schemas?

    This schema is a frequent example in teaching material on XML Schema.

  • Let's see what SAXON says

  • cos-element-consistent: Error for type 'oneB'. Multiple elements with name 'A', with different types, appear in the model group.

    cos-element-consistent: Error for type 'onlyAs'. Multiple elements with name 'A', with different types, appear in the model group.

    cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

    cos-nonambig: A and A (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

    Here is the full error message from Eclipse.

    I.e., in a given context, elements with the same name must have the same content. Easy to check!

    That's more subtle ...

  • The Country Example in XML Schema

    As DTD:

  • This is also not validated

    cos-nonambig: king and king (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

    Let's check what this means!

  • What the W3C Standard Explains

    Schema Component Constraint: Unique Particle Attribution

    A content model must be formed such that during validation of an element information item sequence, the particle contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

    http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig

  • Questions and Ideas

    Questions:
    How can one make the standard formal?
    How can a validator implement the standard?

    Ideas:
    Content models are specified by regular expressions.
    A regular expression E can be translated into a finite state automaton A (Glushkov automaton) that checks which strings satisfy E.
    Construct A from E and check whether A is deterministic.

  • Formalization

    Alphabet $\Sigma$ (i.e., set of symbols): the element names occurring in the content model

    Regular expressions over $\Sigma$ are generated with the rule
    $e, f \;\to\; a \;\mid\; (ef) \;\mid\; (e|f) \;\mid\; (e)^+ \;\mid\; (e)^*$
    where e, f are expressions and $a \in \Sigma$

    Language L(e) of an expression e (inductively defined)

    Exercise: Which of the following are in the language defined by a* (b | c) a+ ?
    abaabcaaabaaacaaa

    In the following, we denote concatenation by a dot, no longer by a comma.
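
    Membership in the language of a content model can be tried out with any regular expression library. The following is a minimal sketch using Python's `re` module; the sample words are hypothetical and are not the ones from the exercise.

```python
import re

# The content model a* (b | c) a+ in Python's regex syntax.
# The operators *, + and | have the same meaning as in the grammar above.
CONTENT_MODEL = re.compile(r"a*(b|c)a+")


def in_language(word: str) -> bool:
    """Return True iff the WHOLE word matches the expression."""
    return CONTENT_MODEL.fullmatch(word) is not None


# Hypothetical sample words, chosen only for illustration.
samples = ["ba", "aaca", "ab", "bca", ""]
results = {w: in_language(w) for w in samples}
print(results)
```

    `fullmatch` is used rather than `match` or `search` because a word belongs to L(e) only if the entire word is consumed by the expression.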

  • Regular Expressions and DTDs

    These are formalizations of DTDs and validation:

    A DTD is a pair (d, s) where
    s is the start symbol
    d maps every $\Sigma$-symbol to a regular expression over $\Sigma$

    A document tree t satisfies d (t is valid wrt d) iff
    the root of t is labeled s
    for every node n in t, with symbol a, the string formed by the names of the children of n satisfies d(a)

    Validation is checking whether a string satisfies a regexp
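
    The definition of validity translates almost literally into code. The sketch below assumes single-letter element names, so that the child sequence of a node can be written as a plain string; the DTD `D`, the start symbol `S`, and the `Node` class are hypothetical and serve only as illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


# Hypothetical DTD (d, s): d maps each element name to a content model.
D = {"r": "a*(b|c)a+", "a": "", "b": "a*", "c": ""}
S = "r"


def node_ok(n: Node, d: dict) -> bool:
    """Check the content model of n and, recursively, of all descendants."""
    if n.label not in d:
        return False
    # The string formed by the names of the children of n ...
    word = "".join(child.label for child in n.children)
    # ... must satisfy d(label of n).
    if re.fullmatch(d[n.label], word) is None:
        return False
    return all(node_ok(child, d) for child in n.children)


def is_valid(t: Node, d: dict, s: str) -> bool:
    """t is valid wrt (d, s) iff the root is labeled s and every node is ok."""
    return t.label == s and node_ok(t, d)


good = Node("r", [Node("a"), Node("b", [Node("a")]), Node("a")])
bad = Node("r", [Node("a"), Node("b")])  # children spell "ab": no trailing a
print(is_valid(good, D, S), is_valid(bad, D, S))
```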

  • Markings

    Distinguish between the different occurrences of a symbol in a regexp by using numbers: markings of regexps

    Examples:
    $a_1^* (b_2 \mid c_3)\, a_4^+$ is a marking of a* (b | c) a+
    $king_1 \mid queen_2 \mid king_3\, queen_4$ is a marking of king | queen | king queen

    Definition
    A marking $e'$ of a regular expression $e$ is an assignment of numbers to every symbol in $e$.

  • Unmarked Version

    Consider a regular expression $e$ and a marking $e'$ of $e$

    Definition:
    For $w \in L(e')$, we denote by $w^\#$ the corresponding unmarked string in $L(e)$.

    Example: If $w = b_2 a_1 a_3$, then $w^\# = baa$
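
    Marking and unmarking are purely textual operations. The sketch below numbers the symbols of an expression from left to right and strips the numbers again; it assumes that element names consist of letters only, and the helper names `mark` and `unmark` are made up for this illustration.

```python
import itertools
import re


def mark(expr: str) -> str:
    """Append 1, 2, 3, ... to the symbols of expr, from left to right."""
    counter = itertools.count(1)
    return re.sub(r"[A-Za-z]+", lambda m: m.group(0) + str(next(counter)), expr)


def unmark(marked: str) -> str:
    """Drop the number after every symbol (the operation written w#)."""
    return re.sub(r"([A-Za-z]+)\d+", r"\1", marked)


print(mark("a* (b | c) a+"))              # a1* (b2 | c3) a4+
print(mark("king | queen | king queen"))  # king1 | queen2 | king3 queen4
print(unmark("b2a1a3"))                   # baa
```

    Numbering every occurrence consecutively is only one possible marking; any assignment that gives different numbers to different occurrences of the same symbol works equally well.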

  • Unique Particle Attribution: Formalization

    Brüggemann-Klein/Wood [1998]

    Definition: A regular expression $r$ with marking $r'$ is deterministic iff there are no strings $uxv, uyw \in L(r')$ with
    $|x| = |y| = 1$
    $x \neq y$ (x and y are different marked symbols)
    $x^\# = y^\#$ (their unmarking is the same).

    Example: (a | b)* a is not deterministic because there are
    the marking $(a_1 \mid b_2)^* a_3$
    the strings $b_2 a_1 a_3$ (with $u = b_2$, $x = a_1$, $v = a_3$) and $b_2 a_3$ (with $u = b_2$, $y = a_3$, $w = \varepsilon$)

    How can we check whether an expression is deterministic?

  • Finite State Automata

    Regular languages can also be defined using automata.

    A finite state automaton (FSA) consists of:
    a set of states $Q$
    an alphabet $\Sigma$ (i.e., a set of symbols)
    a transition function $\delta$, which maps every pair $(q, a)$ to a set of states $q'$
    an initial state $q_0$
    a set of accepting states $F$

    A word $a_1 \ldots a_n$ is in the language defined by an automaton if there is a path from $q_0$ to a state in $F$ with edges labeled $a_1, \ldots, a_n$.

    The automaton is deterministic if every pair $(q, a)$ is mapped to only a single state.
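
    The acceptance condition can be simulated by tracking the set of states reachable after each letter. The following is a minimal sketch; the example automaton (words over a and b that end with a) is hypothetical, and the transition function is stored as a dictionary from (state, letter) pairs to sets of states.

```python
def accepts(delta, q0, accepting, word):
    """True iff some path labeled `word` leads from q0 to an accepting state."""
    current = {q0}
    for letter in word:
        # Follow every transition labeled `letter` from every current state.
        current = set().union(*(delta.get((q, letter), set()) for q in current))
    return bool(current & accepting)


def is_deterministic(delta):
    """True iff no pair (q, a) is mapped to more than one state."""
    return all(len(targets) <= 1 for targets in delta.values())


# Hypothetical automaton: accepts the words over {a, b} that end with a.
DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
}
print(accepts(DELTA, "q0", {"q1"}, "ba"))  # True
print(accepts(DELTA, "q0", {"q1"}, "ab"))  # False
print(is_deterministic(DELTA))             # False
```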

  • Which Language Does this FSA Define?

  • Non-Deterministic Automata

    An automaton is non-deterministic if there is a state q and a letter a such that there are at least two transitions from q via edges labeled with a.

    What words are in the language of a non-deterministic automaton?

    We now create a Glushkov automaton from a regular expression.

  • Creating a Glushkov Automaton from a Regular Expression

    a*(b|c)a+

    Step 1: Create a marking of the expression

  • Creating a Glushkov Automaton from a Regular Expression

    $a_1^*(b_1 \mid c_1)a_2^+$

    Step 2: Create a state $q_0$ and create a state for each subscripted letter

    Step 3: Choose as accepting states all subscripted letters with which it is possible to end a word

    How do we find these states?

  • Creating a Glushkov Automaton from a Regular Expression

    $a_1^*(b_1 \mid c_1)a_2^+$

    Step 4: Create a transition from a state $l_i$ to a state $k_j$ if there is a word in which $k_j$ follows $l_i$. Label the transition with $k$.

    How do we find these transitions?
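
    As a worked example, the four steps can be carried out by hand for a*(b|c)a+ with the marking a1, b1, c1, a2. The table below is one reading of the result, not output taken from the slides: the initial state q0 gets a transition to every letter that can start a word, and a2 is the only letter that can end a word.

```python
Q0 = "q0"

# TRANSITIONS[state][letter] -> next state. A plain dict is enough here
# because every state has at most one successor per letter.
TRANSITIONS = {
    "q0": {"a": "a1", "b": "b1", "c": "c1"},  # letters that can start a word
    "a1": {"a": "a1", "b": "b1", "c": "c1"},  # a1 is followed by a1, b1 or c1
    "b1": {"a": "a2"},
    "c1": {"a": "a2"},
    "a2": {"a": "a2"},                        # a2+ loops on itself
}
ACCEPTING = {"a2"}  # Step 3: only a2 can end a word


def run(word: str) -> bool:
    """Run the automaton on word and report whether it ends in ACCEPTING."""
    state = Q0
    for letter in word:
        state = TRANSITIONS.get(state, {}).get(letter)
        if state is None:  # no transition: reject
            return False
    return state in ACCEPTING


print(run("aaba"), run("ca"), run("ab"), run("bb"))
```

    Each target state is labeled with its own letter, so two transitions with the same label from one state can only arise when two differently numbered occurrences of the same letter are reachable from it. That does not happen in this table.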

  • Exercises

    What are the Glushkov automata of

    a* b (a b)*

    (a | b)* a (a | b)

    (a | b)*a ?

  • Recognizing Deterministic Regular Expressions

    Theorem (Book et al., 1971; Brüggemann-Klein, Wood, 1998)

    A regular expression is deterministic (one-unambiguous) iff its Glushkov automaton is deterministic.

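  • Construction of the Glushkov Automaton

    For an arbitrary alphabet $\Sigma$ and a language $L \subseteq \Sigma^*$ we define two sets
    $\mathrm{first}(L) = \{a \in \Sigma \mid \exists u \in \Sigma^*.\ au \in L\}$
    $\mathrm{last}(L) = \{a \in \Sigma \mid \exists u \in \Sigma^*.\ ua \in L\}$
    and the function
    $\mathrm{follow}(L, a) = \{b \in \Sigma \mid \exists u, v \in \Sigma^*.\ uabv \in L\}$.

    Consider an expression $e$ and its marking $e'$.

    We can construct the Glushkov automaton for $e$ if we know the sets $\mathrm{first}(L(e'))$, $\mathrm{last}(L(e'))$, the function $\mathrm{follow}(L(e'), \cdot)$, and if we know whether $\varepsilon \in L(e')$ ($\varepsilon$ is the empty word).

    Why?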

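  • Construction of the Glushkov Automaton

    Where do we get this info?

    If $e' = a_1$, then
    $\mathrm{first}(L(e')) = \{a_1\}$
    $\mathrm{last}(L(e')) = \{a_1\}$
    $\mathrm{follow}(L(e'), l_i) = \emptyset$ for every $l_i$
    Also, $\varepsilon \notin L(e')$

    If $e' = (f' \mid g')$, then
    $\mathrm{first}(L(e')) = \mathrm{first}(L(f')) \cup \mathrm{first}(L(g'))$
    $\mathrm{last}(L(e')) = \mathrm{last}(L(f')) \cup \mathrm{last}(L(g'))$
    $\mathrm{follow}(L(e'), l_i) = \begin{cases} \mathrm{follow}(L(f'), l_i) & \text{if } l_i \text{ occurs in } f' \\ \mathrm{follow}(L(g'), l_i) & \text{if } l_i \text{ occurs in } g' \end{cases}$
    Also, $\varepsilon \in L(e')$ if $\varepsilon \in L(f')$ or $\varepsilon \in L(g')$

    For e = f*, f+, fg: exercise!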

  • Construction of the Glushkov Automaton

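    If $e' = (f'g')$, then

    $\mathrm{first}(L(e')) = \begin{cases} \mathrm{first}(L(f')) \cup \mathrm{first}(L(g')) & \text{if } \varepsilon \in L(f') \\ \mathrm{first}(L(f')) & \text{otherwise} \end{cases}$

    $\mathrm{last}(L(e')) = \begin{cases} \mathrm{last}(L(f')) \cup \mathrm{last}(L(g')) & \text{if } \varepsilon \in L(g') \\ \mathrm{last}(L(g')) & \text{otherwise} \end{cases}$

    $\mathrm{follow}(L(e'), l_i) = \begin{cases} \mathrm{follow}(L(f'), l_i) & \text{if } l_i \text{ occurs in } f' \text{ and } l_i \notin \mathrm{last}(L(f')) \\ \mathrm{follow}(L(f'), l_i) \cup \mathrm{first}(L(g')) & \text{if } l_i \in \mathrm{last}(L(f')) \\ \mathrm{follow}(L(g'), l_i) & \text{if } l_i \text{ occurs in } g' \end{cases}$

    Also, $\varepsilon \in L(e')$ if $\varepsilon \in L(f')$ and $\varepsilon \in L(g')$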

  • Construction of the Glushkov Automaton

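    If $e' = (f')^*$, then

    $\mathrm{first}(L(e')) = \mathrm{first}(L(f'))$

    $\mathrm{last}(L(e')) = \mathrm{last}(L(f'))$

    $\mathrm{follow}(L(e'), l_i) = \begin{cases} \mathrm{follow}(L(f'), l_i) & \text{if } l_i \text{ occurs in } f' \text{ and } l_i \notin \mathrm{last}(L(f')) \\ \mathrm{follow}(L(f'), l_i) \cup \mathrm{first}(L(f')) & \text{if } l_i \in \mathrm{last}(L(f')) \end{cases}$

    Also, $\varepsilon \in L(e')$ always holds, whether or not $\varepsilon \in L(f')$.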
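
    The inductive rules can be turned into a short recursive function. The sketch below is a direct, unoptimized reading of the rules, not the quadratic algorithm of Brüggemann-Klein and Wood. Marked expressions are encoded as nested tuples, which is an assumption of this example, and the case f+ (left as an exercise on the slides) is treated like f* except that it does not add the empty word.

```python
def glushkov(e):
    """Return (nullable, first, last, follow) for a marked expression e.

    e is ("sym", "a1"), ("alt", f, g), ("cat", f, g), ("star", f) or ("plus", f).
    """
    kind = e[0]
    if kind == "sym":
        pos = e[1]
        return False, {pos}, {pos}, {pos: set()}
    if kind in ("star", "plus"):
        null, first, last, follow = glushkov(e[1])
        follow = {p: set(s) for p, s in follow.items()}  # copy before changing
        for p in last:
            follow[p] |= first  # after a last symbol the loop may restart
        return (kind == "star") or null, first, last, follow
    nf, ff, lf, folf = glushkov(e[1])
    ng, fg, lg, folg = glushkov(e[2])
    follow = {**folf, **folg}  # f and g have disjoint marked symbols
    if kind == "alt":
        return nf or ng, ff | fg, lf | lg, follow
    if kind == "cat":
        for p in lf:
            follow[p] = follow[p] | fg  # last of f is followed by first of g
        first = (ff | fg) if nf else ff
        last = (lf | lg) if ng else lg
        return nf and ng, first, last, follow
    raise ValueError(f"unknown node kind: {kind}")


def letter(pos):
    """Unmark a symbol: 'a1' -> 'a'."""
    return pos.rstrip("0123456789")


def is_deterministic(e):
    """True iff no first or follow set holds two symbols with the same letter."""
    _, first, _, follow = glushkov(e)
    for targets in [first, *follow.values()]:
        letters = [letter(p) for p in targets]
        if len(letters) != len(set(letters)):
            return False
    return True


# (a1 | b2)* a3, a marking of (a | b)* a
E1 = ("cat", ("star", ("alt", ("sym", "a1"), ("sym", "b2"))), ("sym", "a3"))
# a1* (b1 | c1) a2+, a marking of a* (b | c) a+
E2 = ("cat",
      ("cat", ("star", ("sym", "a1")), ("alt", ("sym", "b1"), ("sym", "c1"))),
      ("plus", ("sym", "a2")))

print(is_deterministic(E1), is_deterministic(E2))
```

    The first set gives the transitions out of q0, the follow sets give all other transitions, and the last set (plus q0 if the expression is nullable) gives the accepting states, so checking the first and follow sets is the same as checking the Glushkov automaton for determinism.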

  • Recognizing Deterministic Regular Expressions

    Observation:
    For each operator, first, last, and follow can be computed in quadratic time.
    This yields an $O(n^3)$ algorithm.

    Theorem (Brüggemann-Klein, Wood, 1998)
    There is an $O(n^2)$ algorithm to check whether a regexp is deterministic.

  • More Results

    Theorems (Brüggemann-Klein, Wood, 1998)

    Not every regular language can be denoted by a deterministic regular expression.

    E.g., (a | b)* a (a | b)

    Deterministic regular languages are not closed under union, concatenation, or Kleene-star.
    I.e., there is no easy syntactic characterization.

    If it exists, an equivalent deterministic regular expression can be constructed in exponential time.
    It is possible to help users, but that is costly.

  • Theory for XML Schema

    XML Schema allows schemas where the same element appears with different types.

    However, it is illegal to have two elements of the same name, but different types in one content model.

    Also, content models must be deterministic.

    Consequence: Documents can be validated in a deterministic top-down pass.

  • References

    This material draws upon slides by Sara Cohen and Frank Neven, notes by Leonid Libkin, and the papers by A. Brüggemann-Klein and D. Wood.