UPA and Restriction for All-Groups and Numeric Exponents
Matthew Fuchs, PhD
Westbridge Technology
matt@westbridgetech.com
Allen Brown, PhD
Microsoft
Why Bother?
• Numeric Exponents introduced by the W3C XML Schema WG.
• Restriction is a subsumption relation among content models.
• And-groups long cherished by Markup Community.
• UPA is an old constraint on content models in WXS.
• What is the cost of combination?
Naïve Algorithms
• Exponential or worse:
– All-groups: try all exponential cases.
– Numeric exponents: unroll, doubly exponential:
• First unroll: (a{0,3} | b){10, 20} => ((a | aa | aaa | b)…(a |…)….
• Then determinise.
– Used by XSV, Xerces, Sun.
• To not try to do better is simply remiss.
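The growth caused by unrolling can be estimated with a few lines of Python. This is a minimal sketch under an assumption not taken from the slides: x{lo,hi} is unrolled into hi copies of x (lo mandatory, the rest optional), which is one common scheme and slightly different from the enumeration shown above. The class names Sym, Alt and Rep and the function unrolled_size are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Alt:
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class Rep:
    child: "Node"
    lo: int
    hi: int


Node = Union[Sym, Alt, Rep]


def unrolled_size(node: Node) -> int:
    """Count terminal occurrences once every x{lo,hi} is replaced by hi copies of x."""
    if isinstance(node, Sym):
        return 1
    if isinstance(node, Alt):
        return sum(unrolled_size(p) for p in node.parts)
    if isinstance(node, Rep):
        # Assumption: hi copies in total, regardless of how many are optional.
        return node.hi * unrolled_size(node.child)
    raise TypeError(f"unexpected node: {node!r}")


# (a{0,3} | b){10,20}
expr = Rep(Alt((Rep(Sym("a"), 0, 3), Sym("b"))), 10, 20)
size = unrolled_size(expr)                        # 20 * (3 + 1)
nested_size = unrolled_size(Rep(expr, 10, 20))    # one more level of nesting
```

Each extra level of nesting multiplies the count by the outer maximum, and determinising the unrolled expression can then blow up again in the number of positions.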
UPA Testing
• Generally just need to check follow sets.
• Problem for numeric exponents for {m,m}.
• For example:
– (a1,b2){2,2},a3 => ababa
– ((a1, b2){1,3},a3) => aba or ababa or abababa
• Is a1 in follow(b2)?
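The difference between the two examples can be checked by brute force. The sketch below is not the follow-set test from the slides: it simply enumerates every position sequence of ((a1,b2){m,n},a3) and records, for each prefix of letters already read, which positions could consume the next letter. The helper names position_strings and upa_conflicts are hypothetical.

```python
from collections import defaultdict


def position_strings(m, n):
    """All position sequences of ((a1,b2){m,n}, a3)."""
    for k in range(m, n + 1):
        yield ["a1", "b2"] * k + ["a3"]


def upa_conflicts(strings):
    """Map (letters read so far, next letter) to the positions competing for it.

    Only entries with more than one candidate position are kept; any such
    entry means the next letter cannot be attributed without lookahead.
    """
    choices = defaultdict(set)
    for s in strings:
        letters = ""
        for pos in s:
            choices[(letters, pos[0])].add(pos)   # pos[0] is the letter
            letters += pos[0]
    return {k: v for k, v in choices.items() if len(v) > 1}


fixed = upa_conflicts(position_strings(2, 2))    # (a1,b2){2,2},a3
ranged = upa_conflicts(position_strings(1, 3))   # (a1,b2){1,3},a3
```

With the fixed exponent no prefix has two candidates, although a1 and a3 both follow b2 somewhere in the string. With the range {1,3}, the letter a after "ab" can be either a1 or a3.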
Problem for All-groups
• Again: are different branches in each other's follow sets?
• (a & b & c) => follow(a) = {b, c}
• (a & b? & c) => follow(a) = {b, c} union follow(b) => {a, b, c}
• ((a,b?) & b &c) => violates UPA
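The last example can be confirmed by exhaustive enumeration, which is exactly the exponential approach the naive algorithms take. This sketch assumes SGML-style all-groups, in which members appear in any order but are not interleaved. Positions are written a1, b2 and so on; the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations, product


def all_group_strings(members):
    """members: one list per group member, holding its possible position sequences."""
    for order in permutations(members):        # every ordering of the members
        for choice in product(*order):         # one alternative per member
            yield [pos for seq in choice for pos in seq]


def upa_conflicts(strings):
    """Map (letters read so far, next letter) to competing positions, if more than one."""
    choices = defaultdict(set)
    for s in strings:
        letters = ""
        for pos in s:
            choices[(letters, pos[0])].add(pos)
            letters += pos[0]
    return {k: v for k, v in choices.items() if len(v) > 1}


# (a1 & b2 & c3)
plain = [[("a1",)], [("b2",)], [("c3",)]]
# ((a1,b2?) & b3 & c4): the first member matches either a1 or a1 b2
clash = [[("a1",), ("a1", "b2")], [("b3",)], [("c4",)]]

plain_conflicts = upa_conflicts(all_group_strings(plain))
clash_conflicts = upa_conflicts(all_group_strings(clash))
```

For the second group, a b read after a could belong to the optional b inside the first member or to the stand-alone b, which is the violation noted above.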
Five properties of particles
• particles(p) => all particles within p, recursively defined.
• opaque(p) => a particle is opaque if it can’t match the empty string.
• first(p) => particles in p that can match first letter in a string matching p
• follow(p) => particles in the outer expression that can match a letter in a string after substring matched by p.
• confusion(p) => particles in p which could conflict with follow(p)
– (a, b?) => b is in confusion((a, b?))
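Three of these properties depend only on the particle itself and can be sketched directly. The code below is an illustration under stated assumptions, not the authors' implementation: the AST classes are hypothetical, particles(p) is taken to include p itself, and follow and confusion are left out because they depend on the enclosing expression.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Sym:
    label: str                      # e.g. "a1": letter a at position 1


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class Alt:
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class Rep:
    child: "Node"
    lo: int
    hi: int


Node = Union[Sym, Seq, Alt, Rep]


def particles(p):
    """p together with every particle nested inside it."""
    if isinstance(p, Sym):
        return frozenset({p})
    kids = (p.child,) if isinstance(p, Rep) else p.parts
    return frozenset({p}).union(*(particles(k) for k in kids))


def opaque(p):
    """True if p cannot match the empty string."""
    if isinstance(p, Sym):
        return True
    if isinstance(p, Seq):
        return any(opaque(k) for k in p.parts)
    if isinstance(p, Alt):
        return all(opaque(k) for k in p.parts)
    return p.lo > 0 and opaque(p.child)          # Rep


def first(p):
    """Terminal particles that can match the first letter of a string matching p."""
    if isinstance(p, Sym):
        return frozenset({p})
    if isinstance(p, Alt):
        return frozenset().union(*(first(k) for k in p.parts))
    if isinstance(p, Rep):
        return first(p.child) if p.hi > 0 else frozenset()
    result = frozenset()                         # Seq: stop at the first opaque part
    for k in p.parts:
        result |= first(k)
        if opaque(k):
            break
    return result


a1, b2 = Sym("a1"), Sym("b2")
p = Seq((a1, Rep(b2, 0, 1)))        # (a, b?)
q = Seq((Rep(a1, 0, 1), b2))        # (a?, b)
```

Here first(p) contains only a1, while first(q) contains both a1 and b2 because the optional a is not opaque.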
Special Considerations
• follow(p) restricted as follows:
– (((a?,b){m,m}),c) => follow(b) = {c}
– (((a?,b){m, n}),c) => follow(b) = {c, a, b}
– ((a & b & c), d) => follow(c) = {d}
Sources of UPA Violation
• Consider P in
– (, {0,1}, P, )
– (, ( | P), )
– (, ( & P), )
• UPA violation requires 2 terminals:
– One before P, one inside P: need first(P)
– Both inside P: in a moment
– One inside P, one after P: need confusion(P)
– One before P, one after P: opacity(P) is false
Internal Consistency
• P{m, m} – if P obeys UPA, then confusion(P) intersection first(P) != {}
• If P is ( & & ) then
– overlap in first sets
– confusion() intersects (first() U first() != {}
– And so on for and
UPA Algorithm
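$\mathrm{UPA}(p) \Rightarrow$
• $p = a$: if $b_i, b_j \in \mathrm{follow}(a)$, then $i = j$
• $p = q\{m,n\}$: $\mathrm{UPA}(q)$ and $\mathrm{first}(q) \cap \mathrm{confusion}(q) = \{\}$
• $p = (q_1 \mid \dots \mid q_n)$: $\bigcap_{i=1}^{n} \mathrm{first}(q_i) = \{\}$ and $\bigwedge_{i=1}^{n} \mathrm{UPA}(q_i)$
• $p = (q_1 \mathbin{\&} \dots \mathbin{\&} q_n)$: $\bigcap_{i=1}^{n} \mathrm{first}(q_i) = \{\}$, $\bigwedge_{i=1}^{n} \mathrm{UPA}(q_i)$, and $\mathrm{confusion}(q_i) \cap \bigcup_{j \neq i} \mathrm{first}(q_j) = \{\}$
• $p = (q_1, \dots, q_n)$: $\mathrm{UPA}(q_1) \wedge \mathrm{UPA}((q_2, \dots, q_n))$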
Subsumption for Exponents
• Two steps
– For fixed exponents
– For exponent ranges
• Most equipment carries over
• Will use B or b to refer to base model, and R or r to refer to restricted model
Traditional
• Subsumption through transformation into automaton.
• The intersection of the automata, R intersect not(B), should be empty, where not(B) is B with its accepting states inverted.
• Once again, too huge when everything is unrolled.
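The traditional check is a reachability search over the product automaton. The sketch below assumes both models have already been turned into deterministic automata, encoded here (an assumption for illustration) as a triple of start state, accepting set and a transition dictionary keyed by (state, letter). Missing transitions are treated as rejecting, which stands in for complementing B.

```python
from collections import deque


def included(r, b):
    """True if every string accepted by DFA r is also accepted by DFA b."""
    r_start, r_acc, r_delta = r
    b_start, b_acc, b_delta = b
    dead = object()                      # sink state of b, never accepting
    seen = {(r_start, b_start)}
    queue = deque(seen)
    while queue:
        rs, bs = queue.popleft()
        if rs in r_acc and bs not in b_acc:
            return False                 # a string in R intersect not(B)
        for (state, letter), r_next in r_delta.items():
            if state != rs:
                continue
            b_next = dead if bs is dead else b_delta.get((bs, letter), dead)
            pair = (r_next, b_next)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True


# R = (a, b, b), unrolled by hand
R = (0, {3}, {(0, "a"): 1, (1, "b"): 2, (2, "b"): 3})
# B = (a, b{1,3}), unrolled by hand
B = (0, {2, 3, 4}, {(0, "a"): 1, (1, "b"): 2, (2, "b"): 3, (3, "b"): 4})

r_in_b = included(R, B)
b_in_r = included(B, R)
```

The search itself is cheap; the cost lies in the size of the unrolled automata fed into it.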
Our Machines
• Represent regex as graph.
• Forward edges, matching terminals, form a DAG
• Back edges, matching exponents, form connected components.
• Each back edge marked with its arity.
Execution Model
• Letters are matched going forward by edges.
• Machine is “trapped” when a back-edge is entered.
• Can’t leave until obligation (value of back edges) fulfilled.
• Edge constraints fulfilled in LIFO order.
• Stack maintains current iterations.
Example
• (a,((a,b)^2|b))^2
[Figure: machine for the expression, with forward edges a, a, b and b, and two back edges of arity 2.]
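The idea of honouring a fixed exponent by counting iterations, instead of unrolling, can be sketched with a small recursive matcher. This is not the graph machine described above: it works on a tuple-based AST (a hypothetical encoding), and the Python call stack plays the role of the LIFO stack, so an inner exponent is always fulfilled before the enclosing one continues.

```python
def match(node, s, i=0):
    """Yield every index j such that node matches s[i:j]."""
    kind = node[0]
    if kind == "sym":
        if i < len(s) and s[i] == node[1]:
            yield i + 1
    elif kind == "seq":
        ends = {i}
        for part in node[1:]:
            ends = {j for e in ends for j in match(part, s, e)}
        yield from sorted(ends)
    elif kind == "alt":
        seen = set()
        for part in node[1:]:
            for j in match(part, s, i):
                if j not in seen:
                    seen.add(j)
                    yield j
    elif kind == "rep":
        # ("rep", child, count): exactly count passes, no early exit.
        _, child, count = node
        ends = {i}
        for _ in range(count):
            ends = {j for e in ends for j in match(child, s, e)}
        yield from sorted(ends)
    else:
        raise ValueError(f"unknown node kind: {kind!r}")


def accepts(node, s):
    return len(s) in match(node, s)


a, b = ("sym", "a"), ("sym", "b")
# (a,((a,b)^2|b))^2
expr = ("rep", ("seq", a, ("alt", ("rep", ("seq", a, b), 2), b)), 2)
```

The strings accepted are abab, aababab, abaabab and aababaabab; a single pass such as ab or aabab is rejected because the outer obligation of two iterations is not met.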
Subsumption Checking
• Start as usual.
• When entering head of a back edge, add entry to machine’s stack.
• When both reach repeated state:
– Tail of a back edge
– Previously seen in list of traversed states
• Determine if there is a matched component
• Maximally reduce exponents for matched edges
For Example
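• (a,(a,b,a,b)^6,b^3,c) <= (a,((a,b)^2|b)^9,c)
• Trace of state pairs (r, b), the letter read, and the stacks of r and b:

(r, b)   letter   (r, b)   r stack   b stack
(0,0)    a        (1,1)    []        []
(1,1)    a        (2,2)    [0]       [0,0]
(2,2)    b        (3,3)    [0]       [1,0]
(3,1)    a        (4,2)    [0]       [1,0]
(4,2)    b        (5,3)    [1]       [2,0]
(5,3)    X        (5,1)    []        [6]
(5,1)    b        (6,3)    [1]       []
(6,3)    X        (6,3)    []        []
(6,3)    c        (7,4)    []        []

[Figure: graphs of the restricted machine (edges a, a, b, a, b, b, c) and the base machine (back edges of arity 2 and 9).]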
Reducing Exponents
• Find cross-product back-edge (startr and startb)
• Get r and b (number iterations each)
• Get leftover (totalr – startr) = lr
• lr div r = quotr and remr, etc.
• newr = lr – (r * min(quotr, quotb)) +startr
Why So Complicated
• Compare (a,a,a)^7 and (a,a)^12
• Must go 3 rounds of (a,a) for 2 rounds of (a,a,a).
• lr = 7, lb = 12
• dr = 2, db = 3
• lr div dr = 3 rem 1, lb div db = 4 rem 0
• newr = 7 – (2*3) + 0 = 1, newb = 12 – (3*3) + 0 = 3
• Hence, max 6 rounds of (a, a, a) and 9 of (a, a).
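The arithmetic of this example fits in a few lines. The sketch below follows the formula from the previous slide; the function name reduce_exponents and the parameter names are hypothetical, and the start offsets default to 0 as in the example.

```python
def reduce_exponents(l_r, l_b, d_r, d_b, start_r=0, start_b=0):
    """Absorb as many matched rounds as both back edges allow.

    l_r, l_b: leftover iterations on the restricted and base back edges.
    d_r, d_b: iterations each side spends per matched round.
    Returns the iterations still outstanding on each side.
    """
    quot_r, _rem_r = divmod(l_r, d_r)
    quot_b, _rem_b = divmod(l_b, d_b)
    rounds = min(quot_r, quot_b)          # the side that runs out first decides
    new_r = l_r - d_r * rounds + start_r
    new_b = l_b - d_b * rounds + start_b
    return new_r, new_b


# (a,a,a)^7 against (a,a)^12: 2 rounds of (a,a,a) per 3 rounds of (a,a)
new_r, new_b = reduce_exponents(7, 12, 2, 3)
absorbed_r = 7 - new_r                    # rounds of (a,a,a) consumed
absorbed_b = 12 - new_b                   # rounds of (a,a) consumed
```

Both sides consume the same number of letters: 6 rounds of three a's and 9 rounds of two a's are 18 letters each.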
Generalized Exponents
• Must keep track of minimum and maximum possible transitions.
• Edges can contribute to both min or max.
• Can’t exit until max > min allowed.
• Must exit before min > max allowed.
So….
• Generate as few minr/b as possible.
– If they exceed maxr/b, you’re screwed
• Generate as many maxr/b as possible
– Means you can use a forward transition
– Use parsimoniously to maximize the amount matched
More Complex Machinery
• Back edge constraints have min and max.
• Some back edges increment just max value
• Back edges increment both min and max values.
• Max means maximum possible match.
• Min means minimum possible match.
Example
• ((a, b?){3, 5}, c)
[Figure: machine for the expression, with edges a, b and c and back edges labelled 3,5.]
Four Kinds of Pairs
• When hitting a min-edge/min-edge:
– Calculate min/min values (prev. algorithm with min exponents)
– Calculate max/max values (prev. algorithm with max exponents)
– Move forward when possible
– If min ever exceeds max, fail.
• When hitting a max-edge/max-edge:
– Calculate min/min values
– Calculate max/max values
– When max > min, you can progress (when leaving a cycle, set min to passing value)
– Else fail.
• Etc.
• After exiting loop, some iterations remain.
• As all “unabsorbed” transitions attempted, all possibilities tried.
• Given $(\ )\{m_b, n_b\}$
• And $(\ )\{m'_r, n'_r\}, (\ )\{m''_r, n''_r\}$
• Ensure $m'_r + m''_r > m_b$ and $n'_r + n''_r < n_b$
• If “rest of expression” matches longest and shortest (i.e., matched m or matched n) then will match all iterations.
• Matching longest will try all alternatives.
• Matching shortest will try fewest alternatives.
• As first sets repeat, UPA shows there must be optionality or iteration.
Nested Exponents
• ({m,n}{m’,n’}
• (a{m,n} | b){m’, n’}
• Edges in machine have multiple exponents.
• Depth of n makes 2(n-1) ranges
• Each must be tried
• Requires tracking scope.
• Requires lookahead.
Cost
• Without nesting, algorithm is exponential in number of exponents – each exponent requires testing min and max.
• With nesting, remains exponential, as this doesn’t affect the number of exponents.
• Still a huge improvement over unrolling.
Example
• ((a?,b{8,9}){2,3},c) > (a,(b,b){3,3},(b,b){6,6},c)
• First 6 b’s at level 2, remaining 12 iterate both levels
• At higher levels ranges overlap – need to check all possibilities
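[Figure: base machine with particles a1, b2 and c0 and back edges labelled {8,9}{2,3}; restricted machine with edges a, b, b, b, b, c and back edges of arity 3 and 6.]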
• ((a?,b{8,9}){2,9},c) > (a,(b,b){3,3},(b,b){6,6},c)
• 8*9=72, 9*8=72
• Need to check ending of 8 and start of 9
• Need lookahead to choose.
• Represented as ranges at all levels.
[Figure: base machine with particles a1, b2 and c0 and back edges labelled {8,9}{2,9}; second machine with edges a, b, b, b, b, a.]
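The ranges in these two examples can be tabulated directly. The sketch below covers only the counting side, under the assumption that k outer iterations of b{m,n} can produce any number of b's from k*m to k*n; it does not model scope tracking or lookahead. The function names are hypothetical.

```python
def nested_ranges(inner, outer):
    """For x{m,n}{m2,n2}, list (k, lowest count, highest count) per outer count k."""
    (m, n), (m2, n2) = inner, outer
    return [(k, k * m, k * n) for k in range(m2, n2 + 1)]


def outer_counts_for(total, inner, outer):
    """Outer iteration counts under which exactly `total` occurrences are possible."""
    return [k for k, lo, hi in nested_ranges(inner, outer) if lo <= total <= hi]


# b{8,9}{2,3}: the restricted model supplies 3*2 + 6*2 = 18 b's
small = nested_ranges((8, 9), (2, 3))
fits_18 = outer_counts_for(18, (8, 9), (2, 3))

# b{8,9}{2,9}: 8 rounds of 9 and 9 rounds of 8 both give 72
ambiguous_72 = outer_counts_for(72, (8, 9), (2, 9))
```

With {2,3} the ranges 16 to 18 and 24 to 27 are disjoint, so 18 b's pin the outer count to 2. With {2,9} the ranges for 8 and 9 outer iterations meet at 72, which is why the machine needs lookahead to choose between them.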
Conclusions
• Numeric exponents are hard to work with for subsumption.
• All-groups are not that difficult.
• Interaction will be even more annoying.
• Need to implement and test.