Using Histograms to Estimate Answer Sizes for XML Queries

Yuqing Wu a,∗, Jignesh M. Patel a,†, H. V. Jagadish a

a University of Michigan, Ann Arbor, USA

Estimating the sizes of query results, and intermediate results, is crucial to many aspects of query processing. In particular, it is necessary for effective query optimization. Even at the user level, predictions of the total result size can be valuable in “next-step” decisions, such as query refinement. This paper proposes a technique to obtain query result size estimates effectively in an XML database.

Queries in XML frequently specify structural patterns, requiring specific relationships between selected elements. Whereas traditional techniques can estimate the number of nodes (XML elements) that will satisfy a node-specific predicate in the query pattern, such estimates cannot easily be combined to provide estimates for the entire query pattern, since element occurrences are expected to have high correlation.

We propose a solution based on a novel histogram encoding of element occurrence position. With such position histograms, we are able to obtain estimates of sizes for complex pattern queries, as well as for simpler intermediate patterns that may be evaluated in alternative query plans, by means of a position histogram join (pH-join) algorithm that we introduce. We extend our technique to exploit schema information regarding allowable structure (the no-overlap property) through the use of a coverage histogram.

We present an extensive experimental evaluation using several XML data sets, both real and synthetic, with a variety of queries. Our results demonstrate that accurate and robust estimates can be achieved, with limited space, and at a minuscule computational cost. These techniques have been implemented in the context of the TIMBER native XML database [21] at the University of Michigan.

1. Introduction

XML data [2] is becoming ubiquitous, and an XML document (or database) is naturally modeled as a (collection of) node-labeled tree(s). In such a tree, each node represents an XML element, and each tree edge represents an element-subelement inclusion relationship.

A natural way to query such hierarchically organized data is by using small node-labeled trees, referred to as twigs, that match portions of the hierarchical data. Such queries form an integral component of query languages proposed for XML (for example, [4]), and for LDAP directories [11].

Example 1.1 The XQuery expression

    FOR $f IN document("personnel.xml")//department/faculty
    WHERE count($f/TA) > 0 AND count($f/RA) > 0
    RETURN $f

matches all faculty members that have at least one TA and one RA, in the example data set shown in Figure 1. This query can be represented as a node-labeled tree, with the element tags department and faculty as labels of non-leaf nodes in the tree, and the element tags TA and RA as labels of leaf nodes in the tree, as shown in Figure 2.

∗ H. V. Jagadish and Yuqing Wu were supported in part by NSF under grants IIS-9986030, DMI-0075447 and IIS-0208852.

† Jignesh M. Patel was supported in part by NSF under grant IIS-0208852, and by a gift donation from IBM.

A fundamental problem in this context is to accurately and quickly estimate the number of matches of a twig query pattern against the node-labeled data tree.

An obvious use is in the cost-based optimization of such queries: knowing the selectivity of various sub-queries can help in identifying cheap query evaluation plans.

[Figure 1. Example XML document: a department element whose children are faculty, staff, lecturer and research scientist elements, each with name, secretary, TA and RA sub-elements as applicable.]

[Figure 2. Example Pattern Tree: nodes Tag = department, Tag = faculty, Tag = RA and Tag = TA.]

Example 1.2 The query of Figure 2 can be evaluated by identifying all faculties with RAs, and joining this set with the set of departments, then joining the result of this with the set of all the TAs. An alternative query plan is to join the faculties and RAs first, and then join the result set with TAs, then departments. Depending on the cardinalities of the intermediate result sets, one plan may be substantially better than another.

Accurate estimates for the intermediate join result are essential if a query optimizer is to pick the optimal plan. Furthermore, if there are multiple join algorithms, the optimizer will require accurate estimates to enable it to choose the more efficient algorithm. Similar choices must be made whether the underlying implementation is a relational or a native XML database.

Result size estimation has additional uses in an Internet context. For instance, there may be value in providing users with quick feedback about expected result sizes before evaluating the full query result. Even when the query involved is an on-line query where only partial results are requested, it is helpful to provide an estimate of the total number of results to the user along with the first subset of results, to help the user choose whether to request more results of the same query or to refine the query. Similarly, result size estimation can be very useful when space allocation or parallelism is involved.

Histograms are by far the most popular summary data structures used for estimating query result sizes in (relational) databases. When used in the XML context, they could indeed be used to estimate accurately the number of nodes satisfying a specified node predicate. One could build a histogram for the predicate associated with each node in a query pattern, and obtain individual estimates for the number of occurrences of each. However, structural relationship information is not captured in traditional histograms, and it is not obvious how to combine estimates for individual nodes into estimates for the whole query tree pattern.

The central contribution of this paper is the introduction of position histograms to capture this structural information. A position histogram is built over “base” predicates, such as “element tag = faculty”. The position histograms on two base predicates, P1 and P2, can be used to accurately estimate the selectivity of queries with the pattern P1//P2, which matches all “P2” nodes that are descendants of “P1” nodes in the data tree. Some special features of predicates, such as the no-overlap property, which dramatically affects the selectivity of the pattern matching, are also considered. Even though these histograms are two-dimensional, they behave like one-dimensional histograms for many purposes, including in their storage requirements.

We formally define our problem in Section 2, and summarize our overall solution approach in Section 3. We also establish various properties of this new summary data structure, and show how to use it to obtain query result size estimates efficiently, in Section 3. Schemata often impose constraints on allowed structural relationships. In Section 4, we show how, at least in some key cases, such schema information can be exploited to obtain better estimates. Variants of the position histogram technique, the levelled histogram for parent-child matching and estimation with non-uniform histograms, are explored in Section 6 and Section 7. We experimentally demonstrate the value of our proposal in Section 5, considering not just the quality of the estimate, but also issues such as computation time and storage requirement. Related work is discussed in Section 8. Conclusions and directions for future work are outlined in Section 9.

2. Problem Definition

We are given a large rooted node-labeled tree T = (VT, ET), representing the database. We are given a set of boolean predicates, P : {v : v ∈ VT} → {0, 1}. For each predicate α ∈ P and each node v ∈ T, either α(v) is true or α(v) is false. (See Section 3.6 for a discussion of how to obtain this set P for a real database.)

A query is a smaller, rooted, node-labeled tree Q = (VQ, EQ). The goal is to determine the number of “matches” of Q in T. The labels at the nodes of Q are boolean compositions of predicates from P.

A match of a pattern query Q in T is a total mapping h : {u : u ∈ Q} → {x : x ∈ T} such that:

• For each node u ∈ Q, the predicate node label of u is satisfied by h(u) in T.

• For each edge (u, v) in Q, h(v) is a descendant of h(u) in T.
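To make the definition of a match concrete, the following Python sketch counts matches by brute force over a tiny hand-labeled tree. The (start, end) interval labels, the helper names `is_descendant`, `count_matches` and `tag`, and the sample data are illustrative assumptions rather than part of the paper. The enumeration is exponential in the query size and is meant only to pin down the quantity that the rest of the paper estimates.

```python
from itertools import product

def is_descendant(v, u):
    """True if data node v lies strictly inside data node u.
    Nodes are (tag, start, end) triples with hypothetical interval labels."""
    return u[1] < v[1] and v[2] <= u[2]

def count_matches(data, query_nodes, query_edges):
    """Count total mappings h from query nodes to data nodes such that
    every node predicate holds and every query edge (u, v) maps to an
    ancestor-descendant pair (h(u), h(v))."""
    names = list(query_nodes)
    candidates = [[n for n in data if query_nodes[q](n)] for q in names]
    total = 0
    for combo in product(*candidates):
        h = dict(zip(names, combo))
        if all(is_descendant(h[v], h[u]) for u, v in query_edges):
            total += 1
    return total

def tag(t):
    """Base predicate 'element tag = t'."""
    return lambda node: node[0] == t

# Hypothetical data: one department, two faculty members.
data = [
    ("department", 1, 10),
    ("faculty", 2, 5), ("TA", 3, 3), ("RA", 4, 4),
    ("faculty", 6, 9), ("TA", 7, 7), ("TA", 8, 8),
]

faculty_ta = count_matches(
    data, {"f": tag("faculty"), "t": tag("TA")}, [("f", "t")])
twig = count_matches(
    data,
    {"d": tag("department"), "f": tag("faculty"),
     "t": tag("TA"), "r": tag("RA")},
    [("d", "f"), ("f", "t"), ("f", "r")])
```

On this sample, the two-node pattern faculty//TA has three matches, while the four-node twig has one, since only the first faculty member has an RA.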

Figure 1 shows a very simple XML document. The personnel of a department can be faculty, staff, lecturer or research scientist. Each of them has a name as identification. They may or may not have a secretary. Each faculty may have both TAs and RAs. A lecturer can have more than one TA, but no RA. A research scientist can have numerous RAs, but no TA. Consider a simple twig pattern with only two nodes, faculty and TA, with a parent-child relationship between them. There are three faculty nodes and five TA nodes in the XML document. The schema says that a faculty can have any number of TAs. Without any further schema information, the best we can do in estimating the result size is to compute the product of the cardinalities of these two nodes, which yields 15. Considering the fact that faculty nodes are not nested, so that one TA can only be the child of one faculty node, we can tell that the upper bound of the result size is the cardinality of TA nodes, which is 5. But as we can see from the figure, the real result size is 2. The question we address in this paper is how to capture the structure information of the XML document to get a better estimate. Our problem can be stated succinctly as follows:

    Define a summary data structure T′ corresponding to a node-labeled data tree T, and a set of primitive predicates of interest P, such that the size of T′ is a small percentage of the size of T; and for any query Q, defined as a structural pattern of nodes satisfying combinations of predicates from P, correctly estimate the total number of matches of Q in T, using only Q and the summary data structure T′.

3. Our Proposal

3.1. The Basic Idea

We associate a numeric start and end label with each node in the database, defining a corresponding interval between these labels. We require that a descendant node has an interval that is strictly included in its ancestors’ intervals.

This numbering scheme is inspired by, and quite similar to, the node numbering based on document position frequently used in information retrieval and adopted for XML database use by University of Wisconsin researchers in the course of the Niagara [17] project.

We obtain these labels as follows. First, we merge all documents in the database into a single mega-tree with a dummy element as the root, and each document as a child subtree. We number nodes in this tree to obtain the desired labels: the start label is assigned by a pre-order numbering, and the end label of a node is assigned to be at least as large as its own start label and larger than the end label of any of its descendants.
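The labeling step can be sketched in Python as follows. The text fixes only the constraints on the labels, so the specific choice made here (a single counter, with a leaf receiving end = start and a non-leaf node consuming one further counter value when it is closed) is an assumption; it was picked because it is consistent with label pairs such as faculty (1,4) and staff (5,7). The class `Node` and the functions `label` and `is_ancestor` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    start: int = -1
    end: int = -1

def label(root, first=1):
    """Assign (start, end) labels in one depth-first traversal.
    Assumed scheme: start is taken from a running counter on entry;
    a leaf gets end = start; a non-leaf node takes the next counter
    value as its end label after all its descendants are labeled."""
    counter = first

    def visit(node):
        nonlocal counter
        node.start = counter
        counter += 1
        if not node.children:
            node.end = node.start
            return
        for child in node.children:
            visit(child)
        node.end = counter
        counter += 1

    visit(root)
    return root

def is_ancestor(u, v):
    """u is an ancestor of v iff u's interval strictly contains v's start
    and reaches at least v's end."""
    return u.start < v.start and u.end >= v.end

# Hypothetical fragment: a faculty element with two leaf children.
faculty = label(Node("faculty", [Node("name"), Node("RA")]))
```

With this scheme the fragment is labeled faculty (1,4), name (2,2), RA (3,3), so every descendant interval is strictly included in its ancestor's interval. The recursion is adequate for a sketch; a very deep document would call for an explicit stack.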

[Figure 3. Example XML Data with Numbering Scheme: each element tag is followed by its (start, end) labels, for example department (1,36), faculty (1,4), staff (5,7) and faculty (30,35), with leaf elements labeled (2,2), (3,3), (6,6), (31,31), (32,32), (33,33) and (34,34).]

Example 3.1 Figure 3 shows the same example XML data as shown in Figure 1. There are two numbers in the parentheses following each element tag. The first one is the start label and the second one is the end label. We can see that for all nodes, the end label is greater than or equal to the start label. For leaf nodes, the two labels have the same value.

Given a limited set P of predicates of interest, one should expect that there will be index structures that identify lists of nodes satisfying each predicate in P. For many, even most, predicates, these lists can be very long. While queries may be answered through manipulating such lists, the effort involved is far too great for an answer size estimation task. The standard data structure for maintaining summary data in a database is a histogram. We compress each such list into a two-dimensional histogram summary data structure, as we describe next.

We take the pairs of start and end values associated with the nodes that satisfy a predicate α, and construct a two-dimensional histogram Histα with them. Each grid cell in the histogram represents a range of start position values and a range of end position values. The histogram Histα maintains a count of the number of nodes satisfying α that have start and end positions within the specified ranges. We call such a data structure a position histogram.
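A position histogram can be built in a single pass over the list of (start, end) pairs. The sketch below assumes equi-width buckets over positions 1..max_pos, with the same boundaries on both axes; the function name `position_histogram`, the sparse dictionary representation, and the sample labels are assumptions made for illustration.

```python
from collections import Counter

def position_histogram(intervals, g, max_pos):
    """Build a sparse g x g position histogram.
    intervals: (start, end) pairs with 1 <= start <= end <= max_pos.
    Returns a Counter mapping (start_bucket, end_bucket) -> node count.
    Equi-width buckets are an assumption of this sketch."""
    def bucket(pos):
        return (pos - 1) * g // max_pos

    hist = Counter()
    for start, end in intervals:
        hist[(bucket(start), bucket(end))] += 1
    return hist

# Hypothetical labels for a document whose largest end label is 10.
hist_faculty = position_histogram([(2, 5), (6, 9)], g=2, max_pos=10)
hist_ta = position_histogram([(3, 3), (7, 7), (8, 8)], g=2, max_pos=10)
hist_dept = position_histogram([(1, 10)], g=2, max_pos=10)
```

Only non-zero cells are stored, which matches the observation later in the paper that few cells of the grid can be populated.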

Position histograms, even though defined over a two-dimensional space, have considerable structure, as shown in Figure 4.

Since the start position and end position of a node always satisfy start ≤ end, none of the nodes can fall into the area below the diagonal of the matrix. So, only the grid cells on or to the upper left of the diagonal can have a count of more than zero.

[Figure 4. Forbidden Regions in a Position Histogram Due to One Node. Axes: Start Position (horizontal, with marks X1, X2 and Max(X)) and End Position (vertical, with marks y1, y2 and Max(y)); a node A at (x, y) and its two forbidden regions, I and II.]

Given a point A with coordinates (x, y), the regions marked I and II are guaranteed to be empty, since the start and end ranges of any two nodes either have no overlap, or the range of one node is fully contained within the range of the other node. This leads to the following lemma:

Lemma 3.1 In a position histogram for any predicate, a non-zero count in grid cell (i, j) implies a zero count in each grid cell (k, l) with (a) i < k < j and j < l, or (b) i < l < j and k < i.

Proof: Assume grid cell (i, j) is non-zero. Then, there is at least one point (x, y) in this grid cell, such that x[i − 1] < x < x[i] and y[j − 1] < y < y[j], where x[i − 1] and x[i] are the boundaries of the grid cell (i, j) on the x axis, and y[j − 1] and y[j] are its boundaries on the y axis.

The diagonal line is y = x, so, in Figure 4, we have x1 = y1 = x and x2 = y2 = y.

Assume that a grid cell (k, l) that satisfies condition (a) is non-zero. Then there is at least one point (x′, y′) in this grid cell such that x1 < x[k − 1] < x′ < x[k] < x2, and y2 < y[l − 1] < y′ < y[l]. With condition (a), we have x = x1 < x[k − 1] < x′ < x[k] < x2 = y < y[l − 1] < y′, and hence x < x′ < y < y′.

In a well-formed XML document, the intervals defined by the start position and end position of nodes cannot interleave with each other. So, an XML document with nodes at both points (x, y) and (x′, y′) is not well-formed. Condition (b) follows by the symmetric argument.
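Lemma 3.1 can be turned into a simple consistency check on the set of non-zero cells of a position histogram. The following Python sketch is one way to do this; the function name `lemma_violations` and the sample cell sets are hypothetical, and the quadratic scan is chosen for clarity, not efficiency.

```python
def lemma_violations(cells):
    """Return pairs of non-zero cells that contradict Lemma 3.1.
    cells: iterable of (i, j) with i the start bucket and j the end bucket.
    A pair ((i, j), (k, l)) is reported when (a) i < k < j and j < l,
    or (b) i < l < j and k < i."""
    cells = set(cells)
    bad = []
    for (i, j) in cells:
        for (k, l) in cells:
            if (i < k < j and j < l) or (i < l < j and k < i):
                bad.append(((i, j), (k, l)))
    return bad

# Properly nested intervals produce no violation ...
nested = lemma_violations({(0, 5), (1, 3), (2, 2)})
# ... whereas interleaving intervals do.
interleaved = lemma_violations({(0, 5), (3, 9)})
```

A non-empty result indicates that the histogram could not have come from a well-formed document under the numbering scheme of Section 3.1.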

3.2. Primitive Estimation Algorithm

Each document node is mapped to a point in two-dimensional space (in each position histogram corresponding to a predicate satisfied at the node). Node u is an ancestor of node v iff the start position of u is less than the start position of v and the end position of u is no less than the end position of v. In other words, u is to the left of and above every node v that it is an ancestor of, and vice versa.

[Figure 5. Layout of Position Histogram. Axes: Start Position (horizontal) and End Position (vertical); grid cell A (R0) surrounded by regions R1 through R8, with empty grid cells below the diagonal.]

Consider the grid cell labeled A in Figure 5. There are nine regions in the plane to consider, marked A (R0) and R1 through R8 in the figure. All points v in region R2 are descendants of each point u in the grid cell A. All points v in region R6 are ancestors of each point u in grid cell A. No point in regions R4 and R8 is a descendant or ancestor of any point in the grid cell A. Points in regions R1 and R3 may be descendants of points in grid cell A. Similarly, points in regions R5 and R7 may be ancestors of points in grid cell A. To estimate how many, we exclude the forbidden region, and then assume a uniform distribution over the remainder of each grid cell. For this purpose, we overlap Figure 4 with Figure 5 to get Figure 6, assuming that the forbidden ranges are based on node (x, y), which is the bottom-right node in grid cell A.

[Figure 6. Estimating Join Counts with Position Histogram. Axes: Start Position (X) and End Position (Y); grid cell A and the surrounding regions B, C, D, E, F, G, H and I.]

Given predicates P1 and P2, both in P, we show how to estimate the number of pairs of nodes u, v in the database such that u satisfies P1, v satisfies P2 and u is an ancestor of v, using two position histograms, one for predicate P1 and one for predicate P2.

When computing the estimate of a join, we can compute the estimate based on either the ancestor or the descendant. When computing an ancestor-based estimate, for each grid cell of the ancestor we estimate the number of descendants that join with the ancestor grid cell. Similarly, for the descendant-based estimate, for each grid cell of the descendant we estimate the number of ancestors that join with the grid cell.

Primitive Estimation: ancestor-based

For off-diagonal grid cell A:

$$Est_{P_{12}}[A] = Hist_{P_1}[A] \times \left\{ \tfrac{1}{4} \times Hist_{P_2}[A] + Hist_{P_2}[B] + Hist_{P_2}[C] + Hist_{P_2}[E] + \tfrac{1}{2} \times \left( Hist_{P_2}[D] + Hist_{P_2}[F] \right) \right\}$$

For on-diagonal grid cell A:

$$Est_{P_{12}}[A] = \tfrac{1}{12} \times Hist_{P_1}[A] \times Hist_{P_2}[A]$$

Primitive Estimation: descendant-based

For off-diagonal grid cell A:

$$Est_{P_{12}}[A] = Hist_{P_2}[A] \times \left\{ Hist_{P_1}[G] + Hist_{P_1}[H] + Hist_{P_1}[I] + \tfrac{1}{4} \times Hist_{P_1}[A] \right\}$$

For on-diagonal grid cell A:

$$Est_{P_{12}}[A] = Hist_{P_2}[A] \times \left\{ Hist_{P_1}[G] + Hist_{P_1}[H] + Hist_{P_1}[I] + \tfrac{1}{12} \times Hist_{P_1}[A] \right\}$$

Notation:

$Hist_P$: position histogram for predicate P.

$Est_{P_{12}}$: estimation histogram of a twig pattern, where the ancestor satisfies P1 and the descendant satisfies P2.

$H[A]$: summation of the grid cells in region A in histogram H.

Figure 7. Formulae for Primitive Join Estimation

The formulae for these two types of estimation are different, and are derived in the next two subsections. But first, we need the following definition:

Definition 3.1 A grid cell in a position histogram is said to be on-diagonal if the intersection of the start-position interval (X-axis) and end-position interval (Y-axis) is non-empty. Otherwise, the grid cell is said to be off-diagonal.

3.2.1. Ancestor-Based Join Estimation

If A is off-diagonal, as shown in Figure 6, all points in the grid cells in region B are descendants of all points in grid cell A. Using the position histogram for predicate P2, we can simply add up the counts of all grid cells in this region. Now consider region E. Each point in grid cell A introduces two forbidden regions. No points in region E can fall in the forbidden regions of the right-most point in A (as shown in Figure 6), so all points in region E must be descendants of all points in grid cell A. Similarly, for a given point in grid cell A, part of region F is forbidden; the points that fall in the right triangle of F are descendants of A, and the points in the left triangle are not. Integrating over the points in region F, we estimate that half the points in F, on average, are descendants of any specific point in grid cell A. Similar discussions apply to regions C and D. For the points in the same grid cell (grid cell A) in the histogram for predicate P2, for each point in grid cell A of the histogram for the predicate P1, only the points in the bottom-right region can be descendants. Assuming a uniform distribution and performing the necessary integrals in each dimension, we derive on average a quarter chance (see the following proof). Putting all these estimates together, the ancestor-based estimation for each off-diagonal grid cell can be expressed as the first formula in Figure 7.

When grid cell A is on-diagonal, regions B, C, D, E and F do not exist. Since a diagonal grid cell is a triangle rather than a rectangle, the chance that a descendant point can join with an ancestor point is 1/12 (see the following proof).
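The ancestor-based formulae can be sketched as a direct nested loop over the non-zero cells of the two histograms. In the Python sketch below, the mapping from the lettered regions of Figure 6 to cell indices is an assumption inferred from Algorithm pH-Join: the two diagonal cells that share a column or a row with an off-diagonal cell A receive weight 1/2, the cell A itself receives 1/4 (or 1/12 when A is on-diagonal), and every other cell enclosed by A receives weight 1. The names `weight` and `ancestor_based_estimate` and the sample counts are hypothetical.

```python
def weight(anc, desc):
    """Fraction of (ancestor, descendant) point pairs assumed to join,
    for an ancestor cell anc = (i, j) and a descendant cell desc = (k, l),
    where the first index is the start bucket and the second the end bucket."""
    i, j = anc
    k, l = desc
    if k < i or l > j:
        return 0.0                      # descendant cell not enclosed by A
    if (k, l) == (i, j):
        return 1 / 12 if i == j else 1 / 4
    if k == l and (k == i or k == j):
        return 0.5                      # diagonal cells in A's column or row
    return 1.0

def ancestor_based_estimate(hist_p1, hist_p2):
    """Sum, over ancestor cells, of count times weighted descendant counts.
    Histograms are dicts mapping (i, j) -> count."""
    total = 0.0
    for a, count_a in hist_p1.items():
        total += count_a * sum(weight(a, d) * count_d
                               for d, count_d in hist_p2.items())
    return total

# Hypothetical 2 x 2 histograms.
off_diag = ancestor_based_estimate(
    {(0, 1): 1}, {(0, 0): 2, (0, 1): 4, (1, 1): 6})
on_diag = ancestor_based_estimate({(0, 0): 2}, {(0, 0): 6})
```

For the off-diagonal example the estimate is 1 × (4/4 + 2/2 + 6/2) = 5, and for the on-diagonal example it is 2 × 6/12 = 1.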

3.2.2. Descendant-Based Join Estimation

Referring to Figure 6, no matter whether A is on-diagonal or off-diagonal, all ancestors of a point in the grid cell A will be in the regions A, G, H or I. Following arguments similar to those in the ancestor-based estimation above, all points in regions G, H and I are guaranteed to be ancestors of all points in grid cell A. For the points in the same grid cell (grid cell A), the chance is 1/4 for an off-diagonal grid cell, while it is 1/12 for an on-diagonal grid cell.

Proof: Assume that P1 is the predicate associated with the ancestor node, and P2 is the predicate associated with the descendant node. H1 and H2 are the position histograms built on P1 and P2 respectively.

Take a grid cell A in H1, and the grid cell at the same position in H2. Let us denote the node count of the grid cell in H1 by A1, and that in H2 by A2. We estimate how many node pairs (Est) are generated from the ancestor-descendant join when both the ancestor and the descendant fall in grid cell A.

Assume that the boundary of grid cell A is (x1, x2, y1, y2).

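• when A is not on-diagonal:

$$Est = \int_{x_1}^{x_2} \int_{y_1}^{y_2} (x_2 - x)(y_2 - y)\,dy\,dx = \frac{1}{4} \times (x_2 - x_1)(y_2 - y_1) \times (x_2 - x_1)(y_2 - y_1)$$

• when A is on-diagonal:

$$Est = \int_{x_1}^{x_2} \int_{y_1}^{y_2} \frac{1}{2} (y - x)^2\,dy\,dx = \frac{1}{12} \times \frac{1}{2}(x_2 - x_1)(y_2 - y_1) \times \frac{1}{2}(x_2 - x_1)(y_2 - y_1)$$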
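For the off-diagonal case, the factor of 1/4 can be checked by carrying out the two one-dimensional integrals separately. The following worked step uses the same cell boundaries (x1, x2, y1, y2) and adds nothing beyond elementary calculus:

```latex
\int_{x_1}^{x_2} (x_2 - x)\,dx = \frac{(x_2 - x_1)^2}{2},
\qquad
\int_{y_1}^{y_2} (y_2 - y)\,dy = \frac{(y_2 - y_1)^2}{2},
\]
\[
\int_{x_1}^{x_2}\!\int_{y_1}^{y_2} (x_2 - x)(y_2 - y)\,dy\,dx
  = \frac{(x_2 - x_1)^2}{2} \cdot \frac{(y_2 - y_1)^2}{2}
  = \frac{1}{4}\,\bigl[(x_2 - x_1)(y_2 - y_1)\bigr]^2 .
```

The squared term is the product of the areas of the two copies of the grid cell, so the ratio of joining pairs to all pairs within an off-diagonal cell is 1/4.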

[Figure 8. Example Position Histograms: 2 × 2 position histograms for the predicates faculty (cell counts 0, 2 and 1) and TA (cell counts 0, 3 and 2).]

Example 3.2 Let us look at the example XML document in Figure 1 again, with the query pattern we discussed in Section 2. The 2 × 2 histograms of predicates “element tag = faculty” and “element tag = TA” are shown in Figure 8. Using the primitive estimation algorithm introduced above, we estimate the result size to be 0.6, much closer to the real result size of 2. Note that the position histograms we used here are 2 × 2. By refining the histogram to use more buckets, we can get a more accurate estimate.

3.3. Analysis

The primary concern with any estimation technique, of course, is how good the estimates are. We will perform an extensive evaluation in Section 5. However, there are two other metrics of concern: the storage required and the time to compute the estimate. We address these issues next.

3.3.1. Storage Requirement

There can only be O(g) non-zero grid cells in a g × g grid, unlike the O(g²) one might expect in general. Therefore the storage requirements are quite modest. We establish this result as a theorem here, and verify it experimentally in Section 5.

Lemma 3.2 In a g × g position histogram, there are at most g 2-off-diagonal non-zero grid cells.

Proof: Given a g × g histogram, $H_g$, we call the grid cell (i, j) “2-off-diagonal” if j − i ≥ 2. These are the grid cells that are neither on the diagonal nor at distance 1 from the diagonal. Let $N_{2off}(H_g)$ denote the number of non-zero grid cells amongst all 2-off-diagonal cells in histogram $H_g$. We prove the lemma by induction on the grid size n.

• Base case:

When n = 1 or 2, there are no 2-off-diagonal cells: $N_{2off}(H_1) = N_{2off}(H_2) = 0$.

When n = 3, there is one 2-off-diagonal cell and it can be non-zero: $N_{2off}(H_3) = 1 < 3 = n$.

• Inductive case:

Assume that the lemma holds for all grid sizes up to n × n. For an (n + 1) × (n + 1) histogram:

– If there are no non-zero 2-off-diagonal cells, $N_{2off}(H_{n+1}) = 0 < n + 1$.

– If there are m non-zero 2-off-diagonal cells $(i_1, j_1), \ldots, (i_m, j_m)$, pick the grid cell $(i_a, j_a)$ such that $i_a + j_a = \min_k (i_k + j_k)$, k = 1..m.

As shown in Figure 6, the whole histogram can be divided into six ranges. For the two forbidden regions, all grid cells that fall into these ranges have zero count. Due to the way $(i_a, j_a)$ is chosen, all the grid cells in the rectangle to the upper left of the two forbidden regions have zero count. The three triangles are of size $i_a$, $j_a - i_a + 1$, and $n + 1 - j_a + 1$, respectively. Each of them can be regarded as a smaller histogram itself. The non-zero grid cell $(i_a, j_a)$ is included in the second small histogram. So,

$$\begin{aligned}
N_{2off}(H_{n+1}) &= N_{2off}(H_{i_a}) + N_{2off}(H_{j_a - i_a + 1}) + N_{2off}(H_{(n+1) - j_a + 1}) \\
&\le (i_a - 1) + (j_a - i_a + 1 - 1) + ((n + 1) - j_a + 1 - 1) \\
&= i_a - 1 + j_a - i_a + (n + 1) - j_a \\
&= n < n + 1
\end{aligned}$$

Theorem 3.1 In a g × g grid, the number of position histogram grid cells with non-zero counts is O(g).

Proof: By Lemma 3.2, in a g × g position histogram, the number of 2-off-diagonal non-zero grid cells is at most g.

For a g × g position histogram, in the worst case, when all the grid cells that are not 2-off-diagonal are non-zero, the total number of non-zero grid cells is at most 3g.
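The bound of Theorem 3.1 can be observed empirically. The Python sketch below labels a randomly generated tree, buckets the labels into a g × g grid, and counts the distinct non-zero cells. The random tree generator, the labeling rule (a leaf gets end = start, a non-leaf node takes one further counter value on exit), the equi-width bucketing, and the function names are all assumptions made for illustration; this is a sanity check, not the experimental evaluation of Section 5.

```python
import random

def random_labeled_tree(n_nodes, seed=0):
    """Return a list of (start, end) labels for a random rooted tree."""
    rng = random.Random(seed)
    children = [[] for _ in range(n_nodes)]
    for c in range(1, n_nodes):
        children[rng.randrange(c)].append(c)   # parent has a smaller id
    labels = [None] * n_nodes
    start = [0] * n_nodes
    counter = 0
    stack = [(0, False)]                       # (node, children already done)
    while stack:
        node, done = stack.pop()
        counter += 1
        if done:
            labels[node] = (start[node], counter)
        elif children[node]:
            start[node] = counter
            stack.append((node, True))
            stack.extend((c, False) for c in reversed(children[node]))
        else:
            labels[node] = (counter, counter)  # leaf: end = start
    return labels

def nonzero_cells(labels, g):
    """Set of (start_bucket, end_bucket) cells hit by at least one node."""
    max_pos = max(end for _, end in labels)
    def bucket(pos):
        return (pos - 1) * g // max_pos
    return {(bucket(s), bucket(e)) for s, e in labels}

labels = random_labeled_tree(500, seed=1)
counts = {g: len(nonzero_cells(labels, g)) for g in (4, 8, 16)}
```

For each grid size g, the number of populated cells stays within the 3g bound stated in the proof, far below the g(g + 1)/2 cells on or above the diagonal.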

3.3.2. Time Required

Based on the formulae for both ancestor-based estimation and descendant-based estimation, the procedure to compute the expected result size for a simple 2-node pattern is to loop through all grid cells for counts of nodes satisfying the outer predicate, and for each grid cell loop through the histogram for the inner predicate, adding up the regions as described in the preceding section, and multiplying by the count of the outer grid cell. The grand summation of all these is the desired result. We have a choice of which of the two nodes in the pattern is the inner loop, and the other is the outer.

The summation work in the inner loop is repeated several times in the simple nested loop algorithm outlined above. A small amount of storage for intermediate results can result in the much more efficient algorithm shown in Figure 9.

Algorithm pH-Join is a three-pass algorithm. In the first pass, column partial summations (as in region E and the columns in region B in Figure 6) are obtained. In the second pass, row partial summations (as in region C), as well as region partial summations (as in region B, using column partial summations), are obtained. In the third pass, these partial summations are used, along with the matrix entries themselves, to obtain the necessary multiplicative coefficients derived from the inner matrix operand; these are multiplied by the corresponding elements of the outer operand matrix and the summation is taken.

Algorithm pH-Join, as stated, computes coefficients assuming that the inner operand is the descendant node in the pattern. Obvious minor changes are required if the inner operand is the ancestor node.

Observe also that all of Algorithm pH-Join, except for the final multiplication, deals with the histogram of only one predicate in the join operation. In consequence, it is possible to run the algorithm on each position histogram matrix in advance, pre-computing the multiplicative coefficients at each grid cell. The additional storage required is approximately equal to that of the original position histogram. So such pre-computation may provide a useful space-time tradeoff in some situations. In any event, the time required for the computation is simply O(g) for a g × g grid.
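The three passes described above can be sketched in Python as follows. This is a transliteration of Algorithm pH-Join under stated assumptions: histograms are dense g × g lists indexed as hist[start_bucket][end_bucket], the inner operand is the descendant, and the function name `ph_join` is hypothetical. Because it uses dense lists for readability, the sketch visits all cells on or above the diagonal and does not attempt the sparse representation that the O(g) bound relies on.

```python
def ph_join(hist_a, hist_b, g):
    """Estimate the size of A // B from two g x g position histograms.
    hist_a: outer operand (ancestor), hist_b: inner operand (descendant)."""
    down = [[0.0] * g for _ in range(g)]    # column partial summations
    right = [[0.0] * g for _ in range(g)]   # row partial summations
    desc = [[0.0] * g for _ in range(g)]    # region partial summations

    # Pass 1: sum of cells (i, i) .. (i, j-1) in the column of (i, j).
    for i in range(g):
        for j in range(i + 1, g):
            down[i][j] = hist_b[i][j - 1] + down[i][j - 1]

    # Pass 2: sum of cells (i+1, j) .. (j, j) in the row of (i, j), and the
    # sum of the column partial sums of those cells.
    for j in range(g - 1, -1, -1):
        for i in range(j - 1, -1, -1):
            right[i][j] = hist_b[i + 1][j] + right[i + 1][j]
            desc[i][j] = down[i + 1][j] + desc[i + 1][j]

    # Pass 3: per-cell coefficient from the inner operand, multiplied by
    # the outer operand's count.
    total = 0.0
    for i in range(g):
        for j in range(i, g):
            if i == j:
                coeff = hist_b[i][j] / 12
            else:
                coeff = (desc[i][j] + hist_b[i][j] / 4
                         + down[i][j] - hist_b[i][i] / 2
                         + right[i][j] - hist_b[j][j] / 2)
            total += hist_a[i][j] * coeff
    return total

# Hypothetical operands.
est_2x2 = ph_join([[0, 1], [0, 0]], [[2, 4], [0, 6]], 2)
est_3x3 = ph_join([[0, 0, 1], [0, 0, 0], [0, 0, 0]],
                  [[0, 0, 0], [0, 3, 0], [0, 0, 0]], 3)
```

In the 2 × 2 example the single ancestor in cell (0,1) contributes 4/4 + 2/2 + 6/2 = 5; in the 3 × 3 example the three descendants in the enclosed diagonal cell (1,1) are all counted, giving 3. The coefficient computed in pass 3 depends only on the inner operand, which is what makes the pre-computation described above possible.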

    Algorithm pH-Join (histA, histB)
    // Inputs: two position histograms, histA (outer operand) and histB (inner operand)
    // Output: estimate of the answer size of A join B

    total = 0;

    // Pass 1: column partial summations
    for (i = 0; i < grid_size; i++)
      for (j = i; j < grid_size; j++) {
        pSum[i][j].self = histB[i][j];
        if (j == i)          pSum[i][j].down = 0;
        else if (j == i + 1) pSum[i][j].down = pSum[i][j-1].self;
        else                 pSum[i][j].down = pSum[i][j-1].self + pSum[i][j-1].down;
      }

    // Pass 2: row and region partial summations
    for (j = grid_size - 1; j >= 0; j--)
      for (i = j; i >= 0; i--) {
        if (i == j) {
          pSum[i][j].right = 0;
          pSum[i][j].desc = 0;
        } else if (i == j - 1) {
          pSum[i][j].right = pSum[i+1][j].self;   // row summation
          pSum[i][j].desc = pSum[i+1][j].down;    // region summation
        } else {
          pSum[i][j].right = pSum[i+1][j].self + pSum[i+1][j].right;
          pSum[i][j].desc = pSum[i+1][j].down + pSum[i+1][j].desc;
        }
      }

    // Pass 3: multiply coefficients by the outer operand and accumulate
    for (i = 0; i < grid_size; i++)
      for (j = i; j < grid_size; j++) {
        if (i == j) rHist[i][j] = histA[i][j] * pSum[i][j].self / 12;
        else rHist[i][j] = histA[i][j] * (pSum[i][j].desc
                 + pSum[i][j].self / 4 + pSum[i][j].down - pSum[i][i].self / 2
                 + pSum[i][j].right - pSum[j][j].self / 2);
        total += rHist[i][j];
      }

    output(total);

Figure 9. Algorithm pH-Join for Computing the Join Estimate

3.4. Complex Patterns

Thus far we have dealt with a simple query pattern involving a single ancestor-descendant pair. In general, of course, one can have query patterns that are much more complex. The technique presented above can be adapted for this purpose using the intermediate estimates, in the form of histograms. The basic idea is to construct the complex query pattern as a sequence of nodes added to a simple pattern. In other words, one can choose any one edge in the complex query pattern, estimate how many times that simple query pattern occurs, and then use the result to estimate how many times a slightly less simple query pattern, with two edges and three nodes, occurs, and so on, adding one node at a time.

The only catch in being able to do this correctly is that the final summation to compute the total estimate should not be carried out. Instead, one should obtain a grid cell by grid cell estimate for the simpler query pattern, in effect constructing a histogram for the occurrence of this pattern, and then use this histogram in the next step. The question that arises is what the start and end positions of a pattern are, since these are concepts defined for a node, or an element in the XML document. The answer is that the positions we choose are exactly the positions at which a selected distinguished node in the pattern, which is the node used in the following estimation, occurs. It is to this distinguished node that the next node added to the pattern, in the sequential expansion of the pattern, must join.
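The step-by-step expansion can be sketched in Python as follows. The function `join` keeps a per-cell estimate positioned at a chosen distinguished node instead of summing, so that its output can serve as the input histogram of the next join. The per-cell weights are an assumption inferred from Figure 7 and Algorithm pH-Join, and the names `weight` and `join`, as well as the sample counts, are hypothetical.

```python
def weight(anc, desc):
    """Assumed per-cell join fraction (see Figure 7 and Algorithm pH-Join)."""
    i, j = anc
    k, l = desc
    if k < i or l > j:
        return 0.0
    if (k, l) == (i, j):
        return 1 / 12 if i == j else 1 / 4
    if k == l and (k == i or k == j):
        return 0.5
    return 1.0

def join(anc_hist, desc_hist, keep):
    """Estimate histogram of the pattern anc // desc.
    keep: 'ancestor' or 'descendant', the distinguished node whose
    cell positions index the result. No final summation is done."""
    out = {}
    for a, count_a in anc_hist.items():
        for d, count_d in desc_hist.items():
            w = weight(a, d)
            if w:
                key = a if keep == "ancestor" else d
                out[key] = out.get(key, 0.0) + count_a * count_d * w
    return out

# Hypothetical 2 x 2 histograms for department // faculty // TA.
dept = {(0, 1): 1}
fac = {(0, 0): 2, (1, 1): 1}
ta = {(0, 0): 2, (1, 1): 3}

# Plan 1: (department // faculty) positioned at faculty, then join TA.
plan1 = sum(join(join(dept, fac, "descendant"), ta, "ancestor").values())
# Plan 2: (faculty // TA) positioned at faculty, then join department.
plan2 = sum(join(dept, join(fac, ta, "ancestor"), "descendant").values())
```

Both plans position the intermediate histogram at the faculty node, because that is the node to which the remaining node of the pattern must join, and both produce the same total for this input.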

There are many different ways in which a complex pattern can be decomposed, leading to different estimation formulae. However, careful algebraic manipulations can be used to establish the following reassuring theorem:

Theorem 3.2 The count estimate for any pattern of predicates is independent of the join order for algorithm pH-Join.

Proof: Let us denote a tree pattern as (V, E), where V is the set of nodes in the pattern and E is the set of edges. Each element in E is represented by a pair of nodes in V. $Hist_v$ is the position histogram of node v (in V). Let Est(V, E) be the count estimate of the pattern (V, E). We prove that for any pattern (V, E), the estimate Est(V, E) obtained in any order can be written as one formula.

Randomly choose one grid cell $(i_k, j_k)$ from $Hist_{v_k}$, the position histogram of node $v_k$ in V. Let $Ch_V = \{(i_1, j_1), \ldots, (i_n, j_n)\}$ denote the set of grid cells chosen in this manner, one cell for each node from node set V. The estimate of the selected grid cells joined with respect to the pattern (V, E) is denoted by $E(Ch_V)$, and

$$Est(V, E) = \sum E(Ch_V)$$

over all combinations of $i_1, j_1, \ldots, i_n, j_n$. We are going to prove that for the selected grid cells $Ch_V$,

$$E(Ch_V) = \prod_{p \in 1..n} Hist_p[i_p][j_p] \times \prod_{q \in 1..n-1} Factor(e_q, i_{e_{qa}}, j_{e_{qa}}, i_{e_{qd}}, j_{e_{qd}}),$$

regardless of the order in which the estimation is computed and the node on which the result histogram is based.

Here, $Factor(e_q, i_{e_{qa}}, j_{e_{qa}}, i_{e_{qd}}, j_{e_{qd}})$ is defined as follows:

• $e_q \in E$ is an edge in the pattern, with ancestor node $e_{qa}$ and descendant node $e_{qd}$.

• $(i_{e_{qa}}, j_{e_{qa}}) \in Ch_V$ is a grid cell randomly chosen from $Hist_{e_{qa}}$, and $(i_{e_{qd}}, j_{e_{qd}}) \in Ch_V$ is a grid cell randomly chosen from $Hist_{e_{qd}}$.

• The value of $Factor(e, i_a, j_a, i_d, j_d)$ is defined as:

– value = 0, when $i_a > i_d$ or $j_a < j_d$ (the descendant cell is not enclosed by the ancestor cell)

– value = 1/4, when $i_a = i_d$, $j_a = j_d$, and $i_a \neq j_a$

– value = 1/12, when $i_a = j_a = i_d = j_d$

– value = 1, otherwise.

We prove this by induction.

• Base case:

For a two-node twig join pattern, the result of ancestor-based estimation is the same as that of descendant-based estimation. Let (V, E) be a two-node twig pattern, where $V = \{v_1, v_2\}$, $E = \{e\}$, and $e = (v_1, v_2)$. Two grid cells are randomly chosen from $Hist_{v_1}$ and $Hist_{v_2}$; they are $Hist_{v_1}[i_1][j_1]$ and $Hist_{v_2}[i_2][j_2]$. So, $Ch_V = \{(i_1, j_1), (i_2, j_2)\}$.

Based on the formulae for pH-Join, we have, for ancestor-based estimation,

$$E_A(Ch_V) = Hist_{v_1}[i_1][j_1] \times Hist_{v_2}[i_2][j_2] \times Factor(e, i_1, j_1, i_2, j_2),$$

and for descendant-based estimation,

$$E_D(Ch_V) = Hist_{v_2}[i_2][j_2] \times Hist_{v_1}[i_1][j_1] \times Factor(e, i_1, j_1, i_2, j_2),$$

so $E_A(Ch_V) = E_D(Ch_V) = E(Ch_V)$.

• Inductive case:

Assume that for any tree pattern (V,E)with t (< n) nodes, ChV is the set of ran-domly selected grid cells, then,E(ChV ) =

∏

p∈1..t Histp[ip][jp]×

∏

q∈1..t−1 Factor(eq, ieqa, jeqa

, ieqd, jeqd

)regardless of the join order and the node onwhich the result histogram is based.

For a pattern (V,E) with n nodes, ChV isthe set of selected grid cells. Given a joinplan, e is the edge on which the last joinoperation is based on. e divides the patterninto two sub-pattern (V1, E1) and (V2, E2),both of them have less than n nodes, andV1 ∪ V2 = V , E1 ∪ E2 ∪ e = E. Assumethat the two nodes at the end of edge e areva(∈ V1), vb(∈ V2).

By the inductive assumption, the estimate for the sub-pattern $(V_1, E_1)$, based on $v_a$, regardless of the join order, is

$E(Ch_{V_1}) = \prod_{(i,j) \in Ch_{V_1}} Hist_p[i][j] \times \prod_{e \in E_1} \mathrm{Factor}(e, i_{e_a}, j_{e_a}, i_{e_d}, j_{e_d})$

Similarly, the estimate for the sub-pattern $(V_2, E_2)$, based on $v_b$, regardless of the join order, is

$E(Ch_{V_2}) = \prod_{(i,j) \in Ch_{V_2}} Hist_p[i][j] \times \prod_{e \in E_2} \mathrm{Factor}(e, i_{e_a}, j_{e_a}, i_{e_d}, j_{e_d})$.

Joining the two intermediate results above on edge $e$, the final result of this join plan is

$E' = E(Ch_{V_1}) \times E(Ch_{V_2}) \times \mathrm{Factor}(e, i_{e_a}, j_{e_a}, i_{e_d}, j_{e_d}) = \prod_{p \in 1..n} Hist_p[i_p][j_p] \times \prod_{q \in 1..n-1} \mathrm{Factor}(e_q, i_{e_{qa}}, j_{e_{qa}}, i_{e_{qd}}, j_{e_{qd}}) = E(Ch_V)$,

as defined at the very beginning.

3.5. Compound Predicates

Often, the predicates applied at a node may not belong to the set of basic predicates P. In such a case, there may be no precomputed position histogram of start and end positions for nodes satisfying the specified predicate. However, if the specified predicate can be expressed as a boolean combination of basic predicates, we can estimate a position histogram assuming independence (between basic predicate components of the compound predicate) within a grid cell. Note that the basic predicates do not have to be independent over the entire data set; their respective position histograms will capture any correlation. Our independence assumption here is within a single grid cell, and is justifiable in exactly the same way as a uniformity assumption within a single histogram bucket is justified without requiring that the entire data set be uniform globally.

To be able to manipulate counts, we need to convert these into the appropriate probabilities. What we require is the probability of a node satisfying some basic predicate, given that it is in a specific histogram grid cell (start and end position). For this purpose, we can compute a position histogram for the predicate “TRUE”, including all elements in the database, and simply use their start and end positions to obtain the needed grid cell counts. For each grid cell, this count is the appropriate normalization constant. Now, given a specific grid cell, we can “normalize” the count associated with any basic predicate, dividing by the normalization constant, to obtain the probability of the basic predicate being satisfied by a point in that grid cell. For a compound predicate, we can manipulate probabilities assuming independence (taking the product for intersection, the sum minus the product for union, and one minus the original for negation). The final probability for the compound predicate can be “de-normalized” by multiplying it by the normalization constant.

For example, consider two histograms for the predicates P1 and P2, corresponding to the predicates “author contains ‘Jane’ ” and “author contains ‘Doe’ ” respectively. To compute the position histogram for the predicate “author contains ‘Jane’ or author contains ‘Doe’ ”, we use the position histograms on P1 and P2, and the position histogram on the predicate “TRUE”, as shown in Figure 10.


$Hist_{P_3}[i][j] = Hist_{P_1 \cup P_2}[i][j] = Hist_{TRUE}[i][j] \times \left\{ \frac{Hist_{P_1}[i][j]}{Hist_{TRUE}[i][j]} + \frac{Hist_{P_2}[i][j]}{Hist_{TRUE}[i][j]} - \left[ \frac{Hist_{P_1}[i][j]}{Hist_{TRUE}[i][j]} \times \frac{Hist_{P_2}[i][j]}{Hist_{TRUE}[i][j]} \right] \right\}$

Figure 10. Formula for Computing Histogram of Compound Predicate
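The normalize, combine, de-normalize procedure can be sketched as follows. This is a minimal illustration rather than the TIMBER implementation: the helper names, the list-of-lists histogram layout, and the sample counts are all assumptions introduced here.

```python
def combine_cell(total, count1, count2, op):
    """Combine two basic-predicate counts in one grid cell, assuming independence."""
    if total == 0:
        return 0.0                      # empty cell: nothing can satisfy the predicate
    p1, p2 = count1 / total, count2 / total   # normalize counts to probabilities
    if op == "and":
        p = p1 * p2                     # intersection: product
    elif op == "or":
        p = p1 + p2 - p1 * p2           # union: sum minus product
    elif op == "not":
        p = 1.0 - p1                    # negation of the first predicate
    else:
        raise ValueError(f"unknown operator: {op}")
    return total * p                    # de-normalize back to a count


def compound_histogram(hist_true, hist_p1, hist_p2, op):
    """Apply combine_cell to every grid cell of same-shaped histograms."""
    return [
        [combine_cell(t, a, b, op) for t, a, b in zip(row_t, row_1, row_2)]
        for row_t, row_1, row_2 in zip(hist_true, hist_p1, hist_p2)
    ]


# Hypothetical 2x2 position histograms (rows: start bucket, columns: end bucket).
hist_true = [[10, 4], [0, 5]]   # all elements
hist_jane = [[5, 2], [0, 0]]    # "author contains 'Jane'"
hist_doe = [[2, 4], [0, 5]]     # "author contains 'Doe'"
hist_jane_or_doe = compound_histogram(hist_true, hist_jane, hist_doe, "or")
```

In cell (0, 0), for instance, the probabilities are 0.5 and 0.2, so the union probability is 0.5 + 0.2 - 0.1 = 0.6 and the estimated count is 6 of the 10 elements.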

3.6. Predicate Set Selection

Compound predicates can arise not only because the query expression has a compound predicate in it, but also because of the choices made in defining the set P of basic predicates. Predicates in XML queries fall into two general categories:

Element-Tag Predicates These predicates are defined on the element tags. An example of such a predicate is elementtag = faculty. Element-tag predicates are likely to be common in XML queries, and are good candidates for building position histograms on. Usually, there are not many element tags defined in an XML document, so it is easy to justify the storage requirement of one histogram for each such predicate and build a histogram on each one of these distinct element tags.

Element-Content Predicates These predicates specify either an exact or partial match on the contents of an element. For example, text nodes with a parent node year are numerical values (integers) within a small range. It is not unreasonable to build a histogram for each of these values. In some cases, some part of the content has some general meaning, and tends to be queried extensively. It would be helpful to set a predicate that evaluates to true if the prefix (suffix) of the content of a text element matches a certain value. We will see some examples of both in Section 5.

It is likely that such predicates far outnumber the element-tag predicates, and position histograms will only be built on element-content predicates that occur frequently. In any event, minimizing error in the estimation of these values is likely to be more important than errors in estimates of less frequent items. The value of this general concept has been amply demonstrated in the context of end-biased histograms [8].

Also, in cases where the predicate is over an attribute that takes a large number of values, standard histogram bucketizing techniques can be used to capture the number of elements that possess attributes in appropriate regions of values, and these can be used to estimate the number that satisfy any chosen range. In our case, each such histogram bucket becomes a “basic predicate”. Given any predicate over values of this attribute, standard histogram estimation techniques can be used to evaluate the necessary “compound predicate”. As an example, consider a bibliography database where each book element has an attribute for the number of pages in the book. We may bucketize this attribute value into disjoint adjacent ranges, such as 0-100, 101-200, 201-300, and so on. Now, given a query predicate that asks for books that have between 180 and 350 pages, we know how to estimate this number. For the position histograms introduced in this paper, we build one for each attribute value range bucket above. For any position histogram grid cell in question, the appropriate value histogram numbers can be combined to produce an appropriate estimate for the specific compound predicate position histogram.
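The page-count example can be worked through for a single grid cell. The sketch below assumes integer page counts spread uniformly within each bucket (the usual value-histogram assumption); the function names, the fourth bucket 301-400, and the per-bucket cell counts are hypothetical.

```python
def bucket_overlap_fraction(bucket, query):
    """Fraction of a bucket's integer values that fall inside the query range."""
    lo, hi = bucket
    q_lo, q_hi = query
    covered = min(hi, q_hi) - max(lo, q_lo) + 1   # number of shared integer values
    return max(0, covered) / (hi - lo + 1)


def range_predicate_cell(bucket_cell_counts, query):
    """Estimated count for a range predicate in one position-histogram grid cell."""
    return sum(
        count * bucket_overlap_fraction(bucket, query)
        for bucket, count in bucket_cell_counts.items()
    )


# Counts of book elements in one grid cell, one entry per page-count bucket.
cell_counts = {(0, 100): 10, (101, 200): 100, (201, 300): 40, (301, 400): 20}
estimate = range_predicate_cell(cell_counts, (180, 350))
```

Here the query covers 21 of the 100 values in bucket 101-200, all of bucket 201-300, and half of bucket 301-400, so the cell estimate is 21 + 40 + 10 = 71. Repeating this for every grid cell yields the position histogram of the compound range predicate.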


4. Factoring in Schema Information

Up to this point, we assumed that the data was uniformly distributed within any grid cell, and this is indeed a reasonable thing to do if no other information is available. However, we may frequently have information from the schema that can substantially modify our estimate.

For instance, if we know that no node that satisfies predicate P2 can be a descendant of a node that satisfies P1, then the estimate for the number of results for a query that asks for P1 satisfied at a node that is an ancestor of P2 is simply zero; there is no need to compute histograms. Similarly, if we know that each element with tag author must have a parent element with tag book, then the number of pairs with book as ancestor and author as descendant is exactly equal to the number of author elements.

We recommend that such schema information be brought to bear when possible. Our work here concerns itself with the vast majority of the cases where schema information alone is insufficient.

4.1. No Overlap

We frequently know, for a given predicate, that two nodes satisfying the predicate cannot have any ancestor-descendant relationship. For instance, in Figure 1, a faculty node cannot contain another faculty node. It follows that there can be no node that is a descendant of two distinct faculty nodes. (For instance, a particular TA node can appear under at most one faculty node.) In such situations, the uniformity assumption within a histogram grid cell can lead to erroneous estimates. We present, in this section, an alternative estimation technique appropriate when the ancestor node predicate in a primitive two-node pattern has the no-overlap property. It turns out that a no-overlap property on the descendant node in the pattern has no impact on the estimation, since multiple descendants could still pair with the same (set of nested) ancestor node(s).

Definition 4.1 A predicate P is said to have the no-overlap property if, for all distinct elements x, y such that P(x) and P(y) are TRUE, we have: endpos(x) < startpos(y) or endpos(y) < startpos(x).
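Definition 4.1 can be checked directly on the (startpos, endpos) pairs of the elements satisfying a predicate. The following is a small sketch; the function name and the sample positions are made up for illustration.

```python
def has_no_overlap(intervals):
    """True if no two (startpos, endpos) intervals overlap or nest."""
    ordered = sorted(intervals)  # sort by start position
    # After sorting, it is enough to compare each interval with its successor:
    # if end_k < start_(k+1), then end_k is also below every later start.
    return all(
        prev_end < next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


faculty_like = [(1, 4), (5, 8), (9, 10)]   # siblings: disjoint intervals
nested = [(1, 10), (2, 3)]                 # second element nested in the first
```

Sorting first keeps the check at O(k log k) for k matching elements instead of comparing all pairs.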

4.2. Summary Data Structure for Predicates with No-Overlap

For a primitive pattern with a no-overlap ancestor node a, the number of occurrences is upper-bounded by the count of the descendant node d in the pattern (since each descendant node may join with at most one ancestor node). The question is how to estimate the fraction of the descendant nodes that participate in the join pattern. Within any one grid cell, the best one can do is to determine what fraction of the total nodes in the cell are descendants of a, and assume that the same fraction applies to d nodes. We call this fraction the coverage of a in that particular cell. Thus, our technique for dealing with the no-overlap situation is to keep additional information, in the form of coverage. Formally, we define the coverage histogram for predicate P: $Cvg_P[i][j][m][n]$ is the fraction of nodes in grid cell (i, j) that are descendants of some node that satisfies P and falls in grid cell (m, n).

At first glance, it may appear that the storage requirements here are exorbitant: rather than store counts for each grid cell, we are keeping information for cell pairs. However, for a given grid cell r in the position histogram, consider its coverage in grid cell s. The coverage fraction is guaranteed to be one whenever cell s is both to the right of and below r, and the coverage fraction is obviously zero for cells that cannot include descendants of elements in r. As such, it is only for the cells s along the “border” that one is required to keep coverage information explicitly. In fact, one can establish the following results:

Lemma 4.1 In a position histogram for a no-overlap predicate, every off-diagonal grid cell has count at most one.

Proof: Let $[s_{min}, s_{max})$ and $[e_{min}, e_{max})$ be the start and end position ranges of the cell respectively. For an off-diagonal grid cell, we must have $e_{min} \geq s_{max}$. Suppose there are two nodes u and v in this cell, with positions $(s_u, e_u)$ and $(s_v, e_v)$ respectively, where, without loss of generality, u precedes v. Due to the no-overlap property, we must have $s_v > e_u$. But we cannot then have both $s_v < s_{max}$ and $e_u \geq e_{min}$. Hence proved by contradiction.

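[Figure 11: position-histogram grid with axes Start Position and End Position, origin 0; the legend distinguishes the grid cell in focus, cells with Coverage = 1, cells with Coverage = 0, and cells with 0 < Coverage < 1.]

Figure 11. Coverage Histogram for No-overlap Predicate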

Theorem 4.1 In a g×g grid, the number of coverage histogram cell pairs with partial (non-zero and non-one) coverage is O(g). In other words, the coverage histogram requires only O(g) storage.

Proof: For a non-zero grid cell (i, j), we say that it covers the grid cell (p, q) if $p \geq i$ and $q \leq j$. Among all the grid cells covered by (i, j), let us denote the number of on-diagonal grid cells by D(i, j):

$D(i, j) = |\{(p, q) \mid p = q,\ (p, q) \text{ is covered by } (i, j)\}| = j - i + 1$.

Consider a non-zero grid cell (i, j) in the position histogram. In the coverage histogram,

• $Cvg[p][q][i][j]$ is zero when $p < i$ or $q > j$.

• $Cvg[p][q][i][j]$ is one when $p > i$ and $q < j$.

• Partial coverage can happen only in the remaining grid cells (that is, $p = i, q \leq j$ or $p \geq i, q = j$); the number of these cells is $2(j - i) + 1 \leq 2 \times D(i, j)$.

With the no-overlap property, the on-diagonal grid cells covered by two non-zero grid cells cannot overlap, except on the boundary. That is, for any two non-zero grid cells $(i_1, j_1)$, $(i_2, j_2)$,

$|\{(p, q) \mid (p, q) \text{ is covered by } (i_1, j_1) \wedge (p, q) \text{ is covered by } (i_2, j_2)\}| \leq 1$.

From this, we have: for any two non-zero grid cells $(i_1, j_1)$, $(i_2, j_2)$, $i_1 \leq i_2 \Rightarrow j_1 \leq i_2$. In a g × g position histogram,

$\sum_{\text{non-zero grid cell } (i,j)} D(i, j) \leq 4g$,

so the number of cell pairs with partial coverage satisfies

$P(Cvg) = \sum_{(i,j):\ Hist[i][j] \neq 0} 2 \times D(i, j) \leq 8g = O(g)$.
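The three-way case split used in the proof is easy to check by enumeration. The sketch below is illustrative only: the function names are assumptions, and it considers only cells with start bucket at most end bucket, since an element's start position never exceeds its end position.

```python
def coverage_kind(p, q, i, j):
    """Classify cell (p, q) relative to a non-zero ancestor cell (i, j)."""
    if p < i or q > j:
        return "zero"       # (p, q) is not covered by (i, j)
    if p > i and q < j:
        return "one"        # strictly inside: full coverage
    return "partial"        # on the border: coverage must be stored explicitly


def partial_cells(i, j, g):
    """All valid cells (p <= q) of a g x g grid that may have partial coverage."""
    return [
        (p, q)
        for p in range(g)
        for q in range(p, g)
        if coverage_kind(p, q, i, j) == "partial"
    ]


border = partial_cells(2, 5, 8)   # hypothetical focus cell (2, 5) in an 8 x 8 grid
```

For the focus cell (2, 5) the border has 2(5 - 2) + 1 = 7 cells, matching the count in the proof; only these cells need an explicit coverage value.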

4.3. Estimation Algorithm for No-overlap Predicates

Consider descendant-based estimation (with the descendant node as the outer loop) first. For each grid cell s in the grid with a non-zero count of nodes satisfying P2, as determined from the position histogram, we must determine the fraction of these nodes that have an ancestor that satisfies P1. (Since P1 has the no-overlap property, there can be no more than one such ancestor.) This is easily obtained as the sum of the coverage histogram values for cell pairs (r, s), where r contains nodes satisfying P1, over all the grid cells r to the upper left of grid cell s. The assumption here is that for any node in grid cell r, the fraction of P2-satisfying nodes in s that are descendants of a P1-satisfying node in r is the same as the fraction of all nodes in s that are descendants of a P1-satisfying node in r. This gives us a position histogram of joined pairs for the two predicates, with counts of joined pairs attributed to the grid cell in which the descendant node occurs.

Turn now to ancestor-based estimation (with the ancestor node as the outer loop), where the counts of joined pairs in the result histogram are to be attributed to the grid cell in which the ancestor occurs. For a grid cell r, this is obtained by summing, over all the grid cells s to the right of and below grid cell r, the predicate P1 coverage histogram value for the cell pair (r, s) multiplied by the count in grid cell s of the predicate P2 position histogram. Once again, this gives us a position histogram of joined pairs for the two predicates, but this time with counts of joined pairs attributed to the grid cell in which the ancestor node occurs.

Ancestor-based pattern count estimate:

$Est_{AB}[i][j] = Jn\_Fct_{A\_P_1}[i][j] \times \sum_{\substack{m = i..j \\ n = m..j}} \left( Cvg_{A\_P_1}[m][n][i][j] \times Hist_{B\_P_2}[m][n] \times Jn\_Fct_{B\_P_2}[m][n] \right)$

Descendant-based pattern count estimate:

$Est_{AB}[i][j] = Hist_{B\_P_2}[i][j] \times Jn\_Fct_{B\_P_2}[i][j] \times \sum_{\substack{m = 0..i \\ n = j..max\_y}} \left( Cvg_{A\_P_1}[i][j][m][n] \times Jn\_Fct_{A\_P_1}[m][n] \right)$

Join factor estimate:

$Jn\_Fct_{AB\_P_x}[i][j] = \frac{Est_{AB}[i][j]}{Hist_{AB\_P_x}[i][j]}$ if $Hist_{AB\_P_x}[i][j] > 0$, and $= 0$ otherwise

Participation estimation:

case 1: the node ($P_x$) that the estimation is based on can overlap

$Hist_{AB\_P_x} = Est_{AB}$

case 2: $P_1$ is no-overlap, estimation is ancestor-based

$N[i][j] = Hist_{A\_P_1}[i][j]$

$M[i][j] = \sum_{\substack{m = i..j \\ n = m..j}} Hist_{B\_P_2}[m][n]$

$Hist_{AB\_P_1}[i][j] = N[i][j] \times \left( 1 - \left( \frac{N[i][j] - 1}{N[i][j]} \right)^{M[i][j]} \right)$

case 3: $P_2$ is no-overlap, estimation is descendant-based

$Hist_{AB\_P_2}[i][j] = Hist_{B\_P_2}[i][j] \times \sum_{\substack{m = 0..i \\ n = j..max\_y}} \left( notzero(Hist_{A\_P_1}[m][n]) \times Cvg_{A\_P_1}[i][j][m][n] \right)$

Here, the function $notzero(x) = 1$ if $x \neq 0$, and $= 0$ otherwise.

Coverage estimation:

case 1: $P_1$ is no-overlap, the join is ancestor-based

$Cvg_{AB\_P_1}[i][j][m][n] = Cvg_{A\_P_1}[i][j][m][n] \times \frac{Hist_{AB\_P_1}[m][n]}{Hist_{A\_P_1}[m][n]}$

case 2: $P_2$ is no-overlap, and the join is descendant-based

$Cvg_{AB\_P_2}[i][j][m][n] = Cvg_{B\_P_2}[i][j][m][n] \times \frac{Hist_{AB\_P_2}[i][j]}{Hist_{B\_P_2}[i][j]} \times notzero(Hist_{A\_P_1}[m][n])$

Notation:

$Est_{AB}$: estimation histogram of the pattern obtained by joining subpatterns A and B.

$Hist_{AB\_P_x}$: number of nodes that satisfy $P_x$ and participate in the join of A and B.

$Jn\_Fct_{AB\_P_1}$: number of nodes satisfying $P_2$ that join with each distinct node that satisfies $P_1$, in the join of subpatterns A and B.

$Cvg_{AB\_P_1}$: the coverage histogram of the distinct nodes that satisfy $P_1$ and participate in the join of subpatterns A and B.

Figure 12. Estimation Formulae for No-Overlap Predicates
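For a primitive two-node pattern, the two estimation directions can be sketched as below. This is a simplified illustration under several assumptions that are not part of the original formulae: all join factors are taken to be 1 (base predicates), the coverage histogram is a hypothetical dictionary keyed by (i, j, m, n) that stores every non-zero coverage value explicitly, it contains entries only for ancestor cells that actually hold P1 nodes, and the sample numbers are made up.

```python
def descendant_based(hist_p2, cvg_p1):
    """Joined-pair counts attributed to the descendant's grid cell (i, j)."""
    g = len(hist_p2)
    est = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i, g):
            # Ancestor cells (m, n) lie to the upper left: m <= i and n >= j.
            covered = sum(
                cvg_p1.get((i, j, m, n), 0.0)
                for m in range(0, i + 1)
                for n in range(j, g)
            )
            est[i][j] = hist_p2[i][j] * covered
    return est


def ancestor_based(hist_p2, cvg_p1):
    """Joined-pair counts attributed to the ancestor's grid cell (i, j)."""
    g = len(hist_p2)
    est = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i, g):
            # Descendant cells (m, n) satisfy i <= m <= n <= j.
            est[i][j] = sum(
                cvg_p1.get((m, n, i, j), 0.0) * hist_p2[m][n]
                for m in range(i, j + 1)
                for n in range(m, j + 1)
            )
    return est


# Hypothetical 2x2 grid: one P1 node in cell (0, 1) covers half of the nodes
# in cell (0, 0) and 30% of the nodes in cell (1, 1).
cvg_p1 = {(0, 0, 0, 1): 0.5, (1, 1, 0, 1): 0.3}
hist_p2 = [[4, 0], [0, 10]]
by_descendant = descendant_based(hist_p2, cvg_p1)
by_ancestor = ancestor_based(hist_p2, cvg_p1)
total_desc = sum(map(sum, by_descendant))
total_anc = sum(map(sum, by_ancestor))
```

Both directions yield the same total of 5 joined pairs; they differ only in the grid cell to which each pair is attributed, which matters when the result histogram feeds a later join.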

To be able to evaluate estimates for complex patterns, by building up the patterns one edge at a time, we need to create as intermediate results not just estimates of position (count) histograms but also of coverage histograms. In addition, we now have a further complication having to do with the multiplicity of the join node versus the count of the (intermediate) pattern as a whole. These two are not necessarily the same, and their ratio is called the Join Factor.

Consider a simple three-node twig join query, with nodes A, B and C as shown in Figure 14 (a), where node A has the no-overlap property. Assume that we compute the A-B pair estimate first. After we get the estimate for the count of A-B pairs, we need to estimate the answer size of the join between the result of the A-B pair and C. Here, what we need is the number of distinct A's participating in the A-B join ($Hist_{AB\_A}$, called the participation histogram), the number of B's (on average) that join with each distinct A node in an A-B pair ($Jn\_Fct_{AB\_A}$), and the coverage information of the distinct A's participating in the A-B join ($Cvg_{AB\_A}$), all at the grid cell level. An ancestor-based position histogram estimation, described two paragraphs above, would merely compute the product of $Hist_{AB\_A}$ and $Jn\_Fct_{AB\_A}$. We can divide this product by $Hist_{AB\_A}$, estimated using a Binomial distribution, to get the needed join factor value, to be used in the subsequent (A-C) join. These ideas, modulo a few messy details, lead to the estimation formulae presented in Figure 12.
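The participation and join-factor formulae of Figure 12 reduce to a few lines per grid cell. The sketch below reads the participation formula as "M descendants each attach to one of N candidate ancestors, chosen uniformly", which is one interpretation of the Binomial argument rather than something the text spells out; the function names and the sample values are hypothetical.

```python
def participation(n_ancestors, m_descendants):
    """Expected number of distinct ancestors hit: N * (1 - ((N - 1) / N) ** M)."""
    if n_ancestors == 0:
        return 0.0
    miss = (n_ancestors - 1) / n_ancestors        # chance one descendant misses a given ancestor
    return n_ancestors * (1 - miss ** m_descendants)


def join_factor(pattern_estimate, participating):
    """Average number of joined partners per distinct participating node."""
    return pattern_estimate / participating if participating > 0 else 0.0


# One grid cell: 4 candidate A nodes, 6 candidate B descendants below them,
# and an estimated 6 A-B pairs attributed to this cell.
distinct_a = participation(4, 6)
b_per_a = join_factor(6.0, distinct_a)
```

With a single ancestor every descendant must attach to it, so the participation is exactly 1; with no descendants it is 0. These boundary cases are a quick sanity check on the formula.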

In the formulae in Figure 12, what we are dealing with is not just a three-node twig as stated above, but a more general pattern where two sub-patterns named A and B are joined, with an ancestor-descendant relationship between nodes satisfying P1 in subpattern A and nodes satisfying P2 in subpattern B. What we have are the participation histogram, join factor and coverage histogram for the nodes that satisfy P1 in subpattern A, as well as those histograms for nodes that satisfy P2 in subpattern B. All of them come from previous estimation computation. What we are estimating here is the estimation histogram of AB (sub-pattern A join B), the participation histogram of Px (x in 1, 2), and the join factor and coverage histogram of Px in AB.

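[Figure 13: two coverage grids, labelled "Coverage for (0,0)" and "Coverage for (1,1)"; the partial-coverage values legible in the figure are 0.3 and 0.5.]

Figure 13. Example Coverage Histogram for faculty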

Example 4.1 Let’s go back to the example XML document again, and estimate the result size for the same query pattern. This time, the no-overlap estimation algorithm is used. The Coverage Histogram of the predicate “element tag = faculty” is shown in Figure 13. The estimate we get is 1.9, almost the same as the real result size.

5. Experimental Evaluation

We tested our estimation techniques extensively on a wide variety of both real and synthetic data sets. First, we report on the accuracy of the estimates obtained. Later, we present results on the storage size and the impact of storage size on the accuracy of the estimate.

5.1. The DBLP Data Set

We ran experiments on several well-known XML data sets, including the XMark Benchmark [15] and the Shakespeare play data set [19]. Results obtained in all cases were substantially similar. In the interests of space, we present results only for the DBLP data set [18], which is probably the most familiar to readers of this paper. The DBLP data set is 9M bytes in size and has approximately 0.5M nodes.


For the DBLP data set, we picked a mix of element-tag and element-content predicates and built histograms on exact matching of all the element tags, the content value of years, and the prefix matching of the content of ‘cite’ (e.g., conf, journal). A few of these predicates, along with the count of the nodes that match each predicate and the overlap property of the predicate, are summarized in Table 1. Note that the predicates 1990’s and 1980’s are compound predicates, obtained by adding up the 10 corresponding primitive histograms for element-content predicates (e.g., 1990, 1991, ...). In all, there are 63 predicates, and the total size of all the corresponding histograms adds up to about 6K bytes, roughly 0.7% of the data set size. (We used 10×10 histograms in all experiments, except where explicitly stated otherwise.)

5.1.1. Estimating Simple Query Answer Sizes

We tested the effectiveness of position histograms on a number of queries using combinations of predicates from Table 1. In the interest of space, we only present results for a few representative queries in Table 2. The first row of this table considers a query pattern where an element with the author tag appears below an element with the article tag. Other rows consider similar simple queries.

Without the position histograms, and without any schema information, a (very) naive estimate for the answer size is the product of the node counts for the two predicates (e.g., article and author). The naive estimate is far from the real result, since it does not consider the structural relationship between nodes. With the schema information and no position histogram, if the ancestor node has the no-overlap property, the best (upper-bound) estimate of the result size is the number of descendants involved in the join. When position and coverage histograms are available, the overlap or no-overlap estimation algorithms can be used. When no schema information is available, using position histograms and the primitive pH-Join estimation algorithm brings the estimate closer to the real answer size. In some cases, the primitive estimation is better than the upper-bound estimation using only the schema information, while the no-overlap estimation using the position histogram and coverage histogram gives almost exactly the right answer size.
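The gap between these baselines can be reproduced from the article-author row of Table 2 with simple arithmetic. The helper names below are hypothetical; the numbers are taken from Tables 1 and 2.

```python
def naive_estimate(ancestor_count, descendant_count):
    """Cartesian-product baseline: ignores all structural information."""
    return ancestor_count * descendant_count


def ratio_to_real(estimate, real):
    """How many times larger (or smaller) an estimate is than the real size."""
    return estimate / real


article_count, author_count = 7_366, 41_501      # node counts from Table 1
real_size = 14_644                               # real article-author result size

naive = naive_estimate(article_count, author_count)
upper_bound = author_count                       # schema-only bound: one ancestor per author
overlap_estimate = 2_415_480                     # primitive pH-Join estimate (Table 2)
no_overlap_estimate = 14_627                     # no-overlap estimate (Table 2)

ratios = {
    "naive": ratio_to_real(naive, real_size),
    "upper bound": ratio_to_real(upper_bound, real_size),
    "overlap": ratio_to_real(overlap_estimate, real_size),
    "no-overlap": ratio_to_real(no_overlap_estimate, real_size),
}
```

The naive product is more than four orders of magnitude too large, the schema-only bound is within a factor of three, and the no-overlap estimate is within about 0.2% of the real size.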

Finally, the time spent on estimating the result size of a simple twig query pattern, in all cases, using both the overlap algorithm and the no-overlap algorithm, is only a few tenths of a millisecond, which is very small compared to most database operations.

5.2. Answer Size Estimation for Complex Queries

For complex queries, the final answer size estimate is computed by building the complex pattern one edge at a time and computing the estimates as we keep adding new edges. A set of supporting information (histogram for the joined nodes, join factor, etc.) is estimated and carried on from one simple join to another. In order to understand how the structure and the size of the pattern may affect the outcome of the estimation, we designed a set of queries with a three-node twig pattern, and a more complex pattern (with six nodes). The results of these cases are presented below.

[Figure 14: (a) Three-Node Pattern, with nodes A, B and C; (b) A Complex Query Pattern, with nodes A, B, C, D, E and F.]

Figure 14. Example Query Pattern

5.2.1. Three-Node Twig Queries

A three-node twig query pattern, as shown in Figure 14 (a), is an archetype for more complex patterns. We are required to compute at least one ancestor-based estimate, and use this in conjunction with the other descendant to obtain the final estimate. When that ancestor node has the no-overlap property, we not only estimate the number of answers of the sub-query, but also estimate the required supporting information. A number of three-node twig queries were run against the DBLP data set. The results for a few representative queries are shown in Table 3. The node names A, B, C refer to positions in Figure 14 (a). Results are presented not just for the entire pattern, but also for the intermediate results obtained if the AB edge or the AC edge is evaluated first. As expected, the final estimates obtained are virtually identical irrespective of how they were obtained: the (AB)C estimate first computes AB and then joins C to the result, whereas the (AC)B estimate first computes AC. The time to obtain these estimates is once again very small (approximately 1 millisecond each).

| Pred Name | Predicate | Node Count | Overlap Property |
|---|---|---|---|
| article | element tag = “article” | 7,366 | no overlap |
| author | element tag = “author” | 41,501 | no overlap |
| book | element tag = “book” | 408 | no overlap |
| cdrom | element tag = “cdrom” | 1,722 | no overlap |
| cite | element tag = “cite” | 33,097 | no overlap |
| title | element tag = “title” | 19,921 | no overlap |
| url | element tag = “url” | 19,542 | no overlap |
| year | element tag = “year” | 19,914 | no overlap |
| conf | text start-with “conf” | 13,609 | N/A |
| journal | text start-with “journal” | 7,834 | N/A |
| 1980’s | compound | 13,066 | N/A |
| 1990’s | compound | 3,963 | N/A |

Table 1. Characteristics of Some Predicates on the DBLP Data Set

| Ance | Desc | Naive Estimate | Desc Num | Overlap Estimate | Overlap Est Time | No-Overlap Estimate | No-Overlap Est Time | Real Result |
|---|---|---|---|---|---|---|---|---|
| article | author | 305,696,366 | 41,501 | 2,415,480 | 0.000344 | 14,627 | 0.000263 | 14,644 |
| article | cdrom | 12,684,252 | 1,722 | 4,379 | 0.000290 | 112 | 0.000261 | 130 |
| article | cite | 243,792,502 | 33,097 | 671,722 | 0.000229 | 3,958 | 0.000261 | 5,114 |
| book | cdrom | 702,576 | 1,722 | 179 | 0.000142 | 4 | 0.000259 | 3 |

Table 2. Result Size Estimation for Simple Queries on DBLP Data Set

5.2.2. Complex Queries

Intuitively, when the number of nodes in the query pattern increases, the accuracy of the answer size estimation is expected to decrease since errors propagate multiplicatively. We ran a number of complex queries on the DBLP data set. Surprisingly, we found that even with very complex queries, the estimation error is not bad (within a factor of two). The reason is that each simple join brings in some additional selectivity factors. If the data nodes for a specific predicate are clustered in a small region of the histogram space (e.g., element book appears only in the first 1/10 of the DBLP data set), then the histogram produced after joining with this predicate will have many empty grid cells. Any predicate in the complex pattern that has this property brings about not just a numeric reduction but also a “focussing” on the region of interest. In the queries that we present, the article predicate in Query1 and the book predicate in Query2 exhibit this behavior.

| Node A | Node B | Node C | AB Estimate | AB Real | AC Estimate | AC Real | (AB)C Est | (AC)B Est | ABC Real |
|---|---|---|---|---|---|---|---|---|---|
| article | author | cite | 14,627 | 14,644 | 3,960 | 5,114 | 11,619 | 11,619 | 14,259 |
| article | title | author | 7,323 | 7,366 | 14,627 | 14,644 | 15,321 | 15,341 | 14,644 |
| book | title | year | 610 | 408 | 607 | 408 | 470 | 471 | 408 |
| book | url | cdrom | 493 | 33 | 5 | 3 | 3 | 3 | 3 |
| proc | title | url | 19 | 32 | 11 | 28 | 11 | 11 | 28 |

Table 3. Result Size Estimation for Three Node Twig Queries on the DBLP Data Set

In the interest of space, the results of only two complex queries are shown in Table 4. The structure of the query pattern is shown in Figure 14 (b). Note that the computation time for the entire complex estimate is still only about 3 milliseconds.

5.3. Synthetic Data Set

Whereas our tests on real data give us confidence, real data sets like DBLP are limited in size and complexity. We wanted to understand how our techniques would do given a more complex situation, with deeply nested and repeating element tags. For this purpose we used the IBM XML generator [20] to create synthetic data using a realistic DTD involving managers, departments and employees, as shown in Figure 15.

<!ELEMENT manager (name, (manager | department | employee)+)>
<!ELEMENT department (name, email?, employee+, department*)>
<!ELEMENT employee (name+, email?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>

Figure 15. DTD Used in Generating Synthetic Data Set

The predicates that we consider for this DTD are summarized in Table 5.

On the synthetic data set, we ran all types of queries we presented above. Here, for lack of space, we present only the results of some representative simple queries in Table 6.

In this data set, some of the nodes have the no-overlap property and some do not. We obtain the estimate with the pH-Join algorithm for all the queries, and use the no-overlap estimation algorithm whenever possible. From Table 6 we can see that whenever the no-overlap property holds, the no-overlap estimation algorithm gives an estimate that is much closer to the real answer size than the one obtained by using the primitive pH-Join algorithm. For joins where the ancestor node does not have the no-overlap property, the primitive pH-Join algorithm computes an estimate that is very close to the real answer size. In spite of the deep recursion, the time to compute estimates remains a small fraction of a millisecond.

5.4. Storage Requirements

In this section, we present experimental results for the storage requirements of both position histograms and coverage histograms (recall that, per Theorem 4.1, we expect the storage requirement to be O(g), i.e., linear in the grid size). We also consider the impact of storage space on the accuracy of the estimates.

Figure 16 shows the effect of increasing grid size on the storage requirement and the accuracy of the estimate, for the department-email query on the synthetic data set. Since the predicate department does not have the no-overlap property, the department-email pair join does not require any coverage information; therefore, only position histograms are built on predicate department and predicate email. The storage requirement for both predicates is linear in the grid size, with a constant factor close to 2. The result estimate is not very good when the histogram is very small. However, the ratio of the estimate to the real answer size drops rapidly and is close to 1 for grid sizes larger than 10-20.

| Query | A | B | C | D | E | F | Estimate | Real Size | Est Time |
|---|---|---|---|---|---|---|---|---|---|
| Query1 | article | title | author | year | cite | conf | 2,421 | 5,826 | 0.003232 |
| Query2 | book | title | author | url | year | 1990’s | 52 | 44 | 0.002540 |

Table 4. Result Size Estimation for Complex Query on DBLP Data Set

| Pred Name | Predicate | Node Count | Overlap Property |
|---|---|---|---|
| manager | element tag = “manager” | 44 | overlap |
| department | element tag = “department” | 270 | overlap |
| employee | element tag = “employee” | 473 | no overlap |
| email | element tag = “email” | 173 | no overlap |
| name | element tag = “name” | 1,002 | no overlap |

Table 5. Characteristics of Predicates on the Synthetic Data Set

The article-cdrom join is an example of a query with the no-overlap property. Here, both predicates (article, cdrom) have the no-overlap property, and consequently, we store both a position histogram and a coverage histogram for each of them. The storage requirement of these two predicates, as well as the accuracy of the estimation, is shown in Figure 17. Note that the storage requirements for both the position histogram and the coverage histogram are linear in the grid size, so the total storage requirement grows linearly with a constant factor between 2 and 3. Another observation is that the estimate is not good when the grid size is very small, but it very quickly converges to the correct answer. Starting from the point where the grid size is larger than 5, the ratio of the estimate to the real answer size is within 1 ± 0.05, and stays in this range thereafter. The reason is that more information is captured by the coverage histogram than by the position histogram alone.

[Figure 16: two panels. "Storage Requirement" plots the storage requirement (bytes, 0 to 600) against grid size (0 to 50) for the department and email predicates. "Estimation Accuracy" plots the ratio estimate/real answer size (0 to 5) against grid size (0 to 50).]

Figure 16. Storage Requirement and Estimation Accuracy for Overlap Predicates (department-email)


| Ancestor | Descendant | Naive Estimate | Overlap Estimate | Overlap Est Time | No-Overlap Estimate | No-Overlap Est Time | Real Result |
|---|---|---|---|---|---|---|---|
| manager | dept | 11,880 | 656 | 0.000070 | N/A | N/A | 761 |
| manager | employee | 20,812 | 1,205 | 0.000054 | N/A | N/A | 1,395 |
| manager | email | 7,612 | 429 | 0.000052 | N/A | N/A | 491 |
| dept | employee | 127,710 | 2,914 | 0.000050 | N/A | N/A | 1,663 |
| dept | email | 46,710 | 1,082 | 0.000054 | N/A | N/A | 473 |
| employee | name | 473,946 | 8,070 | 0.000062 | 559 | 0.000082 | 688 |
| employee | email | 81,829 | 1,391 | 0.000054 | 96 | 0.000080 | 99 |

Table 6. Synthetic Data Set: Result Size Estimation for Simple Queries

[Figure 17: two panels. "Storage Requirement" plots the storage requirement (bytes, 0 to 1200) against grid size (0 to 50) for four series: pHist article, Coverage article, pHist cdrom, and Coverage cdrom. "Estimation Accuracy" plots the ratio estimate/real answer size (0 to 5) against grid size (0 to 50).]

Figure 17. Storage Requirement and Estimation Accuracy for No-Overlap Predicates (article-cdrom)

6. Parent-Child Queries

With the node numbering scheme used so far, ancestor-descendant relationships can be determined, but one cannot be sure whether any given pair of nodes has a direct parent-child relationship or has other intervening nodes. To be able to answer this question, we introduce a level number with each node. The root is at level 0, each of its children is at level 1, and so on. Every child of any parent node is at a level exactly one greater than the parent. Thus, we associate a triple, rather than a pair, of values with each node.

For each predicate selected, a separate two-dimensional histogram is created for each level. The number of parent-child relationships is determined by computing the number of ancestor-descendant relationships between a node at level k and a node at level k + 1, for all possible values of k. Since there is a different histogram for each value of k, this is easily accomplished using exactly the techniques discussed above.
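The level-by-level summation can be sketched as follows. The function name and the `estimate_ad` callable are assumptions made for illustration: `estimate_ad` stands in for whichever ancestor-descendant estimator (for example, a pH-join over two single-level histograms) is available, and the histogram type is left abstract.

```python
from typing import Callable, Dict, TypeVar

H = TypeVar("H")  # placeholder for a single-level position histogram


def estimate_parent_child(
    anc_by_level: Dict[int, H],
    desc_by_level: Dict[int, H],
    estimate_ad: Callable[[H, H], float],
) -> float:
    """Sum ancestor-descendant estimates over level pairs (k, k + 1)."""
    total = 0.0
    for k, anc_hist in anc_by_level.items():
        desc_hist = desc_by_level.get(k + 1)
        if desc_hist is not None:
            # Only descendants exactly one level below can be children.
            total += estimate_ad(anc_hist, desc_hist)
    return total


# Toy stand-in estimator: each "histogram" is just a node count,
# and the estimate is the product of the two counts.
toy_estimate = lambda a, d: float(a * d)
```

Levels at which a predicate has no nodes simply have no histogram, which is why the storage overhead stays below the full number of levels.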

The storage required now could increase by a factor equal to the number of levels in the XML tree. However, in practice, many predicates are satisfied only (or mostly) at a few levels in the tree, so the storage blow-up is much smaller. Furthermore, since there are fewer entries at each level, one can use coarser grid cells to obtain the same accuracy. As a consequence, the storage required to handle parent-child queries, while larger than for ancestor-descendant queries with the same estimate error, is not larger by a huge factor, as the experiments below demonstrate.

Finally, a couple of observations regarding accuracy. Irrespective of the schema, it is not possible for two nodes at the same level ever to overlap. As such, the no-overlap criteria apply to all predicates in the parent-child case, and must be used to get good estimates. Also, for ancestor-descendant queries, since level numbers are available, they can be used to prune out many candidate nodes. By making an estimate as a summation across levels, with each level paired only with higher (or lower, as the case may be) levels of the other predicate, one can frequently obtain ancestor-descendant estimates superior to those obtained with a single level-unaware structure.

Node A      Node B      A-D Estimate  P-C Estimate  P-C Real Size
manager     department  1,355          15            24
manager     employee    2,509          20            31
department  employee    3,858         199           442
department  email       1,452          73            74
department  name        8,137         348           270

Table 7. Synthetic Data Set: Result Size Estimation for Simple Queries. Histogram size 5x5

For the level histogram, experiments were run on the synthetic data set only, since the DBLP data set is never more than 5 levels deep. The answer size estimates are shown in Table 7, and the storage requirement for the histograms is shown in Table 8. When a parent-child query is asked without the level histogram, one can only compute the ancestor-descendant estimate. This estimate is shown in the column "A-D Estimate". As the table shows, the level histogram yields a much more accurate parent-child estimate. The extra storage cost, however, is linear in the grid size and in the number of levels (with a factor smaller than the number of levels in the database).
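To make the comparison concrete, the ratio of each estimate to the real parent-child answer size can be computed directly from the figures reported in Table 7. The helper name `ratio_table` is ours; the numbers are copied from the table.

```python
# (Node A, Node B, A-D estimate, P-C estimate, P-C real size), from Table 7
TABLE_7 = [
    ("manager", "department", 1355, 15, 24),
    ("manager", "employee", 2509, 20, 31),
    ("department", "employee", 3858, 199, 442),
    ("department", "email", 1452, 73, 74),
    ("department", "name", 8137, 348, 270),
]


def ratio_table(rows):
    """Map each node pair to (A-D estimate / real, P-C estimate / real)."""
    return {
        (a, b): (ad / real, pc / real)
        for a, b, ad, pc, real in rows
    }


if __name__ == "__main__":
    for pair, (ad_ratio, pc_ratio) in ratio_table(TABLE_7).items():
        print(pair, round(ad_ratio, 2), round(pc_ratio, 2))
```

A ratio of 1 means a perfect estimate; for every pair in the table the parent-child ratio is far closer to 1 than the ancestor-descendant ratio.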

Predicate   pHist  LvHist  Cvg
manager     9      30      43
department  8      38      72
employee    6      37      79
name        6      46      87
email       5      25      87

Table 8. Storage Requirement for Leveled Histogram and Level-wise Coverage

7. Estimation Using Non-Uniform Histograms

From the experiments on different data sets, we observe that nodes satisfying a given predicate tend to be clustered in a small portion of the XML document. If uniform histograms are built for all selected predicates (that is, the grid cell boundaries are the same for all the histograms), most of the histograms have only a few grid cells with non-zero counts (or even just one such grid cell).

In relational databases, there is considerable literature on the construction of good histograms in the one-dimensional case. There is even some literature on the construction of good histograms in the multi-dimensional case. While the specifics of these techniques are unimportant for our purpose here, being able to build a non-uniform histogram (a histogram with unequal grid cells) and to estimate answer sizes from such histograms would help obtain estimates that are closer to the real answer sizes.

7.1. Summary Data Structure

In a uniform histogram, the scale is defined by the number of grid cell boundaries on the X-axis and the Y-axis. It is much more complex for a non-uniform histogram.

Definition 7.1 The scale of a non-uniform histogram $Hist$ is $(N_x, Vec_x, N_y, Vec_y)$, where $N_x$ is the number of grid cell boundaries on the start position (X-axis), and $Vec_x$ is the list of boundary values. $N_y$ and $Vec_y$ represent the same scaling values for the end position (Y-axis).

A histogram is a uniform histogram when $N_x = N_y$ and $Vec_x[i] = Vec_y[i]$ for all $0 \le i \le N_x$.
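One way to hold such a scale in memory is sketched below. The class name `Scale` and its fields are hypothetical, and taking the number of boundaries to be the length of the boundary list is our assumption; the uniformity test follows the condition stated above (identical boundaries on both axes).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scale:
    vec_x: Tuple[float, ...]  # boundaries on the start-position axis, ascending
    vec_y: Tuple[float, ...]  # boundaries on the end-position axis, ascending

    @property
    def n_x(self) -> int:
        # Assumption: the boundary count is the length of the boundary list.
        return len(self.vec_x)

    @property
    def n_y(self) -> int:
        return len(self.vec_y)

    def is_uniform(self) -> bool:
        """Same number of boundaries and same boundary values on both axes."""
        return self.vec_x == self.vec_y


uniform = Scale((0, 10, 20), (0, 10, 20))
non_uniform = Scale((0, 5, 20), (0, 10, 15, 20))
```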

Even when the two histograms involved in estimating the answer size of a pair join are both uniform, the technique for estimation using non-uniform histograms is still needed if their scales are not exactly the same.

Example 7.1 The histogram shown in Figure 18 (a) is a non-uniform histogram. The numbers of boundaries on the start position and the end position are not the same, and the intervals between the boundaries are not uniform. Note that with non-uniform boundaries, the off-diagonal grid cells are no longer square but rectangular, and the on-diagonal grid cells are no longer triangular.

7.2. Transformation Technique

Given two histograms with different scales, the estimation technique stated above cannot be applied directly, since the grid cells from the different histograms no longer overlap exactly. To estimate the answer size of the pair join, we need to transform both histograms into histograms with the same scale. Then, the estimation techniques introduced in the sections above can be used to obtain the answer size estimate.

Definition 7.2 The intersection scale ($sc$) of a set of histograms $setHist$ is a vector $(N_x, Vec_x, N_y, Vec_y)$ which satisfies: for each $Hist \in setHist$, $Hist.Vec_x \subset sc.Vec_x \wedge Hist.Vec_y \subset sc.Vec_y$.

Definition 7.3 For a given predicate $P$, the position histogram associated with the predicate, which is stored as the summary information of the data in the database, is called the original histogram ($oHist(P)$). A histogram which is equivalent to the original histogram, but in some intersection scale, is called a derived histogram ($dHist_{sc}(P)$).

Given a set of original histograms, the intersection scale can be found easily by merging their respective scales. The problem now is how to find the derived histogram of a given histogram, such that the derived histogram is at the intersection scale and is an approximation of the original histogram.
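Merging the scales amounts to taking the union of the boundary lists on each axis. A minimal sketch, assuming each scale is given as a hypothetical pair `(vec_x, vec_y)` of boundary lists and that the smallest such union is the scale wanted:

```python
def merge_boundaries(*vectors):
    """Sorted union of several boundary lists."""
    return sorted(set().union(*vectors))


def intersection_scale(scales):
    """Return (N_x, Vec_x, N_y, Vec_y) containing every input boundary.

    `scales` is an iterable of (vec_x, vec_y) pairs; this representation
    is an assumption made for the sketch.
    """
    scales = list(scales)
    vec_x = merge_boundaries(*(s[0] for s in scales))
    vec_y = merge_boundaries(*(s[1] for s in scales))
    return (len(vec_x), vec_x, len(vec_y), vec_y)


scale_a = ([0, 10, 20], [0, 10, 20])
scale_b = ([0, 5, 20], [0, 15, 20])
sc = intersection_scale([scale_a, scale_b])
```

Every boundary of every input histogram appears in the result, so each original grid cell maps onto a whole number of cells in the new scale.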

Lemma 3.1 states that the appearance of nodes in one grid cell implies a zero count in some other grid cells. Here, we generalize the lemma to the non-uniform scenario.

Theorem 7.1 In a position histogram for any predicate, with scale $(N_x, Vec_x, N_y, Vec_y)$, a non-zero count in grid cell $(x_1, y_1, x_2, y_2)$ implies a zero count in each grid cell $(x'_1, y'_1, x'_2, y'_2)$ with (a) $x'_2 < x_1$, $y'_1 > x_1$ and $y'_2 < y_1$, or (b) $x'_1 > x_2$, $x'_2 < y_2$ and $y'_1 > y_2$.

This theorem can be further generalized to a set of histograms built on the same XML database.

Theorem 7.2 Given a set of position histograms built on an XML database, a non-zero count in grid cell $(x_1, y_1, x_2, y_2)$ implies a zero count in each grid cell $(x'_1, y'_1, x'_2, y'_2)$ with (a) $x'_2 < x_1$, $y'_1 > x_1$ and $y'_2 < y_1$, or (b) $x'_1 > x_2$, $x'_2 < y_2$ and $y'_1 > y_2$, in all the histograms in the set.
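The two conditions of the theorem translate directly into a predicate over grid cells. The sketch below is a literal transcription, not a verified implementation: the names `Cell` and `is_forbidden` are hypothetical, and we read the tuple $(x_1, y_1, x_2, y_2)$ as a cell spanning start positions $x_1$ to $x_2$ and end positions $y_1$ to $y_2$, which is our interpretation of the notation.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    x1: float  # lower start-position boundary (assumed reading)
    y1: float  # lower end-position boundary
    x2: float  # upper start-position boundary
    y2: float  # upper end-position boundary


def is_forbidden(nonzero: Cell, other: Cell) -> bool:
    """True if a non-zero count in `nonzero` forces `other` to be empty."""
    cond_a = (
        other.x2 < nonzero.x1
        and other.y1 > nonzero.x1
        and other.y2 < nonzero.y1
    )
    cond_b = (
        other.x1 > nonzero.x2
        and other.x2 < nonzero.y2
        and other.y1 > nonzero.y2
    )
    return cond_a or cond_b


occupied = Cell(20, 60, 30, 70)
starts_before_ends_inside = Cell(0, 30, 10, 40)   # matches condition (a)
starts_inside_ends_after = Cell(40, 80, 50, 90)   # matches condition (b)
enclosing = Cell(0, 80, 10, 90)                   # possible ancestors
enclosed = Cell(40, 45, 42, 50)                   # possible descendants
```

Cells that could hold ancestors or descendants of the occupied cell's nodes are left alone; only cells whose nodes would partially overlap are ruled out.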

Given a set of original histograms, the derived histograms at the intersection scale can be found iteratively, following the pseudo-code shown in Figure 19.

This algorithm takes two histograms with different scale definitions, finds the minimum-common scale, and transforms them into derived histograms of that scale. It can be generalized to handle more histograms by going through the same process for each histogram in the loop.

The following is a very simple example, which illustrates how this transformation is done using the algorithm in Figure 19.

Example 7.2 Consider Histogram A and Histogram B, with different scale definitions, as shown in Figure 18 (a) and Figure 18 (c). Assume that two grid cells in Histogram A have non-zero counts, A1 and A2. The grid cell with a non-zero count in Histogram B is B1.

First, the minimum-common scale is computed by combining the scale definitions of Histograms A and B. The derived histograms should be in this new scale, as shown in Figure 18 (b) and Figure 18 (d). We map the non-zero grid cells in both original histograms into the new scale.

In Histogram A', non-zero grid cell A1 introduces one forbidden region, F-A1. We map F-A1 to Histogram B' and shrink the non-zero block to B1'.

In Histogram B', the new non-zero block B1' introduces two forbidden regions. Each of these two forbidden regions intersects with one of the non-zero blocks in Histogram A'. After cutting off the zero-count grid cells, the two non-zero blocks in Histogram A' also shrink.

[Figure 18: four panels, each plotting End Position against Start Position. (a) Original Histogram A, with labels A1, A2, and F-A1. (b) Derived Histogram A', with labels A1' and A2'. (c) Original Histogram B, with labels B1 and F-B1. (d) Derived Histogram B', with labels B1' and F-B1' (twice).]

Figure 18. Transforming Non-Uniform Histograms to Uniform Histograms.

Algorithm HistogramTransform(HistA, HistB, HistA', HistB')
// Inputs: two original histograms HistA and HistB
// Outputs: two derived histograms HistA' and HistB'

Combine the scales of HistA and HistB to define the intersection scale sc.
Initialize HistA' and HistB' with scale sc.

for each non-zero grid cell in HistA (HistB)
    mark the set of grid cells in HistA' (HistB') that are covered by
    the grid cell in HistA (HistB). (We call such a set a block here.)

while (1)
{
    Bool no_change = true;
    for each non-zero block blkA in HistA'
    {
        find its forbidden regions F_blkA;
        for each non-zero block blkB in HistB'
        {
            find the intersection of blkB with F_blkA;
            if the intersection is not empty
            {
                mark the grid cells in the intersection zero;
                if the range of blkB can be shrunk
                    shrink the range of blkB;
                set no_change = false;
            }
        }
    }
    for each non-zero block blkB in HistB'
        go through a similar process as above;
    if no_change
        break from the loop;
}

for each non-zero block blkA in HistA' (blkB in HistB')
{
    if the block contains more than one grid cell
        divide the node count associated with the block among those grid cells
        according to their area;
    otherwise
        set the node count of the grid cell to the node count
        associated with the block;
}

return HistA', HistB';

Figure 19. Algorithm for Transforming Two Non-Uniform Histograms to Histograms with the Same Scale
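The last step of the algorithm in Figure 19, dividing a block's node count among its grid cells in proportion to their area, can be sketched as below. This is a simplified illustration under stated assumptions: the function name `split_by_area` is hypothetical, every cell is treated as a full rectangle (the truncated shape of on-diagonal cells is ignored), and the forbidden-region shrinking step is assumed to have been applied already.

```python
from bisect import bisect_left


def split_by_area(cell, count, vec_x, vec_y):
    """Distribute `count` over the finer cells that tile `cell`.

    cell  : (x1, y1, x2, y2) boundaries of the original grid cell
    vec_x : sorted start-position boundaries of the intersection scale
    vec_y : sorted end-position boundaries of the intersection scale
    Both boundary lists must contain the cell's own boundaries.
    """
    x1, y1, x2, y2 = cell
    xs = vec_x[bisect_left(vec_x, x1): bisect_left(vec_x, x2) + 1]
    ys = vec_y[bisect_left(vec_y, y1): bisect_left(vec_y, y2) + 1]
    total_area = (x2 - x1) * (y2 - y1)
    derived = {}
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            area = (xb - xa) * (yb - ya)
            derived[(xa, ya, xb, yb)] = count * area / total_area
    return derived


# A cell holding 12 nodes is refined by one extra boundary on each axis.
pieces = split_by_area((0, 20, 10, 40), 12, [0, 5, 10, 20], [20, 30, 40])
```

Splitting by area keeps the total node count unchanged, which is what makes the derived histogram an approximation of the original rather than a new summary.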

8. Related Work

In [5], estimation techniques are suggested for “twig” pattern queries in a hierarchical database, including XML. These techniques generalize the work on pruned suffix trees presented in [16,10] and the notion of set hashing [3,6]. While powerful where they apply, they suffer from some limitations. First, the techniques only apply to fully specified twig patterns involving only parent-child links. However, one expects many, perhaps even the majority, of XML queries to involve patterns with ancestor-descendant links. Second, the computation techniques only provide the selectivity estimate for the entire query pattern. If estimates are required for sub-patterns, representing intermediate results in a potential query plan, these have to be computed separately. Finally, the entire technique relies on notions of pruning in suffix trees, and on maintaining small hashes for set similarity. These space-saving solutions necessarily lose information, and much ingenuity is used to minimize this loss. In contrast, our techniques are insensitive to the depth of the tree, require no pruning, and do not admit the possibility of non-local information loss. (However, our techniques are sensitive to the size of the “symbol alphabet”, and the techniques in [5] are probably more appropriate if a large number of basic node predicates are required.)

McHugh and Widom [13] describe Lore’s cost-based query optimizer, which maintains statistics about subpaths of length ≤ k and uses them to infer selectivity estimates for longer path queries. Estimating the selectivity of path queries has also been the focus of a recent paper by Aboulnaga et al. [1], in which the authors propose two techniques for estimating the selectivity of path expressions. The first technique, called path trees, is similar to the pruned suffix trees of [5], but is more accurate for estimating the selectivity of certain path expressions. The second technique uses a Markov table to maintain statistics about all paths up to a certain length. The Markov table approach is similar to [13], but can be aggressively summarized, thereby reducing the amount of memory used to maintain statistics. The techniques presented in these two papers do not maintain correlations between paths, and consequently cannot accurately estimate the selectivity of tree query patterns, which are very natural in XML query languages.

Histograms of various types, including multi-dimensional histograms, have been used for query estimation in databases [14,12,7–9]. However, XML queries often involve ancestor-descendant or parent-child relationships among nodes. Traditional one-dimensional histograms are not enough to capture the position information of each node, the relationships among nodes, and other structural information of the XML data. Therefore, a novel histogram is introduced here that captures the structural information native to XML data and estimates the result size effectively and accurately.

9. Conclusions and Future Work

As XML continues to grow in popularity, large repositories of XML documents are likely to emerge, and users are likely to pose complex queries on these data sets. Efficient evaluation of these complex queries will require accurate estimates. Queries in XML frequently specify structural patterns that require specific relationships between the selected elements. Obtaining accurate estimates for these is not easy by traditional means. In this paper we have proposed a novel histogram technique called the position histogram, together with estimation algorithms using position histograms, that can be used to accurately estimate the answer size for arbitrarily complex pattern queries. While the histograms we develop are two-dimensional, they are sparse and only require storage that grows linearly (rather than quadratically) with grid size. The estimation algorithms are computationally efficient and require only a very small running time.

In many cases, schema information may be available, and frequently it can be used to set an estimate to zero or (through uniqueness) equal to some other, simpler estimate. We identify one specific schema restriction that occurs frequently in XML, namely the no-overlap property. We exploit this property in a modified estimation algorithm, which produces estimates that are more accurate than the estimates produced without factoring in this schema information. An open question is which other schema information can be considered together with the position histogram to further improve the accuracy.

Extensive experimental evaluation using both real and synthetic data sets demonstrates the effectiveness of the proposed techniques for different types of queries, simple or complex, and on XML documents of different structures, shallow or deep and nested. The summary data structures and estimation techniques developed in this paper are an important piece of the query optimizer in the TIMBER [21] native XML database system under development at the University of Michigan.

Theoretical and experimental studies have also been done on how to exploit the estimation technique, using both position histograms and coverage histograms, to estimate the answer size for query patterns that are arbitrarily complex. Issues in estimation for queries with ordered semantics and queries with value-based join operations have also been looked into.

REFERENCES

1. A. Aboulnaga, A.R. Alameldeen and J.F. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. VLDB, 2001.

2. T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible markup language (XML) 1.0. W3C Recommendation. Available at http://www.w3.org/TR/1998/REC-xml-19980210, Feb. 1998.

3. A. Broder. On the Resemblance and Containment of Documents. IEEE SEQUENCES ’97, pages 21–29, 1998.

4. D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Simon and M. Stefanescu. XQuery 1.0: An XML Query Language. W3C Working Draft, http://www.w3.org/TR/xquery/, June 7, 2001.

5. Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R.T. Ng, D. Srivastava. Counting Twig Matches in a Tree. ICDE, 2001.

6. Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity estimation for boolean queries. In Proceedings of the ACM Symposium on Principles of Database Systems, 2000.

7. Yannis E. Ioannidis. Universality of Serial Histograms. In VLDB, pages 256-267, 1993.

8. Y.E. Ioannidis, V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD Conference, pages 233-244, 1995.

9. H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K.C. Sevcik, T. Suel. Optimal Histograms with Quality Guarantees. VLDB, pages 275-286, 1998.

10. H. V. Jagadish, O. Kapitskaia, R. T. Ng, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. In VLDB Journal, 9(3), pp. 214–230, 2000.

11. H. V. Jagadish, L. V. S. Lakshmanan, T. Milo, D. Srivastava, and D. Vista. Querying network directories. In Proceedings of the ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.

12. R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, March 1990.

13. J. McHugh and J. Widom. Query optimization for XML. In Proceedings of the International Conference on Very Large Databases, pages 315–326, 1999.

14. M. Muralikrishna and D.J. DeWitt. Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries. In SIGMOD Conference, pages 28-36, 1988.

15. A.R. Schmidt, F. Waas, M.L. Kersten, D. Florescu, I. Manolescu, M.J. Carey and R. Busse. The XML Benchmark Project. Technical Report INS-R0103, CWI, Amsterdam, The Netherlands, April 2001.

16. M. Wang, J. S. Vitter, and B. Iyer. Selectivity estimation in the presence of alphanumeric correlations. In Proceedings of the IEEE International Conference on Data Engineering, pages 169–180, 1997.

17. C. Zhang, J.F. Naughton, D.J. DeWitt, Q. Luo and G.M. Lohman. On Supporting Containment Queries in Relational Database Management Systems. SIGMOD, 2001.

18. DBLP data set. Available at http://www.informatik.uni-trier.de/ley/db/index.html.

19. ibiblio Organization. XML dataset for Shakespeare drama. Available at http://sunsite.unc.edu/pub/sun-info/xml/eg/shakespeare.1.10.xml.zip.

20. IBM. XML generator. Available at http://www.alphaworks.ibm.com/tech/xmlgenerator.

21. TIMBER Group. TIMBER Project at Univ. of Michigan. Available at http://www.eecs.umich.edu/db/timber/.