1

Indexing and Querying XML Data for Regular Path Expressions

A Paper by Quanzhong Li and Bongki Moon

Presented by Amnon Shochot

2

Our Objective

• Developing a system that will enable us to perform XML data queries efficiently.

3

XML Query Languages

• Used for retrieving data from XML files.

• Use a regular path expression syntax.

• e.g. XPath, XQuery.

4

Queries Today - Inefficient

• Usually done by XML tree traversals, which are inefficient.

– Top-Down Approach

– Bottom-Up Approach

– An example:

the query:

/chapter/_*/figure

(finding all figures in all chapters.)

5

Our Objective - Refined

• Developing a system that will enable us to perform XML data queries efficiently.

• Developing such a system consists of:

– Developing a way to efficiently store XML data.

– Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).

6

Storing XML Documents

• Question: What would we need from a data structure to be able to perform an efficient query?

• Answer: A mechanism for:

– Efficiently finding all elements/attributes with a given name.

– Efficiently finding all values with a given name.

– Efficiently resolving ancestor-descendant relationships.

7

Storing XML Documents - XISS

• XISS - XML Indexing and Storage System.

• Provides us with ways to:

– efficiently find all elements or attributes with the same name string, grouped by the document they belong to.

– quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.

8

Determining Ancestor-Descendant Relationship

• According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.

• Example:
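A minimal Python sketch of the preorder/postorder test stated above. The toy tree and the names `TREE`, `number_tree`, and `is_ancestor` are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy tree: each node maps to its ordered list of children.
TREE: Dict[str, List[str]] = {
    "book": ["chapter1", "chapter2"],
    "chapter1": ["figure1"],
    "chapter2": [],
    "figure1": [],
}


def number_tree(tree: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Return {node: (preorder rank, postorder rank)} for every node."""
    pre: Dict[str, int] = {}
    post: Dict[str, int] = {}

    def visit(node: str) -> None:
        pre[node] = len(pre) + 1       # numbered on the way down
        for child in tree[node]:
            visit(child)
        post[node] = len(post) + 1     # numbered on the way back up

    visit(root)
    return {node: (pre[node], post[node]) for node in pre}


def is_ancestor(ranks: Dict[str, Tuple[int, int]], x: str, y: str) -> bool:
    """x is an ancestor of y iff x is earlier in preorder and later in postorder."""
    (pre_x, post_x), (pre_y, post_y) = ranks[x], ranks[y]
    return pre_x < pre_y and post_x > post_y


RANKS = number_tree(TREE, "book")
print(RANKS["book"], RANKS["figure1"])            # (1, 4) (3, 1)
print(is_ancestor(RANKS, "book", "figure1"))      # True
print(is_ancestor(RANKS, "chapter2", "figure1"))  # False
```

The test itself is two integer comparisons, but the ranks come from a full traversal. Inserting a node shifts the ranks of many others.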

9

Determining Ancestor-Descendant Relationship – cont.

• Advantage: the ancestor-descendant relationship can be determined in constant time.

• Disadvantage: a lack of flexibility.

– e.g. inserting a new node requires renumbering many tree nodes.

10

• A new numbering scheme:

– Each node is associated with an <order, size> pair.

• For a tree node y and its parent x:

[order(y), order(y) + size(y)] ⊆ (order(x), order(x) + size(x)]

(the parent's interval is exclusive at its left end, so order(x) < order(y)).

• For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then:

order(x) + size(x) < order(y).

11


• Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:

order(x) < order(y) ≤ order(x) + size(x)
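The <order, size> test can be sketched in Python as follows. The labeling routine, its `slack` parameter (spare numbers reserved inside every node's interval), and the toy tree are assumptions made for illustration. The source only requires that the interval conditions hold, not that labels be assigned this way.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Label:
    order: int
    size: int


def is_ancestor(x: Label, y: Label) -> bool:
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    return x.order < y.order <= x.order + x.size


def assign_labels(tree: Dict[str, List[str]], root: str, slack: int = 0) -> Dict[str, Label]:
    """Label nodes in preorder; 'slack' reserves unused numbers per node (assumption)."""
    labels: Dict[str, Label] = {}

    def visit(node: str, order: int) -> int:
        next_order = order + 1
        for child in tree[node]:
            child_size = visit(child, next_order)
            next_order += child_size + 1   # next sibling starts past this child's interval
        size = (next_order - order - 1) + slack
        labels[node] = Label(order, size)
        return size

    visit(root, 1)
    return labels


# Hypothetical toy tree: node -> ordered children.
TREE = {"chapter": ["title", "section"], "title": [], "section": ["figure"], "figure": []}

TIGHT = assign_labels(TREE, "chapter")           # no spare numbers
LOOSE = assign_labels(TREE, "chapter", slack=2)  # two spare numbers per node

# With slack, a new child of "section" fits into unused numbers, no relabeling needed.
NEW_NODE = Label(order=9, size=0)
print(TIGHT["chapter"], TIGHT["section"])        # Label(order=1, size=3) Label(order=3, size=1)
print(is_ancestor(LOOSE["section"], NEW_NODE))   # True
```

Reserving spare numbers is what lets most insertions proceed without touching existing labels. When a node's interval is exhausted, relabeling is still required.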

12


• Properties:

– the ancestor-descendant relationship can be determined in constant time.

– flexibility: node insertion usually doesn't require renumbering tree nodes.

– an element can be uniquely identified in a document by its order value.

13

XISS System Overview

14

XISS System Overview

• How the system works:

– XML documents are loaded into the XISS system.

– These documents are added to the XISS data structures.

• Each document is assigned a document id (did).

• Index structures are organized as paged files for efficient disk I/O.

– When a query is performed, the query processor interacts with XISS to obtain the information required for the query.

15

XISS - cont.

• XISS consists of 5 components:

– Name Index

– Value Table

– Element Index

– Attribute Index

– Structure Index

16

Name Index and Value Table

• Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.

• Name Index - mapping distinct name strings into unique name identifiers (nid).

• Value Table - mapping distinct value strings (i.e. attribute value and text value) into unique value identifiers (vid).

• Both are implemented as B+-trees.

17

The Element Index

• Objective: quickly finding all elements with the same name string.

• Structure:

18

The Element Index – cont.

• Structure:

– B+-tree using nid as a key.

– Leaf nodes: pointers to a set of records for elements (or attributes) having an identical name string, grouped by the document they belong to.

– Element Record = {<order,size>, Depth, Parent ID}

• where Depth is the depth of the element in the XML tree.

– Element Records are ordered by <order,size>.
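An in-memory Python sketch of the lookup the Element Index supports. It is an assumption-laden stand-in: plain dictionaries replace the B+-tree and paged files, the Parent ID is taken to be the parent's order value, and the class and method names are hypothetical.

```python
from bisect import insort
from collections import defaultdict
from typing import Dict, List, NamedTuple


class ElementRecord(NamedTuple):
    order: int
    size: int
    depth: int
    parent_order: int   # assumption: "Parent ID" is the parent's order value (0 = none)


class ElementIndex:
    """nid -> did -> records kept sorted by <order, size> (dicts stand in for the B+-tree)."""

    def __init__(self) -> None:
        self._nids: Dict[str, int] = {}   # toy name index: name string -> nid
        self._records: Dict[int, Dict[int, List[ElementRecord]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def nid(self, name: str) -> int:
        """Return the name identifier, allocating a new one for an unseen name."""
        return self._nids.setdefault(name, len(self._nids) + 1)

    def add(self, name: str, did: int, record: ElementRecord) -> None:
        insort(self._records[self.nid(name)][did], record)   # keeps the list ordered

    def lookup(self, name: str) -> Dict[int, List[ElementRecord]]:
        """All elements with this name, grouped by document id."""
        nid = self._nids.get(name)
        if nid is None:
            return {}
        return {did: list(recs) for did, recs in self._records[nid].items()}


index = ElementIndex()
index.add("figure", 1, ElementRecord(7, 0, 2, 6))
index.add("figure", 1, ElementRecord(4, 0, 3, 3))   # inserted out of order on purpose
index.add("figure", 2, ElementRecord(2, 0, 1, 1))
index.add("chapter", 1, ElementRecord(2, 3, 1, 1))
print(index.lookup("figure"))
```

Keeping each per-document list sorted by order is what later allows the join algorithms to merge inputs without a separate sorting pass.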

19

The Attribute Index

• Objective: quickly finding all attributes with the same name string.

• Structure:

– Same structure as the Element Index, except that each record in the Attribute Index also has a value identifier (vid), which is the key used to obtain the attribute value from the Value Table.

20

The Structure Index

• Objectives:

– Finding the parent element and child elements (or attributes) for a given element.

– Finding the parent element for a given attribute.

• Structure:

21

The Structure Index – cont.

• Structure:

– B+-tree using document identifier (did) as a key.

– Leaf nodes: linear arrays with records for all elements and attributes from an XML document.

– Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.

– Records are ordered by order value.
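A rough Python sketch of how one leaf array of the Structure Index could answer parent and child lookups. The slides do not spell out the pointer fields, so their meaning here is an assumption: Child order points to the first child element, Sibling order to the next sibling, and Attribute order to the first attribute. All names are hypothetical.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Optional


class StructRecord(NamedTuple):
    nid: int
    order: int
    size: int
    parent: Optional[int]      # order of the parent, None for the root
    child: Optional[int]       # order of the first child element (assumed meaning)
    sibling: Optional[int]     # order of the next sibling (assumed meaning)
    attribute: Optional[int]   # order of the first attribute (assumed meaning)


class DocumentStructure:
    """One leaf array of the structure index: records sorted by order value."""

    def __init__(self, records: List[StructRecord]) -> None:
        self.records = sorted(records, key=lambda r: r.order)
        self.orders = [r.order for r in self.records]

    def get(self, order: int) -> StructRecord:
        i = bisect_left(self.orders, order)   # binary search on the order value
        if i < len(self.records) and self.orders[i] == order:
            return self.records[i]
        raise KeyError(order)

    def parent_of(self, order: int) -> Optional[StructRecord]:
        parent = self.get(order).parent
        return None if parent is None else self.get(parent)

    def children_of(self, order: int) -> List[StructRecord]:
        result = []
        nxt = self.get(order).child
        while nxt is not None:                # follow the sibling chain
            rec = self.get(nxt)
            result.append(rec)
            nxt = rec.sibling
        return result


# Hypothetical document: book<1,4> -> title<2,0>, chapter<3,1> -> figure<4,0>, chapter<5,0>
doc = DocumentStructure([
    StructRecord(1, 1, 4, None, 2, None, None),
    StructRecord(2, 2, 0, 1, None, 3, None),
    StructRecord(3, 3, 1, 1, 4, 5, None),
    StructRecord(4, 4, 0, 3, None, None, None),
    StructRecord(3, 5, 0, 1, None, None, None),
])
structure_index = {1: doc}   # did -> leaf array (a dict stands in for the B+-tree)
print([r.order for r in structure_index[1].children_of(1)])   # [2, 3, 5]
```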

22

Querying Method

• Decomposing path expressions into simple path expressions.

• Applying algorithms on simple path expressions and their intermediate results.

23

Decomposition of Path Expressions

• The main idea:

– A complex path expression is decomposed into several simple path expressions.

– Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.

– The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.

24

Basic Subexpressions - Example

Decomposition of

(E1/E2)*/ E3 / ((E4[@a=V]) | (E5/_*/E6)):

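(1) Single Element/Attribute

(2) Element-Attribute

(3) Element-Element

(4) Kleene Closure

(5) Union

(Figure: decomposition tree of the expression. Each node is labeled with the type (1)-(5) of its subexpression: the seven leaves are of type (1), and the internal nodes comprise four of type (3) and one each of types (2), (4), and (5).)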

25

Basic Subexpressions

5 basic subexpressions:

(1) A subexpression with a single element or a single attribute.

(2) A subexpression with an element and an attribute.

• e.g. figure[@caption = "Tree Frogs"]

(3) A subexpression with two elements.

• e.g. chapter/_*/figure, where '_' denotes any kind of node.

26

Basic Subexpressions - cont.

5 basic subexpressions - cont.:

(4) A subexpression that is a Kleene closure (+,*) of another subexpression.

(5) A subexpression that is a union of two other subexpressions.
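To make the five kinds concrete, the sketch below builds the decomposition of (E1/E2)*/E3/((E4[@a=V])|(E5/_*/E6)) by hand as a small Python expression tree and counts the nodes of each kind. The class names are hypothetical, no parser is implemented, and the grouping of the top-level "/" operators is an assumption (the counts do not depend on it).

```python
from collections import Counter
from dataclasses import dataclass, fields, is_dataclass


@dataclass(frozen=True)
class Single:                 # kind (1): a single element or attribute
    name: str


@dataclass(frozen=True)
class ElementAttribute:       # kind (2): element with an attribute test
    element: Single
    attribute: Single
    value: str


@dataclass(frozen=True)
class ElementElement:         # kind (3): two element subexpressions joined by a path
    left: object
    right: object
    any_depth: bool = False   # True for E/_*/F


@dataclass(frozen=True)
class KleeneClosure:          # kind (4)
    inner: object


@dataclass(frozen=True)
class UnionExpr:              # kind (5)
    left: object
    right: object


def count_kinds(expr, counts=None) -> Counter:
    """Walk the tree and count how many subexpressions of each kind it contains."""
    counts = Counter() if counts is None else counts
    counts[type(expr).__name__] += 1
    for f in fields(expr):
        child = getattr(expr, f.name)
        if is_dataclass(child):            # skip plain strings and booleans
            count_kinds(child, counts)
    return counts


QUERY = ElementElement(
    ElementElement(KleeneClosure(ElementElement(Single("E1"), Single("E2"))), Single("E3")),
    UnionExpr(
        ElementAttribute(Single("E4"), Single("a"), "V"),
        ElementElement(Single("E5"), Single("E6"), any_depth=True),
    ),
)
COUNTS = count_kinds(QUERY)
print(dict(COUNTS))
```

Each internal node corresponds to one join, closure, or union step, and each leaf to one index lookup.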

27

3 Algorithms

• 3 Algorithms:

– EA-Join: Element and Attribute Join.

– EE-Join: Element and Element Join.

– Kleene Closure.

28

EA-Join: Element and Attribute Join

Input:

{E1,…,Em}: Ei is a set of elements having a common document identifier (did);

{A1,…,An}: Aj is a set of attributes having a common document identifier (did);

Output:

A set of (e,a) pairs such that the element e is the parent of the attribute a.

29

EA-Join: Element and Attribute Join

The Algorithm:

// Sort-merge {Ei} and {Aj} by did.

(1) foreach Ei and Aj with the same did do:

// Sort-merge Ei and Aj by

// PARENT-CHILD relationship

(2) foreach e ∈ Ei and a ∈ Aj do

(3) if (e is a parent of a) then output (e,a)

end

end
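The pseudocode above can be sketched in Python as follows. This is one possible reading, not the authors' implementation. Inputs are assumed to be dictionaries from did to lists already sorted by order, each node is assumed to store its parent's order, and attribute values are kept as plain strings instead of vids. The names `Node` and `ea_join` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    order: int
    size: int
    parent: Optional[int]         # order value of the parent (assumed record field)
    value: Optional[str] = None   # attribute value; a plain string stands in for the vid


def ea_join(
    elements: Dict[int, List[Node]],
    attributes: Dict[int, List[Node]],
) -> List[Tuple[int, Node, Node]]:
    """Return (did, e, a) triples such that element e is the parent of attribute a."""
    out = []
    # First merge: only documents present in both inputs can contribute.
    for did in sorted(elements.keys() & attributes.keys()):
        es, attrs = elements[did], attributes[did]
        i = 0
        # Second merge: one forward scan over both order-sorted lists.
        for a in attrs:
            while i < len(es) and es[i].order < a.parent:
                i += 1
            if i < len(es) and es[i].order == a.parent:
                out.append((did, es[i], a))
    return out


# Hypothetical input: an outer Ele<1,3> with Att<2,0>="A1" and a nested Ele<3,1> with Att<4,0>="A2".
ELEMENTS = {1: [Node(1, 3, None), Node(3, 1, 1)]}
ATTRIBUTES = {1: [Node(2, 0, 1, "A1"), Node(4, 0, 3, "A2")]}

PAIRS = ea_join(ELEMENTS, ATTRIBUTES)
MATCHES = [e for _, e, a in PAIRS if a.value == "A1"]
print(MATCHES)   # [Node(order=1, size=3, parent=None, value=None)]
```

The single forward scan relies on attributes being numbered directly after their parent element and before any child elements.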

30

EA-Join – Example

• Consider the XML document:

<Ele Att="A1">

  <Ele Att="A2"> </Ele>

</Ele>

• And the query: /Ele[@Att="A1"]

(Tree figure, nodes labeled with <order,size>: Ele <1,3> has attribute Att <2,0> and child Ele <3,1>, which has attribute Att <4,0>.)

31

EA-Join – Querying /Ele[@Att="A1"]

<Ele Att="A1">

  <Ele Att="A2"> </Ele>

</Ele>

(Tree figure, nodes labeled with <order,size>: Ele <1,3> has attribute Att <2,0> and child Ele <3,1>, which has attribute Att <4,0>.)

• Sort-merging the "Ele" elements and "Att" attributes by parent-child relationship gives the list: <1,3>, <2,0>, <3,1>, <4,0>

• From the resulting list, finding the "Ele" elements that have a child attribute "Att" with the value "A1" is easy using the information in the Element Record.

32

EA-Join – Comments

• Only a two-stage sort-merge operation, without the additional cost of sorting:

– First merge: by did.

– Second merge: by examining the parent-child relationship.

• This merge is based on the order values of the elements and attributes as defined by the numbering scheme.

• Attributes should be placed before their sibling elements in the order of the numbering scheme.

– This guarantees that elements and attributes with the same did can be merged in a single scan.

33

EE-Join: Element and Element Join

Input:

{E1,…,Em} and {F1,…,Fn}: Ei or Fj is a set of elements having a common document identifier (did).

Output:

A set of (e,f) pairs such that element e is an ancestor of element f.

34

EE-Join: Element and Element Join

The Algorithm:

// Sort-merge {Ei} and {Fj} by did.

(1) foreach Ei and Fj with the same did do:

// Sort-merge Ei and Fj by the

// ANCESTOR-DESCENDANT relationship.

(2) foreach e ∈ Ei and f ∈ Fj do

(3) if (e is an ancestor of f) then output (e,f);

end

end
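A hedged Python sketch of the same join. It swaps in a binary search for the inner merge step, which is a simplification of the sort-merge in the pseudocode. It also assumes that inputs are dictionaries from did to order-sorted lists. The names `Elem` and `ee_join` are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Tuple


class Elem(NamedTuple):
    order: int
    size: int


def ee_join(
    E: Dict[int, List[Elem]],
    F: Dict[int, List[Elem]],
) -> List[Tuple[int, Elem, Elem]]:
    """Return (did, e, f) triples such that e is an ancestor of f."""
    out = []
    # First merge: match the groups that share a document id.
    for did in sorted(E.keys() & F.keys()):
        fs = F[did]
        f_orders = [f.order for f in fs]
        for e in E[did]:
            # Descendants of e satisfy order(e) < order(f) <= order(e) + size(e),
            # so in an order-sorted list they form one contiguous slice.
            lo = bisect_right(f_orders, e.order)
            hi = bisect_right(f_orders, e.order + e.size)
            out.extend((did, e, f) for f in fs[lo:hi])
    return out


# Hypothetical document 1: book<1,6> -> chapter<2,3> -> section<3,2> -> figure<4,0>, figure<5,0>;
#                          book<1,6> -> chapter<6,1> -> figure<7,0>
CHAPTERS = {1: [Elem(2, 3), Elem(6, 1)], 2: [Elem(1, 0)]}
FIGURES = {1: [Elem(4, 0), Elem(5, 0), Elem(7, 0)]}

RESULT = ee_join(CHAPTERS, FIGURES)   # evaluates chapter/_*/figure
print([(e.order, f.order) for _, e, f in RESULT])   # [(2, 4), (2, 5), (6, 7)]
```

Ancestors can nest, so one f may pair with several e's. The slice for each e may therefore overlap slices already visited.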

35

EE-Join – Comments

• Only a two-stage sort-merge operation, without the additional cost of sorting:

– First merge: by did.

– Second merge: by examining the ancestor-descendant relationship.

• The sets of elements with a matching did cannot be merged in a single scan.

36

Kleene Closure

Input:

{E1,…,Em}, where Ei is a group of elements from an XML document.

Output:

A Kleene closure of {E1,…,Em}.

37

Kleene Closure

The Algorithm:

(1) set i ← 1;

(2) set K_i^C ← {E1,…,Em};

(3) repeat

(4) set i ← i + 1;

(5) set K_i^C ← EE-Join(K_(i-1)^C, K_1^C);

until (K_i^C is empty);

(6) output the union of K_1^C, K_2^C, …, K_i^C;
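A small Python sketch of the iteration. The slides do not say how intermediate results are represented, so several things here are assumptions: each result is kept as a chain (tuple) of elements, the join step extends a chain with any base element that is a descendant of its last element, and only a single document is handled. All names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Elem(NamedTuple):
    order: int
    size: int


Chain = Tuple[Elem, ...]


def is_ancestor(x: Elem, y: Elem) -> bool:
    return x.order < y.order <= x.order + x.size


def join_step(chains: List[Chain], base: List[Elem]) -> List[Chain]:
    """EE-Join stand-in: extend each chain with every base element below its last element."""
    return [chain + (f,) for chain in chains for f in base if is_ancestor(chain[-1], f)]


def kleene_closure(base: List[Elem]) -> List[List[Chain]]:
    """Return [K1, K2, ...] where Ki holds the chains of i nested base elements."""
    levels = [[(e,) for e in base]]              # K1
    while True:
        nxt = join_step(levels[-1], base)        # Ki = EE-Join(K(i-1), K1)
        if not nxt:                              # stop at the first empty level
            return levels
        levels.append(nxt)


# Hypothetical nested sections: s<1,3> contains s<2,1> (which contains s<3,0>) and s<4,0>.
SECTIONS = [Elem(1, 3), Elem(2, 1), Elem(3, 0), Elem(4, 0)]
LEVELS = kleene_closure(SECTIONS)
CLOSURE = [chain for level in LEVELS for chain in level]   # union of all levels
print([len(level) for level in LEVELS])   # [4, 4, 1]
```

The loop terminates because every extension moves strictly deeper into a finite tree.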

38

Performance Experiments

• EE-Join:

• Results:

– Real World: an order of magnitude faster.

– Synthetic Data: 6 to 10 times faster.

39

Performance Experiments

• EA-Join:

• Results:

– Compared to Top-Down: better performance.

– Compared to Bottom-Up: no winner, close results.

40

Performance Results - Conclusions

• The proposed algorithms can achieve performance improvement over the conventional methods (top-down and bottom-up tree traversals) by up to an order of magnitude.