A Summary of XISS and Index Fabric

Ho Wai Shing

Contents
- Definition of Terms
- XISS (Li and Moon, VLDB 2001)
  - Numbering Scheme
  - Indices Stored
  - Join Algorithms
- Index Fabric (Cooper et al., VLDB 2001)
  - Patricia
  - Balanced Trie
  - Raw Path Index

Definition of Terms
- Absolute Path Expression (APE):
  - a path that starts from the root
  - each step traverses the child axis or the attribute axis
  - no wildcards
  - e.g., /, /A/B, /A/@C

Definition of Terms
- Regular Path Expression (RPE):
  - may or may not start from the root
  - may traverse different axes (restricted here to child, descendant-or-self and attribute, since they are the most commonly used ones)
  - may contain wildcards
  - e.g., //, /A//C, /A/_/B, //A/B//C/D/@E

XISS
- XISS = XML Indexing and Storage System
- by Li and Moon, published in VLDB 2001 under the title “Indexing and Querying XML Data for Regular Path Expressions”
- decomposes XML documents and stores them in its indices
- can answer regular path expressions

XISS - General Idea
- solves an RPE by decomposing it into these 5 basic subexpressions:
  - element retrieval
  - attribute retrieval
  - steps involving an element and an attribute
  - steps involving two elements
  - a Kleene closure of another subexpression

XISS - General Idea
- each subexpression is solved by its own method:
  - element index lookup
  - attribute index lookup
  - EA-join
  - EE-join
  - KC-join

XISS - General Idea
- the result lists from the subexpressions are joined to produce the final result
- to make this decomposition and join efficient, a fast way to determine ancestor-descendant relationships is needed
- XISS uses a numbering scheme based on an extended preorder traversal

XISS - Numbering Scheme
- number all the nodes with an <order, size> tuple
- order is assigned based on an extended preorder traversal
- size can be thought of as the size of the subtree rooted at that node

XISS - Numbering Scheme
- The rules for number assignment:
  - if x precedes y in the preorder traversal, x.order < y.order (preorder)
  - if x and y are siblings, either x.order + x.size < y.order or y.order + y.size < x.order (siblings won’t overlap)
  - if x is an ancestor of y, x.order < y.order <= x.order + x.size (ancestor contains descendant)
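The containment rule above makes the ancestor-descendant test a constant-time comparison. A minimal sketch, assuming nodes are represented by a hypothetical `Node` tuple holding only `order` and `size` (the sample numbers are made up for illustration):

```python
from typing import NamedTuple


class Node(NamedTuple):
    order: int  # position in the extended preorder numbering
    size: int   # width of the interval reserved for the subtree


def is_ancestor(x: Node, y: Node) -> bool:
    """x is an ancestor of y iff y.order falls inside x's interval."""
    return x.order < y.order <= x.order + x.size


def siblings_disjoint(x: Node, y: Node) -> bool:
    """Sibling intervals must not overlap."""
    return x.order + x.size < y.order or y.order + y.size < x.order


# Made-up numbering: root contains a and b; a contains c.
root = Node(1, 100)
a = Node(10, 30)
b = Node(41, 10)
c = Node(25, 5)

print(is_ancestor(root, c), is_ancestor(a, c), is_ancestor(b, c))
print(siblings_disjoint(a, b))
```

No tree traversal is needed: two integer comparisons decide the relationship, which is what makes the join algorithms later in the deck practical.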

XISS - Numbering Scheme
- Actual Assignment
  - uses heuristics to reserve some “space” between orders
  - reserves extra space in the sizes for future node insertions
  - attributes are placed before sibling elements
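One way to picture such an assignment is the sketch below. It is not the heuristic used by XISS: the fixed `slack` per node, the `(name, children)` tree encoding and the function name `number` are all assumptions made for illustration. Attributes are simply listed before element children so that they receive smaller orders.

```python
def number(tree, start=1, slack=2, labels=None):
    """Assign <order, size> in preorder, leaving `slack` spare numbers per node.

    `tree` is a (name, children) pair; names are assumed unique here so
    they can serve as dictionary keys. Returns the next free order.
    """
    if labels is None:
        labels = {}
    name, children = tree
    order = start
    nxt = order + 1
    for child in children:
        nxt = number(child, nxt, slack, labels)
    # The interval [order, order + size] covers every descendant plus
    # `slack` unused numbers reserved for future insertions.
    size = (nxt - 1 - order) + slack
    labels[name] = (order, size)
    return order + size + 1


# Made-up document: <A id="..."><B><C/></B><D/></A>
doc = ("A", [("@id", []), ("B", [("C", [])]), ("D", [])])
labels = {}
number(doc, labels=labels)
for name in ("A", "@id", "B", "C", "D"):
    print(name, labels[name])
```

With `slack=2` the root `A` receives `(1, 14)` and `B` receives `(5, 5)`, so a new node can be inserted under either without renumbering its neighbours, as long as the reserved space lasts.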

XISS - Index Organization
- There are 5 indices:
  - Name Index
  - Element Index
  - Attribute Index
  - Structure Index
  - Value Table

XISS - Name Index
- maps an element or attribute name to a name identifier (nid)
- the nid represents that element or attribute in further query evaluation
- reduces the time spent on string comparison in later index lookups
- stored in a B+-tree

XISS - Name Index
[Figure: name -> B+-tree -> nid]

XISS - Value Table
- stores all the string values of the XML document
[Figure: table with columns vid and value]

XISS - Element Index
- input: nid, output: list of element records
- implemented by a B+-tree
- the leaves point to a list of document IDs (did); each entry in that list points to a list of all elements with the same name in the same document

XISS - Element Index
[Figure: nid -> B+-tree -> did list -> element lists -> element records, each holding <order, size>, Depth, ParentID]

XISS - Attribute Index
- very similar to the element index
- each record always has a value identifier (vid)

XISS - Structure Index
- input: did, output: array containing all the elements and attributes in the document
- implemented by a B+-tree

XISS - Structure Index
[Figure: did -> B+-tree -> record array; each record holds nid, <order, size>, Parent order, Child order, Sibling order, Attribute order]

XISS - Indices
- When to use which index?
  - first use the Name Index to find the nid of the element/attribute being queried
  - search the Element/Attribute Index for the records
  - if values are needed, look them up in the Value Table
  - use the Structure Index to rebuild or traverse the XML document tree
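The lookup order above can be sketched with plain dictionaries standing in for the B+-trees. Every name, identifier and value below is made up for illustration, and the record layout is an assumption loosely following the fields shown on the index slides.

```python
# name -> nid (stand-in for the Name Index)
name_index = {"book": 1, "title": 2, "isbn": 3}

# nid -> did -> element records (stand-in for the Element Index)
element_index = {
    1: {"d1": [{"order": 1, "size": 20, "depth": 0}]},
    2: {"d1": [{"order": 5, "size": 2, "depth": 1}]},
}

# nid -> did -> attribute records, each carrying a vid (Attribute Index)
attribute_index = {
    3: {"d1": [{"order": 2, "size": 0, "depth": 1, "vid": 7}]},
}

# vid -> string value (Value Table)
value_table = {7: "X-123"}


def find_elements(name, did):
    """Name Index first, then the Element Index."""
    nid = name_index.get(name)
    if nid is None:
        return []
    return element_index.get(nid, {}).get(did, [])


def attribute_values(name, did):
    """Name Index, then Attribute Index, then Value Table for the strings."""
    nid = name_index.get(name)
    records = attribute_index.get(nid, {}).get(did, [])
    return [value_table[r["vid"]] for r in records]


print(find_elements("title", "d1"))
print(attribute_values("isbn", "d1"))
```

Only the last step touches string data; everything before it works on integer identifiers, which is the point of resolving names to nids first.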

XISS - Join Algorithms
- after getting the record lists for each subexpression, we need to find out which records answer the original query
- e.g., to evaluate /A/B, we retrieve a record list of all A elements and another of all B elements, and then have to find out which B’s satisfy A/B

XISS - Join Algorithms
- Three join algorithms are proposed:
  - EA-join - merges an element record list and an attribute record list (solves A/@B)
  - EE-join - merges two element record lists (solves A/B or A//B)
  - KC-join - self-merges an element record list (solves (E)*)

XISS - EA-join
- to solve E/@A
- input: an element record list and an attribute record list
- finds the attribute records whose parents are in the element record list
- the two lists are sorted by did and then by order

XISS - EA-join
- 2-stage sort-merge
  - group by did first
  - then merge using order
- output criterion: E is a parent of A
- a single scan of both lists is enough
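The single-scan merge can be sketched as follows. This is an illustrative reconstruction, not the algorithm as published: the `Rec` layout is hypothetical, and the parent test (interval containment plus a depth difference of one) is an assumption. The sketch relies on attributes being numbered before an element's child elements, so the parent of an attribute, if it is in the element list at all, is the last element that starts before it.

```python
from typing import List, NamedTuple, Tuple


class Rec(NamedTuple):
    did: int    # document ID
    order: int
    size: int
    depth: int


def is_parent(e: Rec, a: Rec) -> bool:
    """Assumed parent test: a lies in e's interval, exactly one level deeper."""
    return (e.did == a.did
            and e.order < a.order <= e.order + e.size
            and a.depth == e.depth + 1)


def ea_join(elements: List[Rec], attributes: List[Rec]) -> List[Tuple[Rec, Rec]]:
    """Both inputs must be sorted by (did, order); each list is scanned once."""
    out = []
    i = 0
    for a in attributes:
        # Move to the last element that starts before attribute a.
        while (i + 1 < len(elements)
               and (elements[i + 1].did, elements[i + 1].order) < (a.did, a.order)):
            i += 1
        if i < len(elements) and is_parent(elements[i], a):
            out.append((elements[i], a))
    return out


# Made-up records: document 1 has a nested pair of E elements.
elements = [Rec(1, 1, 10, 0), Rec(1, 4, 3, 1), Rec(2, 1, 5, 0)]
attributes = [Rec(1, 2, 0, 1), Rec(1, 5, 0, 2), Rec(1, 9, 0, 2), Rec(2, 2, 0, 1)]
pairs = ea_join(elements, attributes)
print(len(pairs))
```

The attribute at order 9 is dropped: it lies inside the outer E but belongs to some other element, which the depth check detects.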

XISS - EE-join
- to solve E/_*/E, e.g., E/E, E//E, E/_/E
- input: two element record lists, E and F
- output: pairs (e, f) where e is an ancestor of f
- also uses a 2-stage sort-merge
- however, the lists may need to be scanned multiple times (in special cases, e.g., when the document contains /A/A/B/B)
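The /A/A/B/B case can be reproduced with a small sketch. Again this is an illustration under assumptions rather than the published algorithm: `Rec` is a hypothetical record layout and `child_only` is an added flag that restricts the output to parent-child pairs. The inner loop restarts at the same position for every nested ancestor, which is the repeated scanning mentioned above.

```python
from typing import List, NamedTuple, Tuple


class Rec(NamedTuple):
    did: int    # document ID
    order: int
    size: int
    depth: int


def ee_join(E: List[Rec], F: List[Rec], child_only: bool = False) -> List[Tuple[Rec, Rec]]:
    """Both inputs must be sorted by (did, order)."""
    out = []
    j = 0
    for e in E:
        # Skip F records at or before e; they precede every later e as well.
        while j < len(F) and (F[j].did, F[j].order) <= (e.did, e.order):
            j += 1
        # Scan the F records inside e's interval. For nested e's this
        # part of F is visited again.
        k = j
        while k < len(F) and F[k].did == e.did and F[k].order <= e.order + e.size:
            if not child_only or F[k].depth == e.depth + 1:
                out.append((e, F[k]))
            k += 1
    return out


# Made-up numbering for a document shaped like /A/A/B/B.
A1, A2 = Rec(1, 1, 10, 0), Rec(1, 2, 8, 1)
B1, B2 = Rec(1, 3, 5, 2), Rec(1, 4, 2, 3)
print(ee_join([A1, A2], [B1, B2]))
print(ee_join([A1, A2], [B1, B2], child_only=True))
```

For A//B all four (A, B) pairs qualify, while for A/B only the inner A with the outer B qualifies.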

XISS - KC-join
- to solve the Kleene closure of a subexpression
- input: a list of element records that fit the base case
- repeatedly applies EE-join to the list, and stops when the result list no longer grows
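The fixpoint iteration can be sketched as below. This is a simplified illustration, not the published KC-join: the parent-child step is computed with a nested loop instead of a sort-merge, results are kept as (start, end) pairs connected by one or more steps, and the `Rec` layout is hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Rec(NamedTuple):
    did: int    # document ID
    order: int
    size: int
    depth: int


def is_parent(e: Rec, f: Rec) -> bool:
    return (e.did == f.did
            and e.order < f.order <= e.order + e.size
            and f.depth == e.depth + 1)


def kc_join(base: List[Rec]) -> Set[Tuple[Rec, Rec]]:
    """Join the base list with itself until no new pairs appear."""
    # One parent-child step within the base list (nested loop for brevity).
    step = {(e, f) for e in base for f in base if is_parent(e, f)}
    closure = set(step)
    frontier = set(step)
    while frontier:
        # Extend every newly found chain by one more step.
        new = {(a, d) for (a, b) in frontier for (c, d) in step if b == c}
        new -= closure
        closure |= new
        frontier = new
    return closure


# Made-up records: A1 contains A2 and A4; A2 contains A3.
A1, A2 = Rec(1, 1, 10, 0), Rec(1, 2, 5, 1)
A3, A4 = Rec(1, 3, 2, 2), Rec(1, 9, 1, 1)
closure = kc_join([A1, A2, A3, A4])
print(sorted(closure))
```

The loop ends after the round in which no new pair is produced, matching the stopping rule on the slide.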

Index Fabric
- by Cooper et al., published in VLDB 2001 under the title “A Fast Index for Semistructured Data”
- has 2 subtypes: raw path index and refined path index
- uses the Patricia technique to compress the index

Index Fabric - General Idea
- it is a disk-based, balanced indexing structure based on Patricia
- each data node is associated with a key string, and this string is stored in the trie index for retrieval
- the layered approach to building the index ensures that the number of disk pages accessed per query is bounded

Index Fabric - General Idea
- the raw path index answers absolute path queries
- the refined path index answers any predefined queries
- the difference lies in how the keys are generated

Patricia
- Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric
- by Morrison, in JACM 1968
- a method to store and retrieve strings in a space-efficient way
- binary, uses bit comparisons, has a “skip” in each internal node

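Patricia
- an example Patricia trie
[Figure: internal nodes labelled with skip values (2 at the root, 5 and 4 below it), edges labelled 0 or 1, and leaves holding the keys 101110, 101111, 110000 and 110011]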

Patricia
- it is basically a trie with single-child internal nodes removed
- search is done by:
  - branching according to the value of the bit at the skip position
  - retrieving the string at the leaf
  - comparing it with the query string
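The search procedure can be sketched over the four example keys. The representation is an assumption of this sketch: an internal node is a `(bit, zero_child, one_child)` tuple, a leaf is the key string itself, and bit positions are 0-based, which need not match the labelling used in the figure. `build` only handles distinct keys of equal length.

```python
def build(keys):
    """Build a binary Patricia-style trie from distinct, equal-length bit strings."""
    if len(keys) == 1:
        return keys[0]  # leaf: the full key
    # First bit position on which the keys disagree; earlier bits are skipped.
    bit = next(i for i in range(len(keys[0])) if len({k[i] for k in keys}) > 1)
    zeros = [k for k in keys if k[bit] == "0"]
    ones = [k for k in keys if k[bit] == "1"]
    return (bit, build(zeros), build(ones))


def search(node, query):
    """Branch on the tested bits only, then verify against the leaf's key."""
    while not isinstance(node, str):
        bit, zero, one = node
        if bit >= len(query):
            return False
        node = one if query[bit] == "1" else zero
    # The skipped bits were never inspected, so the final comparison is required.
    return node == query


trie = build(["101110", "101111", "110000", "110011"])
print(trie)
print(search(trie, "110011"), search(trie, "111111"))
```

The query `111111` reaches the leaf `110011` because only two bits were tested on the way down; the comparison at the leaf rejects it.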

Index Fabric - Balanced Trie
- the number of disk pages accessed per query is bounded by the number of layers in the layered index
- the idea is similar to that of the B-tree: the Patricia trie is decomposed into blocks, and an upper-layer trie is used to navigate between the blocks

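Index Fabric - Balanced Trie
[Figure: the example Patricia trie (skip values 2, 5, 4; keys 101110, 101111, 110000, 110011) shown as Layer 0, with a smaller Layer 1 trie above it (node labels 2 and 1)]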

Index Fabric - Balanced Trie
- There are 3 types of links in the balanced trie:
  - far link: across layers, a result of branching
  - near link: within the same block, a result of branching
  - direct link: across layers, the root nodes are the same
- Each query accesses 1 block in each layer

Index Fabric - Balanced Trie
- increases speed by using traversals in the upper layers to skip nodes of the original trie
- the number of pages accessed is bounded

Index Fabric - Raw Path
- each data node is associated with a key
- key = path (encoded in designators) + value
- designators are special characters, each representing a name
- APE queries are translated into key prefixes and submitted to the index trie

Index Fabric - Raw Path
- Example:
  - <invoice><buyer><name>HKU</name></buyer></invoice> is translated to IBNHKU (I, B and N are the designators; they were bolded and underlined on the original slide)
  - the query /invoice/buyer/name[“HKU”] is translated to the query string IBNHKU
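The key encoding and the prefix lookup can be sketched as follows. A sorted list with binary search stands in for the trie; the `seller` designator and the extra keys (CUHK, ACME) are made up for illustration, and plain letters are used as designators here even though the real ones are special characters that cannot be confused with value text.

```python
import bisect

# Designator per element name; "S" for seller is a made-up addition.
DESIGNATORS = {"invoice": "I", "buyer": "B", "seller": "S", "name": "N"}


def encode(path, value=""):
    """Encode /a/b/c plus an optional value as designators followed by the value."""
    steps = [s for s in path.split("/") if s]
    return "".join(DESIGNATORS[s] for s in steps) + value


def prefix_search(sorted_keys, prefix):
    """Return all keys starting with prefix (stand-in for a trie lookup)."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[lo:]:
        if not key.startswith(prefix):
            break
        out.append(key)
    return out


keys = sorted([
    encode("/invoice/buyer/name", "HKU"),
    encode("/invoice/buyer/name", "CUHK"),
    encode("/invoice/seller/name", "ACME"),
])

print(prefix_search(keys, encode("/invoice/buyer/name")))
print(prefix_search(keys, encode("/invoice/buyer/name", "HKU")))
```

A path-only query becomes a prefix search, while a path with a value becomes an exact key lookup; both are a single probe of the index.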

Index Fabric - Refined Path
- special designators can be assigned to special queries (which can be regular path expressions)
- e.g., if we define P as the path //buyer/name, then PHKU means that the document contains a buyer/name with value HKU
- can answer any predefined RPE very quickly

Comparison
- XISS
  - can solve general RPEs
  - solves an APE by dividing it into steps
- Index Fabric
  - RPEs are solved by compile-time expansion of the RPE or by using a predefined Refined Path Index
  - solves an APE by a single index lookup